Uncovering History with R - A Look at the HistData Package
What you will learn
- Understand the R HistData package, its various datasets, and how these datasets can enrich your research in the humanities and social sciences;
- Be able to load datasets and explore summary statistics;
- Be confident in visualizing relations and quantities with HistData and ggplot2;
- Explore the Nightingale dataset and replicate a historical visualization using ggplot2.
Table of Contents
- Data source
- Coding the past: exploring HistData
“Historians offer us systems of the past that are too complete, series of causes and effects that are too exact and too clear to have ever been entirely true.”
Marguerite Yourcenar - Mémoires d`Hadrien (1974)
Greetings, humanists, social and data scientists! Are you curious about how data analysis can enrich your research and understanding? Look no further! Today, we explore the world of historical data analysis using R’s powerful package: HistData. This package contains a collection of more than 30 datasets that can be used to explore historical trends and patterns.
HistData is an R package that provides a collection of 31 small datasets that are part of a program of research known as statistical historiography, that is, “the use of statistical methods to study problems and questions in the history of statistics and graphics” (Friendly, 2021). They can, of course, be used to study other topics in the humanities and social sciences. Thank you to the authors Michael Friendly, Stephane Dray, Hadley Wickham, James Hanley, Dennis Murphy, Peter Li for this wonderful compilation of datasets! Here are some of the data included.
|Arbuthnot||Arbuthnot’s data on male and female birth ratios in London from 1629-1710|
|Armada||The Spanish Armada|
|Bowley||Bowley’s data on values of British and Irish trade, 1855-1899|
|Cavendish||Cavendish’s 1798 determinations of the density of the earth|
|ChestSizes||Quetelet’s data on chest measurements of Scottish militiamen|
|Cholera||William Farr’s Data on Cholera in London, 1849|
|CushnyPeebles||Cushny-Peebles data: Soporific effects of scopolamine derivatives|
|Dactyl||Edgeworth’s counts of dactyls in Virgil’s Aeneid|
|DrinksWages||Elderton and Pearson’s (1910) data on drinking and wages|
|Fingerprints||Waite’s data on Patterns in Fingerprints|
|Galton||Galton’s data on the heights of parents and their children|
|GaltonFamilies||Galton’s data on the heights of parents and their children, by family|
|Guerry||Data from A.-M. Guerry, "Essay on the Moral Statistics of France"|
|HalleyLifeTable||Halley’s Life Table|
|Jevons||W. Stanley Jevons’ data on numerical discrimination|
Coding the past: exploring HistData
1. How to install the package in R?
To get started with HistData, you will first need to install and load the package into your R environment. We will additionally load other necessary libraries. You can do this using the following commands:
You can access descriptions for each dataset using the
help(DataSet) command. Moreover, running
example(DataSet) will, in most cases, demonstrate applications similar to their historical use.
2. How to load and rename a dataset from HistData?
To demonstrate HistData’s capabilities, we will use the
Nightingale dataset. This dataset contains the monthly number of deaths from various causes in the British Army during the Crimean War (1853-1856). The data was collected by Florence Nightingale, a British nurse who became famous for her work in the Crimean War. She was also a pioneer in the use of data visualization to communicate information.
To load the
Nightingale dataset, we use the
data() function. It will be automatically named
Nightingale in our environment. However, we can load it into a dataframe with a different name, such as
df, using the following command:
To check the structure of the dataset, use the
str(Nightingale). This will show you the number of observations and variables, as well as the type of data in each column. You will see that the dataset has 10 variables:
Other.rate. There are 24 observations for each variable. We will focus on the following variables:
Date: the date of the observation;
Year: the year of the observation;
Disease: the number of deaths from preventable or mitigable zymotic diseases;
Wounds: the number of deaths directly from battle wounds;
Other: the number of deaths from other causes.
3. Explore the dataset
First, let’s create a new variable
Total with the total amount of deaths per period. We can do this by adding the
Other variables. This is done with the
mutate() function available in the
dplyr package. We can then use the
summarise() functions to calculate the average number of deaths per year.
We can see that the average number of deaths was 688 in 1854; reached a peak of 967 deaths in 1855; and decreased in 1856.
To visualize the trend of deaths over time, we can use
geom_line() function from the
ggplot2 package. The
Date variable should be mapped to the x-axis and the
Total variable to the y-axis. We can use the
labs() function to add a title and labels to the x and y axes. Note that
theme_coding_the_past() is a custom theme that I created in the lesson ‘Climate data visualization’ to make the plot match the blog theme. You can use the default theme or create your own.
Florence Nightingale hypothesized that deaths in war hospitals were more frequently caused by poor sanitary conditions than by the war injuries themselves. As a result of Nightingale’s reports and persistent advocacy, a Sanitary Commission was dispatched in March 1855 to enhance hygiene standards, improve ventilation, and introduce preventive measures such as handwashing. To evaluate whether the death rates declined following the arrival of the Sanitary Commission, Florence took a progressive approach for her time. She analyzed and visualized data!
4. Florence’s approach: a Coxcomb chart
To better understand how the number of deaths evolved during the war period, Florence utilized a Coxcomb plot. This plot is similar to a pie chart, but all sectors have equal angles, differing in how far they extend from the center of the circle.
The figure below presents the Coxcomb plot that Florence created. The plot illustrates the number of deaths per month and cause. The radius of each sector is proportional to the number of deaths. There are 12 sectors, each representing a month of the year, starting from the left (April 1854) and proceeding in a clockwise direction until March 1855, thus completing the circle. Each sector is further divided by color, indicating the cause of death. Florence split the data into two different visualizations: one for before the arrival of the Sanitary Commission (plot on the right), and one for after (plot on the left).
In the next sections, we will replicate Nightingale’s Coxcomb plot using the
ggplot2 package. Before that, we will use the
pivot_longer() function from the
tidyr package to transform the data from a wide to a long format. This will allow us to visualize the trend of deaths by cause over time.
5. How to use R pivot_longer?
pivot_longer() function, from the
tidyr package, enables us to convert our data from a wide format to a long one. This process will create a new variable named
Cause with the values
Other, all of which were previously separate variables. The death counts corresponding to each cause will now be housed in a new variable labeled
cols argument specifies the variables to be transformed, while the
values_to arguments specify the names of the new variables. The figure below illustrates the transformation from wide to long format.
The following code first selects the relevant variables and then carries out the wide to long transformation. Moreover, we create one dataset for the data before the Commission’s arrival and another one for after the arrival. We will use these datasets to replicate Florence’s Coxcomb plot in the next section.
6. Replicate Florence’s Coxcomb plot with ggplot2
The code below replicates Nightingale’s Coxcomb plot.
Date is mapped to the x-axis and the
Deaths to the y-axis. The
fill parameter is dependent on
geom_bar() layer creates a stacked bar chart. The
scale_y_sqrt() function transforms the y-axis into a square root scale to better visualize the differences between the number of deaths by cause. Note also that the
limits argument guarantees that before and after y-axis will have the same scale: from 0 to 3000 deaths. Thus we can compare the two plots more easily.
coord_polar() converts the bar chart into a Coxcomb plot. The ´start´ argument sets the offset of starting point from “12 o’clock” in radians. We set it to 3*pi/2 (270°) to start at “9 o’clock” to replicate Florence’s choice. The
ggtitle() function adds a title to the plot. The
scale_fill_manual() function sets the colors of the sectors. The
theme() adapts the plot to fit the style of this page. The exact same plot is created for the data after the arrival of the Sanitary Commission.
In the plots below, we can confirm that the number of deaths from preventable diseases, in pink, is a lot larger before the arrival of the Sanitary Commission. On the other hand, the number of deaths from wounds and other causes, in blue and green, respectively, is smaller and relatively stable.
7. A line plot instead of a Coxcomb plot
And what if we used a line plot to analyze the same data? The code below employs geom_line to achieve that. Note that geom_vline adds a vertical line at the date of the arrival of the Sanitary Commission.
Once again, this plot illustrates that mortality decreased following the arrival of the Sanitary Commission. Notably, deaths from preventable diseases, represented by the pink line, experienced a significant reduction. It’s crucial to bear in mind, however, that before-and-after analyses do not provide definitive proof of causality. Nevertheless, the evidence showcased in both the line plot and the Coxcomb plot strongly indicates that the implemented measures likely drove the observed decrease in fatalities.
What are your thoughts? Do you prefer the Coxcomb plot or the line plot? In your opinion, which visualization is the most effective? I welcome your feedback in the comments below!
HistDatais a package that provides a collection of 31 small datasets that can be used to explore historical trends and patterns;
- A Coxcomb chart is similar to a pie chart, but all divisions have equal angles, differing in how far they extend from the center of the circle;
- Different plots can be used to visualize the same data. The choice of visualization depends on the type of data and the message you want to convey.