Discovering historical datasets for your research

Bruno Ponne ● 10 Jan 2023

  • Easy
  • Python
  • R
  • 6 min

What you will learn

  • Get to know good sources of historical data;
  • Be able to load data in R and Python;


‘Information is not knowledge’
Albert Einstein

With so much data available nowadays, I frequently feel overwhelmed when I have to find data to study a subject. Is this dataset reliable? How was the data treated? Where can I find the codebook with detailed information on the variables? These are only some of my concerns. When it comes to historical data, the task can be even harder. In this lesson, you will learn about fascinating and reliable websites to find historical data for your research. Moreover, you will learn how to load data in Python and R.

Historical Data in Research - Where to find it?

1. Harvard Business School

The Harvard Business School developed the project ‘Historical Data Visualization’ to foster the understanding of global capitalism throughout time. The page offers more than 40 datasets about a broad range of topics. For instance, you can find data on life expectancy, literacy rates or economic activity in several countries during the 19th and 20th centuries. Datasets are mostly in Excel format. Definitely worth a visit!

2. Human Mortality Database

The Human Mortality Database (HMD) provides death rates and life expectancy for several countries over the last two centuries. Even though the platform requires a quick registration before you can access the data, it is comprehensive and straightforward to understand. Datasets are in tab-delimited text (ASCII) files.

3. National Centers for Environmental Information

Would you like to study how climate has changed over the last centuries? Then this is an invaluable source for you! The National Centers for Environmental Information is the leading authority for environmental data in the USA and provides high quality data about climate, ecosystems and water resources. Data files can be downloaded in comma separated values format.

4. Clarin Historical Corpora

If you wish to work with text data, this is a valuable source of material. It offers access to ancient and medieval Greek texts, the manifestos written during the American Revolution, court proceedings in England in the 18th century and many other intriguing materials. Files are usually provided in .txt format. The requirements to access files vary from case to case, since the data comes from different institutions.

5. Slave Voyages

This impressive platform, supported by the Hutchins Center of Harvard University, gathers data regarding the forced relocations of more than 12 million African people between the 16th and 19th centuries. Files are provided in SPSS or comma separated values format.

6. HistData Package

This source is not actually a website but an R package. It provides a collection of 31 small datasets relating to several historical events over the last centuries. The package seeks to bridge the realms of history and statistics, offering tools to analyze historical problems and questions with statistical rigor. The Guerry dataset, for instance, provides social data from the 1830s French departments. Nightingale details the monthly number of deaths from various causes in the British Army during the Crimean War (1853-1856). There is a whole lesson about this package. Check it out: ‘Uncovering History with R - A Look at the HistData Package’

Coding the past: how to load data in Python

1. Pandas read_csv()

In this section, you will learn to load data into Python. You will be using data provided by the Slave Voyages website. The dataset contains data regarding 36,108 transatlantic slave trade voyages. Learn more about the variables here.

To load our data in Python, we will use Pandas, a Python library that provides data structures and analysis tools. The Pandas function read_csv() is the ideal option to load comma separated values into a dataframe. A dataframe is one of the data structures provided by Pandas and it consists of a table with columns (variables) and rows (observations). Below, we use the default configuration of read_csv() to load our data. Note that the only parameter passed to the function is the file path where you saved the dataset.

import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/historical_data/tastdb-exp-2019.csv")
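If you do not have the file at hand yet, read_csv() can be tried on a few lines of text instead. The snippet below is only a minimal sketch: the rows are illustrative stand-ins rather than real records, and the VOYAGE column name is a hypothetical placeholder. It also shows that a column with a missing value is loaded as floating point numbers.

```python
import io
import pandas as pd

# Toy stand-in for the real CSV file; the rows are illustrative only,
# and "VOYAGE" is a hypothetical column name used for this sketch.
sample_csv = io.StringIO(
    "VOYAGE,YEARAM,SLAMIMP\n"
    "1,1817,290\n"
    "2,1817,223\n"
    "3,1817,\n"  # the last field is empty, so it is read as a missing value
)

# read_csv() accepts a file path or any file-like object.
sample = pd.read_csv(sample_csv)

print(sample)
# Columns containing missing values are stored as floats (NaN is a float).
print(sample.dtypes)
```

To work with the real dataset, replace the file-like object with the path to your own copy of the file.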

2. Getting pandas dataframe info

A dataframe object is now created. It has several attributes or characteristics. For example, we can check its dimensions with shape and its column names with columns. Note that column names are the names of our variables. Moreover, you can also call methods, which, in general, carry out an operation to analyze the data contained in the dataframe. For example, the method describe() calculates summary statistics of each variable and head() filters and displays only the first n observations of your data. Check all Pandas DataFrame attributes and methods here.

Pandas DataFrame Object
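The difference between attributes and methods is easier to see on a tiny example. The sketch below builds a toy dataframe by hand (the values are illustrative, not the full dataset): attributes such as shape and columns are read without parentheses, while methods such as head() and describe() are called with them.

```python
import pandas as pd

# Toy dataframe with illustrative values, not the real dataset.
toy = pd.DataFrame({
    "YEARAM": [1817, 1817, 1818, 1819],
    "SLAMIMP": [290.0, 223.0, 350.0, 342.0],
})

# Attributes: no parentheses. shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
names = list(toy.columns)

# Methods: called with parentheses, optionally with arguments.
first_two = toy.head(2)   # first two observations
stats = toy.describe()    # summary statistics per numeric column

print(n_rows, n_cols, names)
print(first_two)
print(stats)
```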

Use the following code to check the dimensions and variable names of the dataset:

print("Dimensions: ", df.shape, 
      "Variable names: ", df.columns)

The attributes show that there are 276 variables and 36,108 observations in this dataset. Let us suppose you are only interested in the number of slaves disembarked (SLAMIMP) in America per year (YEARAM). You could load only these two variables using the read_csv() parameter usecols. This parameter receives a list with the names of the variables you wish to load. In larger datasets this parameter is very handy, because it avoids loading variables that are not relevant to your study.

df = pd.read_csv("/content/drive/MyDrive/historical_data/tastdb-exp-2019.csv",
                 usecols=['YEARAM', 'SLAMIMP'])
df.head()  # display the first five observations


   SLAMIMP  YEARAM
0    290.0    1817
1    223.0    1817
2    350.0    1817
3    342.0    1817
4    516.0    1817

Now the dataframe is loaded only with the two specified variables. As mentioned, Pandas dataframes offer tools to analyze the data through DataFrame methods. Above, we use the method head() to display the first five observations in our dataframe. You can set how many observations head() should return through the n parameter (default is 5).
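One detail worth knowing: the output above lists the number of people before the year, even though the list passed to usecols names YEARAM first. That is because read_csv() ignores the order of the names in usecols and keeps the columns in the order in which they appear in the file. The sketch below illustrates this on a few illustrative rows (VOYAGE is a hypothetical column name, and the column order of the toy file is an assumption made for the example).

```python
import io
import pandas as pd

# Toy CSV with illustrative rows; "VOYAGE" is a hypothetical column.
csv_text = (
    "VOYAGE,SLAMIMP,YEARAM\n"
    "1,290,1817\n"
    "2,223,1817\n"
    "3,350,1817\n"
)

# Only the two listed columns are loaded; VOYAGE is skipped.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["YEARAM", "SLAMIMP"])

# The column order follows the file, not the usecols list.
print(list(subset.columns))

# To get a specific order, select the columns explicitly after loading.
reordered = subset[["YEARAM", "SLAMIMP"]]

# head(n) returns the first n observations.
print(reordered.head(2))
```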

Moreover, we can use describe() to obtain summary statistics of our variables. From the summary statistics we can see that the earliest record is from the year 1514 and the latest one is from 1866. Also, the maximum number of slaves traded in one voyage was 1,700.

        SLAMIMP    YEARAM
count  34182.00  36108.00
mean     269.24   1764.33
std      137.32     59.47
min        0.00   1514.00
25%      177.00   1732.00
50%      261.00   1773.00
75%      350.00   1806.00
max     1700.00   1866.00
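Note that the two counts in the table differ (34,182 versus 36,108). The count row of describe() only includes non-missing values, so this suggests that some voyages have no number of disembarked people recorded. A minimal sketch with illustrative values shows the behaviour and how to count the missing entries directly:

```python
import numpy as np
import pandas as pd

# Toy dataframe with one missing SLAMIMP value (illustrative values only).
toy = pd.DataFrame({
    "SLAMIMP": [290.0, np.nan, 350.0, 342.0],
    "YEARAM": [1817, 1817, 1818, 1819],
})

# describe() counts only the non-missing observations in each column.
summary = toy.describe()
print(summary.loc["count"])

# isna().sum() gives the number of missing values per column.
missing = toy.isna().sum()
print(missing)
```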

Coding the past: how to import a dataset in R

In R there are several functions that load comma separated files. I chose fread from the data.table package, because it offers a straightforward parameter to select the variables you wish to load (select). fread returns a data.table, which is also a data frame and is similar to a pandas dataframe.


library(data.table)  # provides fread
df <- fread("tastdb-exp-2019.csv", 
            select = c("YEARAM","SLAMIMP"))

To get summary statistics about your variables you can use the function summary(df). To view the first n observations of your data frame, use head(df, n). In R, summary and head produce results very similar to those of describe and head in Python.

More posts on how to find reliable data will be published soon!


