coding-the-past

Exploring the MET API with Python - Francisco Goya’s Artworks

2026-04-16T00:00:00+00:00

The act of painting is about one heart telling another heart where he found salvation.

— Francisco Goya

Francisco Goya is one of my favorite artists. His work has a beautiful darkness that tells a lot about his experience in his time. In this post, we’ll dive into his world using the Metropolitan Museum of Art (MET) application programming interface (API), which gives developers access to data on hundreds of thousands of artworks.

You will learn how to interact with the MET API using Python. We will journey through the process of making HTTP requests, parsing the returned JSON data into a structured pandas DataFrame, and exploring the collection to extract meaningful insights about Goya’s work.

1. Requesting data from the API

We begin by importing the requests library, which allows us to send HTTP requests to the MET REST API in Python. We’ll query the search endpoint to find Goya’s paintings. In API terms, an endpoint is a specific URL used to access a particular resource.

The MET API has four endpoints starting with “https://collectionapi.metmuseum.org/”:

GET /public/collection/v1/objects returns a listing of all valid objectID available to use.
GET /public/collection/v1/objects/[objectID] returns a record for an object, containing all open access data about that object, including its image (if the image is available under Open Access).
GET /public/collection/v1/departments returns a listing of all departments of the museum.
GET /public/collection/v1/search returns a listing of all objectID for objects that match the search query.

You can find more details about each endpoint and its functionality in the official MET API documentation.


A REST (Representational State Transfer) API is a set of rules used to communicate between your computer and the MET server using HTTP methods and endpoints. Note that many APIs require authentication; however, the MET API is public and does not require an API key.


import requests
import pandas as pd

search_query = "https://collectionapi.metmuseum.org/public/collection/v1/search?hasImages=true&q=Francisco Goya"

response = requests.get(search_query)
search_data = response.json()

print(f"Found {search_data['total']} artworks for Francisco Goya.")

API endpoints can be followed by query parameters that refine our search. In the example above, hasImages=true filters for objects with images, and q specifies our search term—in this case, the artist’s name.

The requests library contains a method called get(), which we use to send our request to the API, passing our endpoint saved in the string search_query.

The resulting response object can then be parsed into a JSON structure using the .json() method.
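The query string can also be assembled from a dictionary instead of being typed by hand, which takes care of encoding the space in the artist's name. The sketch below uses only Python's standard library, so it runs without a network connection; the helper name build_search_url is hypothetical and not part of the MET API or of requests.

```python
from urllib.parse import urlencode

BASE_URL = "https://collectionapi.metmuseum.org/public/collection/v1/search"

def build_search_url(term, has_images=True):
    """Build the search URL from a dictionary of query parameters (hypothetical helper)."""
    params = {
        "hasImages": str(has_images).lower(),  # the API expects "true" or "false"
        "q": term,                             # the search term, e.g. the artist's name
    }
    # urlencode joins the pairs with "&" and escapes characters such as spaces
    return f"{BASE_URL}?{urlencode(params)}"

search_url = build_search_url("Francisco Goya")
print(search_url)
```

With requests, passing the same dictionary through the params argument of requests.get() should have an equivalent effect, so the URL never has to be concatenated manually.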

2. Converting JSON to a list of painting ids

While JSON is the standard for data exchange, working with raw JSON can be cumbersome for direct data analysis. In Python, you can think of JSON as a dictionary of keys and values. These values can themselves be other dictionaries, lists, numbers, strings, or booleans. By printing the search_data object, we can see that it’s a dictionary containing two main keys:

total: An integer representing the total number of objects returned.
objectIDs: A list containing the unique IDs of the artworks matching our search.

To retrieve the list of IDs associated with the key “objectIDs” we use the standard dictionary notation search_data["objectIDs"] and save it to the variable goya_ids.
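As a small offline illustration of that dictionary structure, the sketch below parses a hand-written JSON string shaped like the search response. The IDs are made up for the example, and the fallback to an empty list is an assumption for the case where a search returns no IDs.

```python
import json

# Hand-written sample shaped like the search response (the IDs are hypothetical)
sample_response = '{"total": 3, "objectIDs": [101, 102, 103]}'

search_data = json.loads(sample_response)  # JSON text -> Python dictionary

# .get() returns None if the key is missing; "or []" also covers a null value
goya_ids = search_data.get("objectIDs") or []

print(type(search_data).__name__, len(goya_ids))
```

Falling back to an empty list means a later for loop over goya_ids simply does nothing instead of raising an error.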


print(search_data)
goya_ids = search_data["objectIDs"]

3. Getting the details of each of Goya’s works

To retrieve details for each artwork — such as its title, date, and thematic tags — we need to iterate through the list of IDs and send a request to the /objects/{objectID} endpoint for each item. We implement this using a for loop that repeats the request for each artwork.

(Note: Depending on the number of results, fetching these details can take a few minutes. We use time.sleep(1) to respect the API’s rate limits and avoid being blocked.)


import time

all_objects_data = []


for object_id in goya_ids:
    try:
        obj_response = requests.get(f"https://collectionapi.metmuseum.org/public/collection/v1/objects/{object_id}")
        obj_response.raise_for_status() 
        all_objects_data.append(obj_response.json())
    except requests.exceptions.RequestException as e:
        print(f"Error for object ID {object_id}: {e}")
    
    time.sleep(1) # Respect the API, one request per second to be safe

# Convert the gathered data to a DataFrame
goya_df = pd.json_normalize(all_objects_data)

# Filter only Goya works
goya_df = goya_df[goya_df['artistDisplayName'].str.contains('Goya', na=False)]

We use a try-except block to ensure the loop continues even if a specific object ID fails to load. We also log any errors to help with debugging.
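The same pattern can be wrapped in a function that also remembers which IDs failed, so they can be retried later. This is only a sketch: fetch_all and fake_fetch are hypothetical names, and fake_fetch stands in for the real HTTP call so the example runs offline.

```python
import time

def fetch_all(object_ids, fetch_one, delay=1.0):
    """Fetch every ID, skipping failures instead of stopping the loop."""
    records, failed = [], []
    for object_id in object_ids:
        try:
            records.append(fetch_one(object_id))
        except Exception as error:  # with requests, catch requests.exceptions.RequestException
            print(f"Error for object ID {object_id}: {error}")
            failed.append(object_id)
        time.sleep(delay)  # pause between requests
    return records, failed

def fake_fetch(object_id):
    """Stand-in for the HTTP request: ID 2 simulates a failing record."""
    if object_id == 2:
        raise ValueError("simulated failure")
    return {"objectID": object_id, "title": f"Artwork {object_id}"}

records, failed = fetch_all([1, 2, 3], fake_fetch, delay=0)
print(len(records), failed)
```

Passing the fetching function as an argument keeps the loop easy to test; in real use it would be a small function that calls requests.get() and returns the parsed JSON.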

Finally, we convert the collected data into a Pandas DataFrame using pd.json_normalize. Since a broad search might return works about Goya or mentioning him in metadata, we filter the DataFrame to ensure the artistDisplayName actually contains “Goya.”

The resulting DataFrame contains intriguing data about each of his works, including name, year when the painting or drawing was started and finished, descriptive tags and dimensions, among other information. Feel free to explore it. We will continue working with the descriptive tags in the next steps.

4. Flattening nested JSON data

For keys whose values are lists or other dictionaries, the resulting columns will contain those respective objects. This happens, for example, with the tags column. When you have nested elements like this, you can “flatten” them into a tabular format.

JSON data structure

Flattening an element changes the granularity of the data. Whereas before each row represented a single artwork, in the flattened table each row represents an individual tag belonging to one artwork.

To flatten these nested tags, we can use json_normalize by specifying the element to unnest in the record_path. We also include the objectID in the meta parameter so we don’t lose the relationship between a tag and its original artwork. Later on, we can join this tags table back to our main DataFrame if we want.


tags_df = pd.json_normalize(
    all_objects_data,
    record_path='tags',
    meta=['objectID']
)
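To see what this flattening does to the granularity, here is a rough plain-Python equivalent on three made-up records. The titles and tags are hypothetical, and the third record assumes that an artwork may come back without any tags.

```python
# Hypothetical records shaped like the object responses (values are made up)
sample_objects = [
    {"objectID": 1, "title": "Artwork A", "tags": [{"term": "Men"}, {"term": "Bulls"}]},
    {"objectID": 2, "title": "Artwork B", "tags": [{"term": "Women"}]},
    {"objectID": 3, "title": "Artwork C", "tags": None},  # assumed: no tags available
]

# One output row per (artwork, tag) pair; objectID plays the role of the meta column
tag_rows = [
    {"objectID": record["objectID"], "term": tag["term"]}
    for record in sample_objects
    for tag in (record.get("tags") or [])
]

for row in tag_rows:
    print(row)
```

Three artworks become three tag rows here only by coincidence: the first artwork contributes two rows, the second one row, and the third none.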

5. Visualizing the most frequent themes

The MET API provides a tags field containing descriptive terms associated with each artwork. To understand the prevailing themes in Goya’s works — famous for documenting the social upheaval and dark realities of his era — we can extract these terms and calculate their frequency.

Once we isolate the individual tags into a new column, we can use matplotlib to create a horizontal bar plot of the top 10 terms and check if indeed his artwork contained themes related to death and misery.


import matplotlib.pyplot as plt

# Calculate the frequency of each term for the filtered Goya artworks
# We filter tags_df to only include IDs present in our filtered goya_df
term_frequency = tags_df[tags_df['objectID'].isin(goya_df['objectID'])]['term'].value_counts().reset_index()
term_frequency.columns = ['term', 'count']

# Select the top N terms for better readability if there are many unique terms
# For this example, let's take the top 10 terms
top_terms = term_frequency.head(10).sort_values(by='count', ascending=True)

plt.figure(figsize=(12, 8))
plt.barh(top_terms['term'], top_terms['count'], color='#FF6885')

plt.title('Top 10 Most Frequent Terms in Goya Dataset', fontsize=20)
plt.xlabel('Frequency', fontsize=16)
plt.ylabel('Term', fontsize=16)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.tight_layout()
plt.show()
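The counting step performed by value_counts() can be checked on a small scale with the standard library. The list of terms below is made up for illustration and does not come from the API.

```python
from collections import Counter

# Hypothetical tag terms, one entry per (artwork, tag) row
terms = ["Men", "Bulls", "Men", "Women", "Death", "Men", "Bulls"]

# Counter tallies each term; most_common(n) returns the n most frequent, largest first
top_terms = Counter(terms).most_common(2)

print(top_terms)
```

Each pair holds a term and its count, which is the same information the bar chart displays as a label and a bar length.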

Top 10 Most Frequent Terms chart

The resulting visualization provides a fascinating window into Goya’s thematic world. Beyond common subjects like “Men,” “Women,” and “Portraits,” we see a strong representation of “Bulls” (reflecting his famous Tauromaquia series) and “Self-portraits.”

Most strikingly, terms like “Death” and “Suffering” appear prominently in the top 10. This data-driven insight supports Goya’s historical reputation as an artist who didn’t shy away from the darker aspects of the human experience. By quantifying these themes through the MET API, we move from subjective observation to empirical evidence of his artistic focus, at least within the works held by the MET.

Plate 43 from "Los Caprichos": The sleep of reason produces monsters (El sueño de la razon produce monstruos)

You could also use the main dataset we created to collect a series of images of Goya works. I am thinking of using AI to help me download all images of Goya in the public domain and try to build a model to describe or classify them in Python. Feel free to use the data and let me know about your analysis. Leave your comments or any questions below and happy coding!

Conclusions

The requests library combined with pd.json_normalize makes extracting and structuring data from web APIs both seamless and efficient.
Navigating public collections like the MET API enables us to perform large-scale data analysis on historical and cultural artifacts.
Combining data extraction with clear visualizations (using Matplotlib) provides interpretable insights into an artist’s thematic legacy and creative focus.

Data Science Quiz For Humanities

2025-11-22T00:00:00+00:00

Test your skills with this interactive data science quiz covering statistics, Python, R, and data analysis.

T test in R

2025-09-21T00:00:00+00:00

In this post, you will learn what a T Test is and how to perform it in R. First, you’ll see a simple function that lets you perform the test with just one line of code. Then, we will explore the intuition behind the test, building it step by step with data about the Titanic passengers. Enjoy the reading!

1. What is a T-Test?

A t-test is a statistical procedure used to check whether the difference between two groups is significant or just due to chance. In this post, we’ll look at data from Titanic passengers, dividing them into males and females. Suppose we want to test the hypothesis that men and women had the same average age. If our data shows that women were, on average, 2 years younger than men, we need to ask: is this a real difference, or could it have happened randomly? The t-test helps us answer this question.

2. Why is a T-Test important?

A t-test is important when we want to draw conclusions about a population based on a sample. For example, imagine we are studying the demographics of ship passengers at the beginning of the twentieth century and want to use the Titanic sample to generalize findings to a broader population of passengers.

Of course, such inferences may be biased, since Titanic passengers might not perfectly represent all ship passengers of that era. Nevertheless, the sample can still provide valuable insights, as long as the context of both the sample and the population is carefully considered and clearly explained.

3. The Titanic passengers

We are going to use the titanic R library to access data about Titanic passengers. Specifically, we will work with a subset of passengers contained in the titanic_train dataset. Below, you will find the code to load the data, calculate the mean and standard deviation of age for males and females, and show how many passengers are men and women.


    
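library(titanic)
library(dplyr)    # provides %>%, select(), group_by() and summarize()
library(ggplot2)  # used for the plots below
data('titanic_train')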
df <- titanic_train %>% 
    select(Sex, Age) %>% 
    na.omit()

df %>% group_by(Sex) %>% 
    summarize(mean(Age), sd(Age), n())
    

  

Sex	mean(Age)	sd(Age)	n
female	27.9	14.1	261
male	30.7	14.7	453

We can see that there is a difference of 2.8 years between the average age of men and women on the Titanic. Below, you can also check the distribution of ages.


ggplot(df, aes(x = Age, color = Sex))+
  geom_density(size = 0.7)+
  scale_color_discrete("")+
  xlab("Age")+
  ylab("Density")

Indeed, the two distributions look very similar. To find out whether the difference in means is more than chance, our best option is to carry out a T test.

4. T test in R

A T test can be performed in R very easily. There is a function called t.test whose first argument is a formula. In our case, we would like to know how age varies across genders. Thomas Leeper wrote a very clear explanation about formulas on this page. What matters for us is that the formula is composed of a dependent variable on the left (Age), followed by “~” and one or more independent variables on the right (Sex). The second argument is simply the dataframe with the data we want to test. This test assumes that the two samples are independent and that age is approximately normally distributed, which the density plot above suggests is reasonable.


t.test(Age ~ Sex, data = df)

How to interpret these results?

The p-value of 0.0118 means that if there were truly no difference in the average age between male and female passengers (i.e., if the null hypothesis were true), there would be only a 1.18% chance of observing a difference as large as the one we found or larger. Since this p-value is less than 0.05, we reject the null hypothesis at the 5% significance level (95% confidence), suggesting that a real difference exists. However, if we had chosen a 99% confidence level, we would not reject the null hypothesis, because the p-value is greater than 0.01.
Our 95% confidence interval for the difference between averages (female mean minus male mean) runs from about -5 to -0.62. If we took many samples like the one we have and built an interval from each, about 95% of those intervals would contain the true difference. This confidence interval does not include 0, and therefore we reject the null hypothesis and conclude that there is a difference between the average age of men and women.
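As a rough cross-check of these numbers, the test statistic can be rebuilt by hand from the summary table. By default, t.test runs Welch's unequal-variance version of the test. The sketch below is in Python rather than R and uses the rounded means and standard deviations from the table, so it will not match R's output exactly. Using the normal distribution instead of the t distribution for the p-value is an assumption that is only reasonable because the sample is large.

```python
from math import sqrt
from statistics import NormalDist

# Rounded summary statistics from the table above
mean_f, sd_f, n_f = 27.9, 14.1, 261   # female passengers
mean_m, sd_m, n_m = 30.7, 14.7, 453   # male passengers

# Welch standard error of the difference between the two means
standard_error = sqrt(sd_f**2 / n_f + sd_m**2 / n_m)

# Test statistic: observed difference (female - male) in units of standard error
t_stat = (mean_f - mean_m) / standard_error

# Two-sided p-value, approximated with the normal distribution (assumption)
p_approx = 2 * NormalDist().cdf(-abs(t_stat))

print(round(standard_error, 2), round(t_stat, 2), round(p_approx, 3))
```

The statistic comes out near -2.5 with a standard error close to 1.1, and the approximate p-value lands near the 0.0118 reported by t.test.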

5. T test with Bootstrap

A T test with bootstrap is a good way of understanding the concepts needed to interpret the results of the T test above. Everything relies on the Central Limit Theorem, according to which, if we draw many sufficiently large samples from a population and calculate the mean of each sample, then:

(i) the distribution of these sample means will be approximately normal;

(ii) the mean of the sample means will approximate the population mean;

(iii) the standard deviation of this distribution is what we call the standard error.

In our example, we have one sample of passengers. Imagine we could collect many of those samples. If we could do that, then the means of all samples would approximate the population parameter. Bootstrap is a technique to virtually create as many samples as we want from our single sample. In our example, we have 714 ages after eliminating NAs (261 women and 453 men). We could resample 714 observations from these values, allowing them to repeat. That is the basic idea behind bootstrapping.

To carry out this procedure, we will create a function that resamples our data frame. The first line of code uses slice_sample to randomly select n rows of our dataframe, allowing the same row to be chosen more than once. Note that n is the number of rows of the dataframe. After that, we use dplyr to calculate the mean by gender. Note that we are actually interested in the difference between the male mean and the female mean. That’s what the last three lines of the function compute.


    
diff_means <- function(data) {
    sample_df <- data %>% slice_sample(n = nrow(data), replace = TRUE)
    means <- sample_df %>%
        group_by(Sex) %>%
        summarize(mean_age = mean(Age, na.rm = TRUE))
    
    male_mean <- means %>% filter(Sex == "male")   %>% pull(mean_age)
    female_mean <- means %>% filter(Sex == "female") %>% pull(mean_age)
    return(male_mean - female_mean)
}
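For readers who follow the blog's Python posts, the same resampling idea can be sketched with the standard library. This is an illustrative analogue and not a translation of the R code: the toy ages are made up, and random.choices plays the role of slice_sample with replace = TRUE.

```python
import random
from statistics import mean

def diff_means(rows, rng):
    """One bootstrap replicate: resample the rows with replacement,
    then return the male mean age minus the female mean age."""
    sample = rng.choices(rows, k=len(rows))  # same size as the data, repeats allowed
    male = [age for sex, age in sample if sex == "male"]
    female = [age for sex, age in sample if sex == "female"]
    return mean(male) - mean(female)

# Hypothetical toy data (not the Titanic ages): the male mean is 2 years higher
rows = [("male", age) for age in range(20, 60)] + \
       [("female", age) for age in range(18, 58)]

rng = random.Random(1308)  # seeded generator for reproducibility
diffs = [diff_means(rows, rng) for _ in range(500)]

print(round(mean(diffs), 2))
```

Because each replicate is drawn from the same data, the replicated differences scatter around the difference observed in the toy data, and their spread estimates the standard error.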
    

  

Now we can use the replicate function to execute our function n times. For our purpose, 1000 times is enough. Note that replicate works like a for loop. Before we do that, however, let us make a small adjustment so that we can also calculate our p-value. The p-value assumes the null hypothesis is true. Therefore, before resampling our data, let us make the difference between means be 0. For that, let us subtract the observed difference, 2.81, from the ages of all males.


    
df_null <- df %>% 
    mutate(Age = ifelse(Sex=="male", Age-2.81, Age))
    
set.seed(1308)
diffs <- replicate(1000, diff_means(df_null))

sd(diffs)
mean(diffs)

ggplot()+
    geom_histogram(aes(x = diffs), color = "white", fill = "#2E3031")+
    geom_vline(xintercept = -2.8, color = "#A33F3F")+
    geom_vline(xintercept = 2.8, color = "#A33F3F")+
    scale_color_discrete("")+
    xlab("Age Differences (Null Hypothesis)")+
    ylab("Number of Bootstrap Samples")+
    theme_bw()
    

  

Executing the commands above we get that the mean of the sampling distribution - as the distribution of the sample means is called - is approximately 0, as expected, and its standard deviation is 1.1.

The histogram above shows us what the sample differences would look like if the null hypothesis were true. The red lines show the difference we observed in reality. Do you think it is likely to observe what we observed under the null hypothesis? It is actually not, and you can calculate how unlikely with the code below:


sum(diffs>=2.81)/1000
sum(diffs<=-2.81)/1000

The code counts the bootstrap samples whose difference in means was at least as extreme as the one we observed, in either direction: 2.81 or more (male age - female age) or -2.81 or less (female age - male age). This results in 9 samples out of 1,000, or 0.9%. This estimate is very close to the p-value found using the R function t.test. Again we can reject the null hypothesis and conclude that there is a difference between the average age of men and women.

In addition to helping us better understand the test, the bootstrap method has the advantage of not assuming that the age distribution follows a normal distribution. This is another benefit of using this approach.

Please, use the comments below if you did not understand a specific point of the test or if you have a suggestion to improve the test.

From R to Tableau - Leverage Both Tools for Effective Dashboards

2025-07-06T00:00:00+00:00

When the violence causes silence, we must be mistaken.

Zombie, The Cranberries (1994)

Data analysis can be more than quarterly KPIs or complicated statistical models — it can help us remember and critically retell our past. While Latin America is often viewed as a peaceful region, the second half of the 20th century saw several brutal authoritarian regimes. Chile’s dictatorship (1973‑1990) was among the most violent.

In this post, I show how I used an R package to obtain data about the victims of Chile’s dictatorship and visualize it in Tableau Public. You’ll also discover the strengths and limitations of each tool for dashboard creation.

1. The pinochet Package

Developed by Professor Danilo Freire and colleagues, the pinochet R package provides clean and tidy data on victims of the Chilean dictatorship. Each row in the dataset represents one individual.


install.packages("pinochet")

library(pinochet)

data(pinochet)  # loads the data in a data frame called pinochet

str(pinochet)   # explores the structure of the data frame

R excels at complex tasks — such as causal inference and statistical analyses — and it is equally powerful (and free) for data exploration and interactivity. With Shiny, you can build attractive dashboards entirely in R. However, mastering Shiny and producing polished interactive visuals with libraries like Plotly can take significant time and practice.

In this context, Tableau Public is an appealing option. It is the free edition of Tableau, designed for exploring public datasets and building engaging dashboards while you learn. Tableau is a drag-and-drop tool that lets you create visualizations without writing code. It is less versatile than R, but it is also easier to learn and use. In just a few hours, you can build beautiful exploratory dashboards using drag-and-drop alone. That’s why I chose Tableau to visualize this data. To bring the dataset into Tableau, I saved it as an Excel (.xlsx) file.


library(writexl)

write_xlsx(pinochet, "pinochet.xlsx")

2. Tableau Public

Tableau Public is a free, public platform for exploring, creating, and sharing data visualizations. It offers a more limited version of the well-known data visualization tool, Tableau.

Tableau, like ggplot2, has its roots in the Grammar of Graphics, a framework for understanding and creating visualizations. Within this framework, a plot is built by mapping data variables to visual aesthetics. In Tableau, this mapping is accomplished through drag-and-drop: you literally place fields onto the X-axis, Y-axis, Color shelf, and so on.

In contrast, ggplot2 mappings happen through code:


ggplot(data = df, aes(x = x, y = y, color = gender)) + geom_point()

You can download Tableau Public Desktop from the official Tableau webpage. When you open it, you can easily load the Excel file you saved from R by selecting a connection to Microsoft Excel, or to Text File if you prefer to save the data as a .csv.

Please check out this tutorial to learn more about Tableau Public. You can also download my Tableau workbook, which contains the dashboard, and see how I created it. Don’t forget to leave a star if you enjoy it! 🙂

3. The Dashboard and Key Insights

3.1 The Dashboard

The dashboard is organized into four interactive sections:

1. Tough Years
A bar chart of victims per year.
Tip: Scrub over the bars to filter by year.

2. Occupation & Place of Disappearance
Treemap and map views.
Tip: Click an occupation to highlight where those victims disappeared.

3. An Exploratory Memorial
One star per confirmed victim.
Tip: Hover to read personal details.

4. Age & Gender
Histogram split by gender.
Tip: Hover bars to see counts; toggle genders in the legend.

3.2 Key Insights

Here are some insights from the dashboard:

1973 was the deadliest year, with ~1,230 victims during the coup.

Blue-collar workers made up almost half the victims, revealing a class dimension of state violence.

Students (university and school) accounted for nearly 13% of the disappeared — a stark cost of activism.

96% of the victims were male, but the women’s stories reveal deep family traumas.

Most victims were between 20–30 years old — showing how youth were disproportionately targeted.

No place was safe — from Santiago to remote mining towns, disappearances happened everywhere.

4. Conclusions and Limitations

Tableau is a user-friendly tool for creating visual dashboards — especially good for quick exploration and sharing. It supports traditional charts and maps, and its drag‑and‑drop interface is great for beginners.

However, it has limitations. It lacks advanced statistical tools and doesn’t support robust preprocessing or modeling tasks. That’s where R truly shines.

Used together, R and Tableau offer a powerful combo for data-driven storytelling.

Data Source: Freire, D., Mingardi, L., & McDonnell, R. (2019). pinochet: Data About the Victims of the Pinochet Regime, 1973–1990

Link to Tableau Public Dashboard

What other historical datasets would you like to see visualized? Share your ideas in the comments below!

My Journey Learning R as a Humanities Undergrad

2025-04-22T00:00:00+00:00

1. A Passion for the Past

Since I was a teenager, History has been one of my passions. I was very lucky in high school to have a great History teacher whom I could listen to for hours. My interest was, of course, driven by curiosity about all those dead humans in historical plots that exist no more except in books, images, movies, and — mostly — in our imagination.

However, what really triggered my passion was realizing how different texts can describe the same event from such varied perspectives. We are able to see the same realities in different ways, which gives us the power to shape our lives — and our future — into something more meaningful, if we so choose.

2. First Encounters with R

When I began my master’s in public policy at the Hertie School in Berlin, Statistics I was a mandatory course for both management and policy analysis, the two areas of concentration offered in the course. I began the semester certain I would choose management because I’d always struggled with mathematical abstractions. However, as the first semester passed, I became intrigued by some of the concepts we were learning in Statistics I. Internal and external validity, selection bias, and regression to the mean were concepts that truly captured my interest and have applications far beyond statistics, reaching into many areas of research.

The Hertie School Building. Source: Zugzwang1972, CC BY 3.0, via Wikimedia Commons

Then came our first R programming assignments. I struggled endlessly with function syntax and felt frustrated by every error — especially since I needed strong grades to pass Statistics I. Yet each failure also felt like a challenge I couldn’t put down. I missed RStudio’s help features and wasted time searching the web for solutions, but slowly the pieces began to click.

3. Discovering DataCamp

By semester’s end, I was eager to dive deeper. That’s when I discovered that, as master’s candidates, we had free access to DataCamp, a platform that combines short, focused videos with in-browser coding exercises, no software installation required. The instant feedback loop (seeing my ggplot chart render in seconds) gave me a small win every day. Over a few months, I completed courses from Introduction to R and ggplot2 to more advanced statistical topics. DataCamp’s structured approach transformed my frustration into momentum. Introduction to Statistics in R was one of my first courses and helped me pass Stats I with a better grade. You can test the first chapter for free to see if it matches your learning style.

DataCamp Method. Source: AI Generated.


The links to DataCamp in this post are affiliate links. That means if you click them and sign up, I receive a small share of the subscription value from DataCamp, which helps me maintain this blog. That being said, there are many free resources on the Internet that are very effective for learning R without spending any money. One suggestion is the free HTML version of "R Cookbook", which helped me a lot to deepen my R skills: R Cookbook

4. Building Confidence and Choosing Policy Analysis

Armed with new R skills, I chose policy analysis for my concentration area—and I’ve never looked back. Learning to program in R created a positive feedback loop for my statistical learning, as visualizations and simulations gave life to abstract concepts I once found very difficult to understand.

5. Pandemic Pivot

Then the pandemic of 2020 hit, which in some ways only fueled my R learning since we could do little besides stay home at our computers. Unfortunately, my institution stopped providing us with free DataCamp accounts, but I continued to learn R programming and discovered Stack Overflow — a platform of questions and answers for R and Python, among other languages — to debug my code.

I also began reading more of the official documentation for functions and packages, which was not as pleasant or easy as watching DataCamp videos, which summarized everything for me. As I advanced, I had to become more patient and persevere to understand the packages and functions I needed. I also turned to books—mostly from O’Reilly Media, a publisher with extensive programming resources. There are also many free and great online books, such as R for Data Science.

Main Resources Used to Learn R. Source: Author.

6. Thesis & Beyond

In 2021, I completed my master’s degree with a thesis evaluating educational policies in Brazil. To perform this analysis, I used the synthetic control method, implemented via an R package. If you’re interested, you can read my thesis here: Better Incentives, Better Marks: A Synthetic Control Evaluation of Educational Policies in Ceará, Brazil. My thesis is also an example of how you can learn R by working on a project with goals and final results. It also introduced me to Git and GitHub: Git is a well-known system for controlling the versions of your coding projects, and GitHub is a nice place to showcase your coding skills.

7. AI as a resource to learn programming

Although AI wasn’t part of my initial learning journey, I shouldn’t overlook its growing influence on programming in recent years. I wouldn’t recommend relying on AI for your very first steps in R, but it can be a valuable tool when you’ve tried to accomplish something and remain stuck. Include the error message you’re encountering in your prompt, or ask AI to explain the code line by line if you’re unsure what it does. However, avoid asking AI to write entire programs or scripts for you, as this will limit your learning and you may be surprised by errors. Use AI to assist you, but always review its suggestions and retain final control over your code.

Key Takeaways

Learning R as a humanities major can be daunting, but persistence pays off.
Embrace small, consistent wins — DataCamp’s bite‑sized exercises are perfect for that.
Visualizations unlock understanding — seeing data come to life cements concepts.
Phase in documentation and books when you need to tackle more advanced topics.
Use AI to debug your code and explain what the code of other programmers does.
Join the community — Stack Overflow, GitHub, online books and peer groups bridge gaps when videos aren’t enough.

Ready to Start Your Own Journey?

If you’re also beginning or if you want to deepen your R skills, DataCamp is a pleasant and productive way to get going. Using my discounted link below supports Coding the Past and helps me keep fresh content coming on my blog:

Start Learning R on DataCamp with My Discounted Link

What was the biggest challenge you faced learning R? Share your story in the comments below!

geom_bar() in ggplot2 Explained - When to Use stat=’count’ vs stat=’identity’

2025-02-24T00:00:00+00:00

ggplot2 is a powerful and well-known data visualization package for R. But do you know what gg stands for? It actually refers to the Grammar of Graphics, a conceptual framework for understanding and constructing graphs. The core idea behind the Grammar of Graphics is that a plot consists of multiple layers.

The most well-known layers are geometries — the geometric forms that represent data in a plot — and aesthetic mappings, which connect data to specific visual properties. A lesser-known but equally important layer is the statistical layer, which transforms the original data to enable specific types of plots. This may sound complex at first, but it’s actually quite intuitive. In this lesson, we will explore how geom_bar() applies a statistical transformation to make bar plots simpler and more straightforward.

1. How does geom_bar work by default?

To exemplify geom_bar’s default behavior, we will use a dataset about Westminster inquests conducted between 1760 and 1799. These inquests document investigations into deaths that occurred under sudden, unexplained, or suspicious circumstances. To learn more, please visit the project webpage London Lives 1690-1800: Crime, Poverty and Social Policy in the Metropolis.

The first step is to load the data using read_tsv(), a function from the readr package used to read tab-separated values. The verdict variable tells us the conclusion of the investigation, which could be, for example, that the death was a homicide or a suicide. To simplify our analysis, we unify ‘suicide (delirious)’, ‘suicide (felo de se)’, and ‘suicide (insane)’ into a single category: ‘suicide’. We also filter out observations where the verdict or gender is missing.


library(readr)
library(dplyr)
library(ggplot2)

df <- read_tsv("wa_coroners_inquests_v1-1.tsv")

df_prep <- df %>% 
  filter(verdict != "-") %>% 
  filter(gender %in% c("m", "f")) %>% 
  mutate(verdict = recode(verdict, "suicide (delirious)" = "suicide",
                          "suicide (felo de se)" = "suicide",
                          "suicide (insane)" = "suicide"))

Each row of df_prep contains data about the investigation of one death, including the date, gender, and verdict. We would like to get a first overview of the verdicts to determine how many deaths were classified as homicide, suicide, accidental, etc. The default behavior of geom_bar makes it very easy to visualize this information:

theme_set(theme_bw()) # chooses a lighter ggplot2 theme: theme_bw()

ggplot(data = df_prep)+
    geom_bar(aes(x=verdict))

Why does this work if we mapped a categorical variable to x? Where does ggplot2 get the count for each cause of death? Well, every geometry in ggplot2 has an associated default statistical transformation that tells ggplot whether it should consider the raw input data or whether it should first transform the dataset and then plot it. In the case of geom_bar, the default stat is “count”. That means ggplot will create a second dataframe with the values of verdict and their respective frequency/count, as shown in the figure below.

As you can see, ggplot2 does this work for you. But what if your data has already been aggregated? In that case, you need to explicitly set geom_bar(aes(x = verdict, y = count), stat = "identity"). If stat is set to “identity”, ggplot takes the raw input data and does not perform any transformation, so both an x and a y aesthetic are required. The shortcut geom_col() does the same thing, since its stat is always “identity”.

You can use the command `layer_data(plot = last_plot(), i = 1L)` to inspect the data ggplot transformed for you. Run it after the plot command: it retrieves the transformed data from the last plot for layer i = 1L, that is, the first layer of our plot (geom_bar in this case).
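
For readers who also follow the Python lessons on this blog, the counting step behind stat = "count" can be mimicked with pandas. This is only a rough sketch of the idea, not what ggplot2 runs internally; the `inquests` data frame and its five rows are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-dataset: one row per inquest, in the spirit of df_prep
inquests = pd.DataFrame(
    {"verdict": ["accidental", "suicide", "accidental", "homicide", "accidental"]}
)

# Collapse the raw rows into one row per category with its frequency,
# which is the kind of table a bar geometry needs to draw bar heights
counts = (
    inquests.groupby("verdict")
    .size()
    .reset_index(name="count")
)
print(counts)
```

The resulting table has one row per verdict (3 accidental, 1 homicide, 1 suicide), which is the shape of data you would pass to geom_bar() with stat = "identity".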

2. How to reorder geom_bar?

One improvement we can make to our plot is to reorder the verdicts so that the most frequent one comes first. This can be done with the help of the forcats package. One of its functions, fct_infreq(), reorders a variable based on the frequency of its values (largest first).

library(forcats) # provides fct_infreq()

ggplot(data = df_prep)+
    geom_bar(aes(x=fct_infreq(verdict)))

3. Stacked and percent stacked geom_bar

Imagine now that you would like to investigate how the verdicts compare across genders, highlighting the cases involving female individuals. This can easily be achieved by mapping gender to the fill aesthetic. The result is two bar segments stacked on top of each other, one for male and the other for female cases.

In the code below, we also make our plot more visually attractive by changing the colors, legend title, and labels. Moreover, we adjust the axis labels.

ggplot(data = df_prep)+
    geom_bar(aes(x=fct_infreq(verdict), fill = gender))+
    scale_fill_manual(name = "", values = c("#f79326", "gray"), labels = c("Female", "Male"))+
    labs(x = "", y = "Number of Cases")

The stacked bar chart above results from the default position = "stack" configuration. To better visualize the distribution of female and male cases for each cause of death (verdict), we can display the percentages instead of absolute counts. This approach makes it easier to see in which verdict category females have a higher proportion. To achieve this, you need to change position to position = "fill" in geom_bar().

ggplot(data = df_prep)+
    geom_bar(aes(x=fct_infreq(verdict), fill = gender), position = "fill")+
    scale_fill_manual(name = "", values = c("#f79326", "gray"), labels = c("Female", "Male"))+
    labs(x = "", y = "Percentage")

Now it is clearer that, among all causes of death, homicides have the highest proportion of women. Moreover, the smallest percentage of female cases corresponds to accidental deaths.
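
To make the idea behind position = "fill" concrete, here is a small pandas sketch of the underlying arithmetic: within each verdict, the counts per gender are divided by the verdict total so that every bar sums to 1. The six rows below are invented for illustration and do not come from the inquests dataset.

```python
import pandas as pd

# Hypothetical cases: two homicides and four suicides
cases = pd.DataFrame({
    "verdict": ["homicide", "homicide", "suicide", "suicide", "suicide", "suicide"],
    "gender":  ["f", "m", "f", "m", "m", "m"],
})

# normalize="index" divides each row (verdict) by its own total,
# turning counts into within-verdict proportions
shares = pd.crosstab(cases["verdict"], cases["gender"], normalize="index")
print(shares)
```

In this toy table, women account for half of the homicides but only a quarter of the suicides, which is the kind of comparison the percent stacked bar chart makes visible.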

4. Use stat_bin to group observations by date

Further examining the data, you want to study how the proportion of suicide cases among women has evolved over time. One way to do this is by filtering only suicide verdicts and visualizing the proportion of female suicide cases across time. Since we have data spanning multiple years, it is a good idea to group them into bins and count the cases within each period. This can be done using stat_bin(), which works similarly to geom_bar() but groups data into bins.

Since our dataset is in a tidy format — where each row represents a single case — we can count the number of occurrences within a specific bin to determine how many cases fall into each time interval. That’s why we set x to doc_date, the date of the investigation. Additionally, we can specify the number of bins by setting a value for the bins parameter. In the code below, we set bins = 10. We also set color = “white” to create white borders around the bars. Apart from these modifications, the code remains the same as in the geom_bar() example above.

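# Keep only suicide verdicts, as described above.
# Assumption: doc_date was parsed as a date when the data was loaded.
df_prep_2 <- df_prep %>% 
  filter(verdict == "suicide")

ggplot(data = df_prep_2)+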
    stat_bin(aes(x=doc_date, fill = gender), 
             position = "fill", 
             bins = 10, 
             color = "white")+
    scale_fill_manual(name = "", values = c("#f79326", "gray"), labels = c("Female", "Male"))+
    labs(x = "", y = "Percentage")

The plot shows a slight decreasing trend in the proportion of female suicide cases between 1760 and 1800. It also highlights that, throughout the entire period, males accounted for at least 60% of suicide cases.

I would love to hear any feedback or suggestions for improving the plots above. Feel free to share your thoughts or ask any questions in the comments below! Happy coding!

Conclusions

geom_bar and stat_bin are powerful tools to depict frequencies of subgroups in your data;
The geom_bar stat and position parameters allow users to plot several kinds of bar plots, turning geom_bar into a versatile visualization tool.

I would like to thank June Choe for this brilliant explanation about stat layers in ggplot2. Also, thanks a lot to Sharon Howard for preparing this intriguing dataset and for making it available.

How to calculate Z-Scores in Python

2024-11-28T00:00:00+00:00

If you’ve worked with statistical data, you’ve likely encountered z-scores. A z-score measures how far a data point is from the mean, expressed in terms of standard deviations. It helps identify outliers and compare data distributions, making it a vital tool in data science.

In this guide, we’ll show you how to calculate z-scores in Python using a custom function and built-in libraries like SciPy. You’ll also learn to visualize z-scores for better insights.

1. What is a z-score?

A z-score measures how many standard deviations a data point is from the mean. The formula for calculating the z-score of a data point X is:

\[Z_{X} = \frac{X - \overline{X}}{S}\]

Where:

\(Z_{X}\) is the z score of the point \(X\);
\(X\) is the value for which we want to calculate the Z score;
\(\overline{X}\) is the mean of the sample;
\(S\) is the standard deviation of the sample.

2. Python z score using a custom function

A custom function allows you to implement the z-score formula directly. Here’s how to define and use it in Python:

def calculate_z(X, X_mean, X_sd):
    return (X - X_mean) / X_sd

The function takes three arguments:

a vector X of values for which you want to calculate the z-scores, like a pandas dataframe column, for example;
the mean of the values in X;
the standard deviation of the values in X.

Finally, in the return clause, we apply the z-score formula explained above.
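
Before moving to real data, it can help to try the function on a handful of numbers. The sketch below repeats the function so that it runs on its own; the eight values are made up for illustration. By construction, the resulting z-scores should have a mean of about 0 and a standard deviation of about 1.

```python
import pandas as pd

def calculate_z(X, X_mean, X_sd):
    return (X - X_mean) / X_sd

# Hypothetical values, chosen only to illustrate the calculation
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# pandas' .std() returns the sample standard deviation (ddof=1)
z = calculate_z(values, values.mean(), values.std())

print(z.round(2))
print(round(z.mean(), 10), round(z.std(), 10))
```

Values below the mean (5 in this toy example) get negative z-scores, and values above it get positive ones.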

To test our function, we will use data from Playfair (1821). He collected data regarding the price of wheat and the typical weekly wage for a “good mechanic” in England from 1565 to 1821. His objective was to show how well-off working men were in the 19th century. This dataset is available in the HistData R package and also on the webpage of Professor Vincent Arel-Bundock, a great source of datasets. It consists of 3 variables: year, price of wheat (in Shillings) and weekly wages (in Shillings).

We will be calculating the z-scores for the weekly wages. First we load the dataset directly from the website, as indicated in the code below.

import pandas as pd

data = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Wheat.csv")

print(data['Wages'].mean())
print(data['Wages'].std())

data["z-score_wages"] = calculate_z(data["Wages"], data["Wages"].mean(), data["Wages"].std())

The average weekly wage during the period was 11.58 Shillings, with a standard deviation of 7.34. With this information, we can calculate the Z score for each observation in the dataset. This is done and stored in a new column called “z-score_wages”.

If you check the first row of the data frame, you will find out that in 1565 the z score was around -0.9, that is, the wages were 0.9 standard deviations below the mean of the values for the whole period.
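
As a quick plausibility check, the formula can be inverted using only the rounded figures quoted above (a mean of 11.58, a standard deviation of 7.34 and a z score of about -0.9), so the result is approximate:

\[X = \overline{X} + Z_{X} \cdot S \approx 11.58 + (-0.9)(7.34) \approx 4.97\]

In other words, a z score of about -0.9 corresponds to a weekly wage of roughly 5 Shillings in 1565.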

3. Python z score using SciPy

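A second option to calculate z-scores in Python is to use the zscore function from SciPy's stats module, as shown below. Ensure you set a policy for handling missing values if your dataset is incomplete. Note that zscore uses the population standard deviation by default (ddof=0), whereas pandas' std() uses the sample standard deviation (ddof=1), so its results differ slightly from the custom function above unless you pass ddof=1.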

In the code below, we calculate the z-scores for Wheat prices. If you look at the z-score summary statistics, you will see that the price of wheat varied between -1.13 and 3.65 standard deviations away from the mean in the observed period.

from scipy import stats

data["z-score_wheat"] = stats.zscore(data["Wheat"], nan_policy="omit")

data["z-score_wheat"].describe()
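
One detail worth checking when you compare the two approaches is the standard deviation each one uses. The sketch below, on made-up numbers, contrasts a manual calculation based on pandas' std() (sample standard deviation, ddof=1) with stats.zscore, whose default is ddof=0. It also shows what nan_policy="omit" does with a missing value.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical values, chosen only for illustration
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

z_custom = (values - values.mean()) / values.std()        # pandas: ddof=1
z_scipy_default = stats.zscore(values.to_numpy())         # SciPy default: ddof=0
z_scipy_sample = stats.zscore(values.to_numpy(), ddof=1)  # matches pandas

print(np.allclose(z_custom, z_scipy_sample))
print(np.allclose(z_custom, z_scipy_default))

# With nan_policy="omit", the NaN is ignored when computing the mean and
# standard deviation, but it still appears as NaN in the output
with_gap = np.array([2.0, 4.0, np.nan, 9.0])
z_gap = stats.zscore(with_gap, nan_policy="omit")
print(z_gap)
```

For a dataset with many observations the difference between the two conventions is small, but it explains why the two methods may not agree to the last decimal.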

4. Visualising z scores

Below you can better visualize the basic idea of z scores: to measure how far away a data point is from the mean in terms of standard deviations. This visualization was created in D3, a JavaScript library for interactive data visualization. Click “See average wage” to see the average wage for the whole period. Then check out how far from the mean each data point is and finally note that the z-score expresses this distance in units of standard deviation.

5. Visualizing z scores with Matplotlib

The code below plots the wage z scores over time and shows them as the distance from the point to the mean, as demonstrated in the D3 visualization above. Please consult the lesson ‘Storytelling with Matplotlib - Visualizing historical data’ to learn more about Matplotlib visualizations.

import matplotlib.pyplot as plt

# Mean of the wage z-scores (approximately 0 by construction)
mean_wage = data["z-score_wages"].mean()

# Create the plot
fig, ax = plt.subplots(figsize=(10, 6))

# Scatter plot of wage z-scores over the years
ax.plot(data["Year"], data["z-score_wages"], 'o', color='#FF6885', label="Wage Z-scores", markeredgewidth=0.5)

# Add a horizontal line for the mean z-score
ax.axhline(y=mean_wage, color='gray', linestyle='dashed', label=f"Mean Z-score = {mean_wage:.2f}")

# Add gray lines connecting points to the mean
for year, wage in zip(data["Year"], data["z-score_wages"]):
    ax.plot([year, year], [mean_wage, wage], color='gray', linestyle='dotted', linewidth=1)

# Customize the plot
ax.set_xlabel("Year")
ax.set_ylabel("Z-scores")
ax.set_title("Z-scores Over Time")
ax.legend()

# Show the plot
plt.show()

Have questions or insights? Leave a comment below, and I’ll be happy to help.

Happy coding!

Conclusions

A z score is a measure of how many standard deviations a data point is away from the mean. It can be easily calculated in Python;
You can visualize z-scores using traditional Python libraries like Matplotlib or Seaborn.

Sentiment Analysis in R

2024-10-21T00:00:00+00:00

In this lesson on sentiment analysis in R, you will learn how to perform sentiment analysis using the sentimentr package. To demonstrate the use of the package, you will compare the sentiment in the speeches of Adolf Hitler and Franklin Roosevelt about the declaration of war by Germany against the United States in 1941.

These speeches are analyzed here strictly for research purposes. Read more about an academic project to make Hitler speeches available for research: Collection of Adolf Hitlers Speeches, 1933-1945

1. What is sentiment analysis?

Sentiment analysis or opinion mining consists of detecting the emotional tone of natural language. It works by assigning an emotion or emotional score to each word in a text. Some methods consider each word separately and others approach them in a wider context, for example, by evaluating their emotion considering its position in a sentence.

In this post we will be taking the latter approach, because a word's context often influences the emotion it conveys. For this reason, the sentimentr package is a great option for sentiment analysis in R: it calculates the sentiment at the sentence level. Each sentence is assigned a score that, in our example, varies from around -1.2 (very negative) to around 1.2 (very positive).

The sentimentr package takes into account valence shifters that can change the emotion of a sentence, for example:

negator: I do not like it.
amplifier: I really like it.
de-amplifier: I hardly like it.
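
To get a feeling for how valence shifters can change a score, here is a deliberately simplified Python sketch. It is a toy model and not the algorithm implemented in sentimentr: the word lists, the polarity values and the multipliers (1.5 and 0.5) are all assumptions made up for this illustration, and only the word immediately before a polarized word is inspected.

```python
# Toy lexicon and shifter lists: hypothetical values for illustration only,
# not the dictionaries or weights used by sentimentr
POLARITY = {"like": 1.0, "love": 1.5, "hate": -1.5}
NEGATORS = {"not", "never"}
AMPLIFIERS = {"really", "very"}
DEAMPLIFIERS = {"hardly", "barely"}

def toy_sentence_score(sentence):
    """Sum the polarity of each known word, adjusted by the preceding word."""
    words = sentence.lower().strip(".!?").split()
    score = 0.0
    for i, word in enumerate(words):
        if word not in POLARITY:
            continue
        value = POLARITY[word]
        previous = words[i - 1] if i > 0 else ""
        if previous in NEGATORS:
            value = -value   # a negator flips the sign
        elif previous in AMPLIFIERS:
            value *= 1.5     # an amplifier strengthens the emotion
        elif previous in DEAMPLIFIERS:
            value *= 0.5     # a de-amplifier weakens it
        score += value
    return score

for s in ["I like it.", "I do not like it.", "I really like it.", "I hardly like it."]:
    print(s, toy_sentence_score(s))
```

Even in this toy version, the same word "like" ends up with four different scores depending on its neighbour, which is why a sentence-level approach can be more informative than scoring words in isolation.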

Check the package repository if you are interested in the math behind the methodology: Rinker, Tyler W. 2021. sentimentr: Calculate Text Polarity Sentiment. Buffalo, New York.

2. How to get the data?

We will gather the data for this example from two webpages using web scraping. If you want to learn more about web scraping, please consult ‘How to webscrape in R?’. The rvest package will be used to webscrape, specifically, the following three functions:

read_html: Extracts the HTML source code associated with a URL;
html_elements: Extracts the relevant HTML elements from the HTML code;
html_text: Extracts the text (content) from the HTML elements;

The first step is to load the necessary packages and to save the URLs of the two speeches in variables. Please follow the instructions of the sentimentr package webpage to install it.

library(rvest) # for webscraping
library(tidytext) # for cleaning text data
library(dplyr) # for data preparation
library(ggplot2) # for data viz
library(sentimentr) # for sentiment analysis in R


url_h <- "https://en.wikisource.org/wiki/Adolf_Hitler%27s_Declaration_of_War_against_the_United_States"
url_r <- "https://www.archives.gov/milestone-documents/president-franklin-roosevelts-annual-message-to-congress#transcript"

If you inspect the source code of the webpages referenced above, you will realise that while the text from Wikisource can be gathered by simply extracting the p elements, for the speech from the American archives, we need to specify the particular div element where the speech is located. This is because the webpage contains an initial section with several paragraphs introducing President Roosevelt’s speech. In the code below, note that Roosevelt’s speech requires an additional step to specify that the speech is within the div.col-sm-9 (a div with the class “col-sm-9”). Also, note that we exclude the first text element of Hitler’s speech because it is actually metadata about the speech.

# Webscraping Hitler's speech
speech_h <- read_html(url_h) %>% 
    html_elements("p") %>% 
    html_text()

# Webscraping Roosevelt's speech
speech_r <- read_html(url_r) %>% 
    html_elements("div.col-sm-9") %>% 
    html_elements("p") %>% 
    html_text()

# Excluding the first text element of Hitler's speech, because it is metadata
speech_h <- speech_h[2:155]

3. Performing sentiment analysis in R with sentimentr

Our next objective is to further split each of the paragraphs of our speeches into sentences. This can be achieved with the get_sentences function from the sentimentr package. This function takes a character vector, splits each element of this vector into sentences and delivers them in a list object. Each paragraph of our speeches becomes one list element that consists of a character vector containing the sentences of the respective paragraph.

sentences_h <- get_sentences(speech_h)
sentences_r <- get_sentences(speech_r)

Finally we can apply sentiment analysis to our sentences. We do that by using the sentiment function. It delivers a data frame containing:

element_id: identifies the paragraph;
sentence_id: identifies the sentence;
word_count: informs how many words the sentence has;
sentiment: informs the sentiment score attributed to that sentence;

In the code below we also check the most negative sentence in both speeches by ordering the data frames by sentiment (ascending) and getting the IDs of the sentences. Note that to access a sentence in the list, you use the following syntax: list[[element_id]][sentence_id].

sentiment_h <- sentiment(sentences_h)
sentiment_r <- sentiment(sentences_r)

# Checking the most negative sentences (element and sentence IDs)
sentiment_h %>% 
    arrange(sentiment) %>% 
    head(1)

sentiment_r %>% 
    arrange(sentiment) %>% 
    head(1)

# Checking the most negative sentences (text)

sentences_h[[148]][1]
sentences_r[[39]][1]

Hitler’s most negative sentence: The government of the United States of America, having violated in the most flagrant manner and in ever increasing measure all rules of neutrality in favor of the adversaries of Germany, and having continually been guilty of the most severe provocations toward Germany ever since the outbreak of the European war, brought on by the British declaration of war against Germany on 3 September 1939, has finally resorted to open military acts of aggression.
Roosevelt’s most negative sentence: I am not satisfied with the progress thus far made.

The next step is to visualize how the sentiment of both authors changed over the duration of the speech. For that, we will add two variables to each data frame: one to identify the author of the speech and the other to identify the order of the sentence in the speech (a sort of time variable). We also combine the two data frames into one to make the plot coding with ggplot2 easier.

# adding a column to identify author and sentence order
sentiment_h$author <- "Adolf Hitler"
sentiment_h$sentence_n <- as.numeric(rownames(sentiment_h))

sentiment_r$author <- "Franklin Roosevelt"
sentiment_r$sentence_n <- as.numeric(rownames(sentiment_r))

# union of the two df
df_union <- rbind(sentiment_h, sentiment_r)

To plot the sentiment using ggplot2, we assign the sentence order to the x axis, sentiment to the y axis and author to the color aesthetics. We then use geom_point to plot one point per sentence according to its sentiment and order in the speech. We use geom_smooth to visualise the trend of the sentiment through the speech. Read more about geom_smooth here.

The scale_color_manual layer allows us to choose the colors attributed to each author. Feel free to choose your colors and ggplot2 theme. To add the same ggplot2 theme as used in these plots, please check theme_coding_the_past(), our theme that is available here: ‘Climate data visualization with ggplot2’.

ggplot(data = df_union, aes(x = sentence_n, 
                            y = sentiment,
                            color = author))+
    geom_point(alpha = .4)+
    scale_color_manual(name = "", values=c("#FF6885", "white"))+
    geom_smooth(se=FALSE)+
    xlab("Sentence Order")+
    ylab("Sentiment")+
    theme_coding_the_past()

Note that Roosevelt’s speech is shorter than Hitler’s. They both address the declaration of war made by Germany against the US, but it is quite clear that the tone and emotions of Roosevelt are more positive. He starts low and raises the emotional tone until the end of the speech. The amplitude of Hitler’s emotions is a lot larger and, in general, the emotions are more negative.

In this case, sentiment analysis could be a powerful tool for a researcher to preselect which speeches to further analyze according to the emotional tone of interest. The method could also enrich research comparing the speeches of more than two personalities and help to find personal styles and traits in the speeches of each personality. Finally, from a data science perspective, it would be interesting to know the differences in the results of sentiment analysis at the word level versus the analysis at the sentence level (as carried out in this post).

Feel free to leave your comment or question below and happy coding!

4. Conclusions

The sentimentr package allows you to perform sentiment analysis in R, providing a powerful tool to estimate the emotional tone of sentences;
Sentiment analysis can be a powerful tool to preselect large amounts of texts and to find particular characteristics across different authors.

How to webscrape in R?

2024-09-10T00:00:00+00:00

In this lesson you will learn the basics of webscraping with the rvest R package. To demonstrate how it works, you will extract three speeches by Adolf Hitler from Wikisource pages and analyze their word frequencies!

These speeches are analysed here strictly for research purposes. Read more about an academic project to make Hitler speeches available for research: Collection of Adolf Hitlers Speeches, 1933-1945

1. What is webscraping?

Simply put, webscraping is the process of gathering data on webpages. In its basic form, it consists of downloading the HTML code of a webpage, locating in which element of the HTML structure the content of interest is and, finally, extracting and storing it locally for further data analysis.

Keep in mind that webscraping can be more complex if the target website uses JavaScript to render content. In this case, consider combining rvest with other libraries, as described here.

2. How to webscrape in R?

There are several libraries developed to webscrape in R. In this lesson, we will stick to one of the most popular, rvest. This library is part of the tidyverse set of libraries and allows you to use the pipe operator (%>%). It is inspired by Python’s Beautiful Soup and RoboBrowser. The basic steps for webscraping with rvest would involve using the following functions:

read_html: Extracts the HTML source code associated with a URL;
html_elements: Extracts the relevant HTML elements from the HTML code;
html_text: Extracts the text (content) from the HTML elements;
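
Since rvest was inspired by Python tools, it may help to see the same three steps (read the HTML, select the elements, extract the text) sketched in Python. The example below uses only the standard library's html.parser module and an inline HTML string instead of a real URL, so it runs without a network connection. The class name ParagraphExtractor and the sample HTML are made up for illustration, and this is a rough analogy rather than a description of how rvest works internally.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self.paragraphs.append("")  # start a new paragraph

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside_p = False

    def handle_data(self, data):
        if self._inside_p:
            self.paragraphs[-1] += data  # keep text, including nested tags' text

# Hypothetical page content standing in for a downloaded webpage
html = (
    "<html><body><h1>Title</h1><p>First paragraph.</p>"
    "<div><p>Second <b>bold</b> one.</p></div></body></html>"
)

parser = ParagraphExtractor()
parser.feed(html)
print(parser.paragraphs)
```

As with html_elements("p") followed by html_text(), the heading is ignored and each paragraph becomes one element of the result.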

There is a lot of debate on whether webscraping is ethical/legal or not. It depends a lot on where you are and the kind of content and purpose of your webscraping. Usually the robots.txt file of a website gives you hints about what is allowed and disallowed in a website. For more details on this debate, please check this link.

To illustrate how this works, we will extract the text of three speeches made by Adolf Hitler during the Second World War. The first step is to save the URLs of these speeches in variables. We also load the necessary libraries. Please install them if you haven’t already done so.

library(rvest) # for webscraping
library(tidytext) # for cleaning text data
library(dplyr) # for data preparation
library(ggplot2) # for data viz

speech_01 <- "https://en.wikisource.org/wiki/Adolf_Hitler%27s_Address_at_the_Opening_of_the_Winter_Relief_Campaign_(4_September_1940)"
speech_02 <- "https://en.wikisource.org/wiki/Adolf_Hitler%27s_Address_to_the_Reichstag_(4_May_1941)"
speech_03 <- "https://en.wikisource.org/wiki/Adolf_Hitler%27s_Declaration_of_War_against_the_United_States"

Since we are going to extract the content of three speeches, it is a good idea to create a function to perform this task, as the same steps will be repeated three times. If you inspect the URLs above, you will realize that the text content is located inside <p> (paragraph) tags. Therefore, our target is to extract these elements. Note that in Firefox and Chrome, you can inspect a webpage by right clicking any area of the page and clicking “inspect”. For other browsers the procedure should be similar. If you have difficulty finding this option, please check the browser documentation.

Our read_speech function is pretty straightforward. The read_html reads the URL of the webpage and delivers the HTML of it. The pipe operator %>% passes the output of one function to the input of the next one. html_elements extracts only paragraph tags from the code and, finally, html_text extracts the text from the paragraph tags.

read_speech <- function(url){
  read_html(url) %>%       # download and parse the page
    html_elements("p") %>% # keep only the paragraph elements
    html_text()            # extract their text content
}

speech_04_Sep_40 <- read_speech(speech_01)
speech_04_May_41 <- read_speech(speech_02)
speech_11_Dec_41 <- read_speech(speech_03)

At this point, if you check the results, you will note that the function returns a character vector in which each element is one paragraph. We still need to make some adjustments, because the first paragraph is only a short introduction to the speech rather than part of it, so we drop the first element of each vector. For the speeches of 4 September and 11 December, that is all we need to do. If you print the speech of 4 May, you will see that the last 5 elements are also metadata and need to be excluded. The code below uses indexing to filter the data accordingly. Moreover, we convert the character vectors into tibbles (a more modern kind of data frame) to make it easier to prepare the data in the next steps.

speech_04_Sep_40 <- speech_04_Sep_40[2:60]
speech_04_May_41 <- speech_04_May_41[2:60]
speech_11_Dec_41 <- speech_11_Dec_41[2:155]

# tibble creates a modern kind of dataframe with two columns: paragraph and text
speech_04_Sep_40 <- tibble(paragraph = 1:59, text = speech_04_Sep_40) 
speech_04_May_41 <- tibble(paragraph = 1:59, text = speech_04_May_41)
speech_11_Dec_41 <- tibble(paragraph = 1:154, text = speech_11_Dec_41)

3. Visualizing the most frequent words in Hitler’s speeches

Our next objective is to visualize the top 10 words in each of Hitler’s speeches. In order to do that, we will first prepare the data, transforming the data frames from the previous step to contain one word per row with its respective count. Note that we will eliminate stopwords, that is, words with little meaning for the analysis, like articles.

A function called count_words will be created to carry out data preparation. This function will expand the data frame from the paragraph level to the word level. This is done by unnest_tokens, which transforms the table to one-token-per-row. It takes the “text” column as input and outputs a “word” column. anti_join eliminates rows containing stopwords. If you print stop_words, the data frame of stopwords provided by tidytext, you can see exactly which words are being eliminated. Finally, count counts how many times each word occurs.

count_words <- function(speech){
  speech %>% 
    unnest_tokens(output = word, input = text) %>%  # one word per row
    anti_join(stop_words, by = "word") %>%          # drop stopwords
    count(word, sort = TRUE)                        # word frequencies, descending
}

speech_04_Sep_40_count <- count_words(speech_04_Sep_40)
speech_04_May_41_count <- count_words(speech_04_May_41)
speech_11_Dec_41_count <- count_words(speech_11_Dec_41)
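
The same tokenise, drop stopwords and count pipeline can be sketched in Python with the standard library, which may help if you want to compare the logic across languages. The stopword set below is a tiny hypothetical list (tidytext's stop_words is far larger), the regular expression is a simplistic tokenizer, and the two sample paragraphs are invented.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list, for illustration only
STOPWORDS = {"the", "a", "of", "and", "is", "to", "in"}

def count_words(paragraphs):
    """Return (word, count) pairs, most frequent first, without stopwords."""
    tokens = []
    for paragraph in paragraphs:
        # Lowercase and split into words (letters and apostrophes only)
        tokens.extend(re.findall(r"[a-z']+", paragraph.lower()))
    kept = [token for token in tokens if token not in STOPWORDS]
    return Counter(kept).most_common()

paragraphs = ["The war is long.", "War and peace, and the end of war."]
top = count_words(paragraphs)
print(top[:10])
```

In this toy input, "war" appears three times and comes first, while "the", "and" and the other stopwords are dropped before counting.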

Great, now we can use ggplot2 to visualize the top 10 words in each speech. Note that we use index filtering on the data frame of interest to keep only the top 10 words. Note, as well, that we reorder the bar plot so that the bars run from the most to the least frequent word. We choose a color and eliminate the y-axis label. The same can be done for the two other speeches.

ggplot(data = speech_04_Sep_40_count[1:10,], aes(n, reorder(word, n))) +
  geom_col(color = "#FF6885", fill ="#FF6885") +
  labs(y = NULL)

Top 10 words used in Hitler’s speech of 4th September 1940

Top 10 words used in Hitler’s speech of 4th May 1941

Top 10 words used in Hitler’s speech of 11th December 1941

To add the same ggplot2 theme as used in these plots, please check theme_coding_the_past(), our theme that is available here: ‘Climate data visualization with ggplot2’.

Not surprisingly, “war” is a word that reaches the top 3 in all of Hitler’s speeches. It is also interesting that other words referring to Britain, the Balkans and Americans reflect the stage of the war at the time. For example, in the speech of 11th of December, 1941, Hitler declares war on the US and therefore we observe a high frequency of words semantically related to the US. Please leave your comments, questions or thoughts below and happy coding!

4. Conclusions

R can be an effective tool to perform webscraping, notably with the rvest package;
To smoothly clean webscraped content, you may use the tidytext package.

R vs Power BI

2024-06-23T00:00:00+00:00

1. What is R?

R is a programming language and an environment for statistical computing and visualization. R is not a general-purpose programming language, like Python or Java, because its focus is on statistical computing. The language is very popular in the academic environment and allows for complex calculations and algorithms.

2. What is Power BI?

Power BI is a suite of software and applications focused on data analysis and visualization for Business Intelligence. In this article, when we talk about Power BI, we refer to Power BI Desktop, a drag and drop application used to transform, analyse and visualize data.

3. R vs Power BI

The list below presents the main differences and similarities between R and Power BI across several aspects:

Scope: While R is more suitable for academic and complex statistical data analysis, Power BI is more adequate for quick visual analyses. While R is common in the academic context, it can also be used in companies and industries that leverage data science for decision making. In this case, R would be used to prepare the data and train models, and Power BI to visualize the findings;
Learning Curve: Power BI is user-friendly and allows the creation of beautiful visualizations with a few clicks. R, on the other hand, has a steep learning curve. It requires a lot more training and reading more complex documentation before you can produce effective visualizations;
Interface: R is a written programming language, while most tasks in Power BI are achieved with drag and drop actions;
Data Visualization: Power BI is limited in its visuals and customization options of reports and graphs, while R is flexible and versatile. There are many more chart types that can be plotted in R compared to Power BI. On the other hand, it is much easier and faster to plot appealing visualizations in Power BI compared to R;
Data Analysis: R provides libraries for advanced statistical operations that allow statistical inference, causal inference, machine learning and more complex analysis. Power BI is more suitable for answering simple Business Intelligence questions.
Price: Both platforms are free, but companies offer paid tools to enrich their functionalities.

4. R vs Power BI for digital humanities

R as well as Power BI might be used for digital humanities. R is perfect for analyses and visualizations for a scientific article. It is also the right option if you would like to implement complex algorithms. Power BI is a great fit if you would like to easily produce beautiful plots and enable user interactivity for a broader audience.

In education, for example, Power BI could be used to produce an interactive dashboard exploring the casualties of World War II. This could be used to teach history or bring insights to researchers on possible research questions.

Regarding R, this blog has plenty of examples of how to apply it to the humanities. I recommend this article, where you learn about the use of synthetic control to investigate hypotheses in history: ‘When Numbers Meet Stories - an introduction to the synthetic control method in R’

5. R vs Power BI - Examples

To exemplify the differences and similarities of R and Power BI, we will replicate in Power BI the treemap plotted in R in the lesson Treemaps in R.

The data used in R is also available in a CSV file at this link. It is part of a great initiative by Professor Vincent Arel-Bundock to gather many interesting R datasets and make them available in CSV format on this page: R Datasets.

Power BI Desktop is free, and you can download it from the official Microsoft Power BI page. To learn more about it and how to get started, please consult this resource.

In the lesson Treemaps in R we learnt how to plot a treemap in R. In this lesson we will plot the same treemap in Power BI. To do that, download the data above and save it in a folder of your choice.

When you open Power BI, you will see the option to load data from an Excel file. Choose this option and a window will open for you to select the file with your data. In that window, choose to show all files so that CSV files are listed as well. Select the cholera.csv file and confirm. You will then be offered the option to transform your data in Power Query, a tool for preparing your data before visualization. For this lesson, you can skip this step and load the data without transforming its structure.

On the bar to the right, you will see the variables of your dataset. We would like to create a treemap in which we have bigger rectangles representing the regions of London and smaller rectangles representing the districts within their respective region. The size of the rectangles will inform us about the mortality caused by cholera in a given region and district. These are the relevant variables for us:

region will define our outer rectangles (categories) and will represent regions of London (West, North, Central, South, Kent);
district will define our inner rectangles (details), representing the districts of London;
cholera_drate represents deaths caused by cholera per 10,000 inhabitants in 1849 and will define the size of the rectangles.
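To make the roles of these three variables more concrete, here is a minimal Python sketch of the kind of aggregation a treemap relies on: summing cholera_drate per region and per district. It is only an illustration, not what Power BI runs internally. The rows are invented placeholder values (they are not taken from the real cholera.csv), and the helper name `sum_by` is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows for illustration only: the district names and numbers are
# made up and are NOT taken from the real cholera.csv file.
SAMPLE_CSV = """region,district,cholera_drate
West,District A,30
West,District B,10
South,District C,120
South,District D,40
"""

def sum_by(rows, key):
    """Sum cholera_drate for each distinct value of the column `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += float(row["cholera_drate"])
    return dict(totals)

# Parse the CSV text into a list of dictionaries, one per row.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

region_totals = sum_by(rows, "region")      # would size the outer rectangles
district_totals = sum_by(rows, "district")  # would size the inner rectangles

print(region_totals)
print(district_totals)
```

With the real data, you could replace `io.StringIO(SAMPLE_CSV)` with an open file handle for the CSV you downloaded, assuming it uses the same three column names.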

The first step is to select the cholera_drate field, as shown in the image below. You will notice that Power BI automatically creates a bar chart with the sum of all death rates.

Now, click on the bar chart and select the option Treemap in the Visualization tab, as shown in the image below.

The next step is to define which variable will determine the branches of our treemap, that is, the more general category. In our case, it is region. Finally, we define the field determining the leaves of our treemap. In this example, the leaves are the districts inside each region of London. Drag these two fields to category and details as shown below.
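The branch and leaf structure described above can be sketched in Python as a nested dictionary, where each leaf's share of the total determines the relative area of its rectangle. This is a rough illustration of the idea under stated assumptions, not a description of how Power BI lays out the chart: the records are invented, and the function names `build_tree` and `area_shares` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (region, district, death rate) records; the values are invented.
records = [
    ("West", "District A", 30.0),
    ("West", "District B", 10.0),
    ("South", "District C", 120.0),
    ("South", "District D", 40.0),
]

def build_tree(records):
    """Group leaves (districts) under their branch (region)."""
    tree = defaultdict(dict)
    for region, district, value in records:
        tree[region][district] = tree[region].get(district, 0.0) + value
    return dict(tree)

def area_shares(tree):
    """Return each leaf's fraction of the grand total, keyed by (branch, leaf)."""
    total = sum(v for leaves in tree.values() for v in leaves.values())
    return {
        (branch, leaf): value / total
        for branch, leaves in tree.items()
        for leaf, value in leaves.items()
    }

tree = build_tree(records)
shares = area_shares(tree)

for (branch, leaf), share in shares.items():
    print(f"{branch} / {leaf}: {share:.1%} of the treemap area")
```

The category field plays the role of the outer dictionary keys and the details field the role of the inner keys, which is why swapping the two fields would change which rectangles are nested inside which.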

That’s it! Without a single line of code, you created a treemap that offers a great visual of London cholera death rates by region and district. Power BI even generates tooltips automatically, providing additional information about each leaf in your tree. You can further format your plot to have your desired colors, fonts and sizes. Read more about how to format a visualization on this page. Below you see the formatted version of the treemap.

As you have seen, it is easier to plot a treemap in Power BI than in R. On the other hand, Power BI's customization options are limited compared to R's. If you have any questions or comments, feel free to write below. I wish you a great learning journey!

6. Conclusions

Both R and Power BI are great tools for data analysis. While R is more suitable for complex and academic applications, Power BI is user-friendly and produces beautiful visualizations with drag-and-drop actions;
Deciding whether to use R or Power BI depends on your goals and requirements, and the two tools can complement each other to produce effective results.