Text Data Visualization
What you will learn
- Learn to load and tokenize text data in Python;
- Be able to clean your data to retain only the relevant information;
- Learn to count words in a list;
- Be comfortable with visualizing text data;
Table of Contents
- Data source
- Coding the past: text data visualization
“Words have no power to impress the mind without the exquisite horror of their reality.”
Edgar Allan Poe
Have you ever found yourself submerged in text data, your eyes scanning countless words as you try to extract meaningful insights for your research? Text data visualization could be the solution you’re seeking. In our modern world, textual data, be it from historical documents or the latest tweets, has become a deep well of knowledge just waiting to be discovered.
Whether you’re tracing societal trends over time or studying the latest social media topics, analyzing and visualizing text data can be a gold mine. In this lesson, we’ll guide you on how to navigate this rich universe of words. Harnessing the strength of the Natural Language Toolkit (NLTK) and the Matplotlib library, we’ll delve into strategies for text data visualization and analysis, illuminating new angles for your research.
The data used in this lesson is available on the Oxford Text Archive website. It consists of a collection of pamphlets published between 1750 and 1776 by influential authors in the British colonies. These pieces depict the debate with England over constitutional rights, showing the colonists’ understanding of their contemporary events and the conditions that precipitated the American Revolution. In this lesson, we will focus on the pamphlets of Oxenbridge Thacher, James Otis, and James Mayhew. To learn more about textual data sources, check this post: ‘Where to find and how to load historical data’.
Coding the past: text data visualization
1. Import a text file into Python
To load text files in Python and reuse our code, we can build a function. Before we write the function, we load all the libraries needed for this lesson.
The `with` statement ensures that the opened file is closed when the block inside it finishes. Note that we use the “latin-1” encoding. The `islice()` function creates an iterable object, and a `for` loop is used to slice the file into chunks (lines); each line is appended to a list of lines.
`word_tokenize` is a function from the NLTK library that splits a sentence into words. All the sentences are then split into words and stored in a list. Note that the list needs to be flattened into a single list, since the tokenizer returns a list of lists. This is done with a list comprehension.
Now we load the manifests of three authors: Oxenbridge Thacher, James Otis, and James Mayhew. The results are stored in three lists, one per author.
If you check the length of the lists, you will see that Oxenbridge Thacher’s manifest has 4,156 words; James Mayhew’s, 18,969; and James Otis’s, 34,031.
2. Understand NLTK stopwords
In this function, we will use NLTK stopwords to remove all words that do not add any meaning to our analysis. Moreover, we transform all characters to lowercase and remove all words containing two or fewer characters.
We apply the function to the three lists of words. After the cleaning process, the number of words is reduced to less than 50% of the original size.
3. Word counter in Python
The function below counts the frequency of each word and returns a dataframe with the words and their frequencies, sorted by the frequency.
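One way to implement such a counter is with `collections.Counter` and pandas; the function and column names here are illustrative rather than the lesson's own:

```python
from collections import Counter

import pandas as pd

def count_words(words):
    """Return a dataframe of words and their frequencies,
    sorted from most to least frequent."""
    counts = Counter(words)  # maps each word to its number of occurrences
    df = pd.DataFrame(counts.items(), columns=["word", "frequency"])
    return df.sort_values("frequency", ascending=False).reset_index(drop=True)
```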
4. Word count visualization
We will use the `matplotlib` library to create a bar plot with the 10 most frequent words in each manifest. We use `iloc` to select the first 10 rows of each dataframe. `barh` creates a horizontal bar plot with the words on the y-axis and the frequency on the x-axis. After that, we set the title of each plot and perform a series of adjustments, including removing the grid, stripping part of the frame, and changing the font and background colors. Finally, we use the tight layout function to adjust the spacing between the plots.
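A sketch of this plotting step is shown below. The frequency tables here are small made-up examples; in the lesson each dataframe comes from the counting step:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show plots
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative frequency tables, one per author (invented values).
manifests = {
    "Thacher": pd.DataFrame({"word": ["colonies", "rights", "trade"],
                             "frequency": [40, 25, 18]}),
    "Otis": pd.DataFrame({"word": ["rights", "law", "colonies"],
                          "frequency": [60, 44, 30]}),
    "Mayhew": pd.DataFrame({"word": ["god", "king", "power"],
                            "frequency": [80, 50, 35]}),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (author, df) in zip(axes, manifests.items()):
    # Select up to the 10 most frequent words with iloc.
    top = df.sort_values("frequency", ascending=False).iloc[:10]
    # barh puts words on the y-axis and frequency on the x-axis.
    ax.barh(top["word"], top["frequency"])
    ax.set_title(author)
    ax.grid(False)  # remove the grid
    for side in ("top", "right"):  # strip part of the frame
        ax.spines[side].set_visible(False)
plt.tight_layout()  # adjust spacing between the plots
```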
5. Calculate the proportion of each word and compare the manifests
Finally, we calculate the proportion of each word in each manifest relative to the total number of words in that document and store it in a new column called “proportion”. We also create two new dataframes, one for each pair of manifests: one to compare Thacher and Otis, and the other to compare Thacher and Mayhew. This is done with an outer join, using the `word` column as the key. This operation keeps all the words, even those that do not appear in both datasets, and fills the missing values with 0.
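With pandas, the proportion column and the outer join can be sketched as below. The two tiny frequency tables are invented stand-ins for the dataframes produced by the counting step:

```python
import pandas as pd

# Illustrative frequency tables (one row per word, invented values).
thacher = pd.DataFrame({"word": ["colonies", "rights"], "frequency": [40, 25]})
otis = pd.DataFrame({"word": ["rights", "god"], "frequency": [60, 15]})

# Proportion of each word relative to the document's total word count.
for df in (thacher, otis):
    df["proportion"] = df["frequency"] / df["frequency"].sum()

# Outer join on 'word' keeps words present in either manifest;
# missing values (words absent from one text) become 0.
compare = thacher.merge(otis, on="word", how="outer",
                        suffixes=("_thacher", "_otis")).fillna(0)
```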
Now we will compare the three manifests by plotting the proportion of each word in Thacher on the x-axis and the proportion of the same word in Otis on the y-axis. We will use the `scatter` function to create a scatter plot in which the coordinates of each point are the proportions of a given word in Thacher and Otis, and the `annotate` function to label each point with its word. The same procedure will be used to compare Thacher and Mayhew. Note that the more similar the manifests, the more points will be concentrated along the diagonal line (same frequency in both manifests).
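The scatter-and-annotate step might look like this sketch; the merged table here is a small invented example standing in for the result of the outer join:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to show plots
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative merged table with per-document word proportions (invented values).
compare = pd.DataFrame({
    "word": ["colonies", "rights", "god"],
    "proportion_thacher": [0.040, 0.030, 0.001],
    "proportion_otis": [0.035, 0.040, 0.002],
})

fig, ax = plt.subplots()
# Each point's coordinates are the word's proportions in the two manifests.
ax.scatter(compare["proportion_thacher"], compare["proportion_otis"])
for _, row in compare.iterrows():
    # Label each point with its word.
    ax.annotate(row["word"], (row["proportion_thacher"], row["proportion_otis"]))
# Diagonal reference line: equal proportion in both texts.
ax.plot([0, 0.05], [0, 0.05], linestyle="--")
ax.set_xlabel("Proportion in Thacher")
ax.set_ylabel("Proportion in Otis")
```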
This text data visualization highlights that Thacher and Otis are more similar to each other than Thacher and Mayhew: the points are more concentrated along the diagonal in the plot relating Thacher and Otis than in the one relating Thacher and Mayhew. This is a simple way to compare the similarity of two texts. We know, for example, that while Thacher talks a lot about “colonies”, Mayhew talks a lot about “god”.
- You can tokenize text data with the NLTK `word_tokenize` function;
- With list comprehensions, you can treat text to eliminate irrelevant characters and words;
- Matplotlib is an excellent option for text data visualization.