Text Analysis for Literature and Beyond!

Text analysis and literature make excellent companions, and R is a great choice for analyzing your literary (or other) data! Today we will work with a few literary texts and learn how to get basic frequencies, concordance lines for analyzing keywords in context, dispersion plots, and sentiment.

Install and load the necessary libraries to get and analyze the data.

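The loading chunk itself is not shown, so here is a minimal sketch of the library calls, inferred from the functions used and the startup messages printed below; adjust it to match your own setup.

#load the libraries (install any missing ones first with install.packages())
library(readtext)    #readtext()
library(quanteda)    #corpus(), kwic(), dfm(), textplot_xray(), textplot_wordcloud()
library(tokenizers)  #tokenize_words()
library(tidyverse)   #tibble(), count(), filter(), mutate(), ggplot()
library(tidytext)    #unnest_tokens(), stop_words
library(tm)          #text mining utilities
library(zoo)         #rollmean()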
## Package version: 2.1.2
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following objects are masked from 'package:quanteda':
## 
##     meta, meta<-
## 
## Attaching package: 'tm'
## The following objects are masked from 'package:quanteda':
## 
##     as.DocumentTermMatrix, stopwords

#Get your dataset. Today we will analyze Frankenstein by Mary Shelley and Jane Eyre by Charlotte Bronte.

frank <- texts(readtext("https://www.gutenberg.org/files/84/84-0.txt"))
View(frank)
names(frank) <- "Frankenstein"
#tokenize the text (the tokenizers package also provides functions for other units, e.g. by sentence or paragraph; see the short sketch below)
frank_tokens <- tokenize_words(frank)
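
As a quick illustration of those other units, the tokenizers package also includes tokenize_sentences() and tokenize_paragraphs() (a sketch; the object names are only for illustration and are not used again below):

#tokenize the same text by sentence and by paragraph
frank_sentence_tokens <- tokenize_sentences(frank)
frank_paragraph_tokens <- tokenize_paragraphs(frank)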

Now we are ready to do some basic analyses! First we will use count() and other tidyverse functions to get word frequencies.

#turn into a tibble object to get at word frequencies
franktibble <- tibble(word = frank_tokens[[1]])
View(franktibble)
#count the words
frankcount <- franktibble %>%
  count(word, sort = TRUE) 
View(frankcount)

#get multiword units: bigrams and trigrams from the text
Multiword units are useful for identifying frequent and dominant content in the data and for analyzing how recurring patterns contribute to overall themes and discourses. They help organize discourse, introduce new content, and structure the text.

frank_bigrams <- tibble(text = frank) %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
#show by frequency 
frank_bigrams %>%
count(bigram, sort = TRUE)
frank_trigrams <- tibble(text = frank) %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
#by frequency 
frank_trigrams %>%
count(trigram, sort = TRUE)
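
Many of the top bigrams and trigrams are pairs of function words. If you want to filter those out, one possible approach (a sketch using tidyr's separate() and the tidytext stop_words list, which is also applied to single words in the next step) is:

#split each bigram into its two words and drop pairs containing a stopword
frank_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)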

#applying a stopword list

frankcount_cleaned <- frankcount %>% anti_join(stop_words)
View(frankcount_cleaned)
#plot the remaining content words by frequency (a higher cutoff gives a more readable plot)
frankcount_cleaned %>%
  filter(n > 3) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) + geom_col() + labs(y = NULL)
#plot the top content words, with rotated axis labels and a title
ggplot(frankcount_cleaned %>% filter(n > 35) %>%
  mutate(word = reorder(word, n)), aes(word, n)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Top Content Words in Frankenstein")

#now some cool methods from quanteda!

#first data prep: turn the frank text into a quanteda corpus object

frankencorpus <- corpus(frank)
#get info about the corpus
summary(frankencorpus)

#quanteda offers many methods and options for analyzing your dataset.

jane <- texts(readtext("https://www.gutenberg.org/files/1260/1260-0.txt"))
names(jane) <- "Jane Eyre"

janecorpus <- corpus(jane)

#get info about the corpus
summary(janecorpus)
#add two corpora together 
bothcorpora <- janecorpus + frankencorpus
summary(bothcorpora)

#Investigating and Understanding the texts through the kwic function
Analyzing concordance lines has a long history in text analysis; it supports closer readings of linguistic patterns, rhetorical meanings, and the larger themes and discourses in the text(s). In quanteda, the kwic() function produces keyword-in-context lines.

kwic(frankencorpus, "terror")
kwic(frankencorpus, "lady")
kwic(frankencorpus, "creature")
kwic(janecorpus, "lady")
kwic(janecorpus, "lady", valuetype = "regex")
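
By default, kwic() shows five tokens of context on each side of the keyword; the window argument widens that (a small sketch):

#show more context around each hit
kwic(janecorpus, "lady", window = 10)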

We can also use quanteda to get the dispersion of specific token(s) over novel time:

#lexical dispersion plot with quanteda
textplot_xray(
     kwic(janecorpus, pattern = "lady"),
     kwic(janecorpus, pattern = "sir")
)

#now with the frank corpus
textplot_xray(
     kwic(frankencorpus, pattern = "science"),
     kwic(frankencorpus, pattern = "terror")
)

#testing Jockers's sentiment analysis package (syuzhet) for literary analysis
Jockers has written extensively about analyzing sentiment over novel time, using Vonnegut’s work as an example.

library(syuzhet)
library(zoo)

#get the sentences from frank to obtain raw sentiment values
frank_sentences_v <- get_sentences(frank)
frank_sentiments_v <- get_sentiment(frank_sentences_v)

#plot it
plot(
frank_sentiments_v, type = "l",
xlab = "Novel Time", ylab = "Sentiment",
main = "Raw Sentiment Values in Frankenstein" )
#testing different methods of smoothing; this one uses a rolling mean
frank_window <- round(length(frank_sentiments_v)*.1)
#gets rolling mean values for sentiments using the window above 
frank_rolled <- rollmean(frank_sentiments_v, k = frank_window)
#this creates a scaled vector of values from 0 to 1, making it easier to compare different plots
frank_scaled <- rescale_x_2(frank_rolled)

plot(frank_scaled$x, frank_scaled$z,
type="l",
col="blue",
xlab="Narrative Time",
ylab="Emotional Valence",
main = "Frankenstein with Rolling Means"
)
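
syuzhet also provides other smoothing methods; for example, get_dct_transform() applies a low-pass filter based on the discrete cosine transform. The following is a sketch using roughly the package defaults; the parameters can be tuned.

#alternative smoothing: low-pass filtering via the discrete cosine transform
frank_dct <- get_dct_transform(frank_sentiments_v, low_pass_size = 5,
                               x_reverse_len = 100, scale_range = TRUE)
plot(frank_dct, type = "l",
     xlab = "Narrative Time", ylab = "Emotional Valence",
     main = "Frankenstein with DCT Smoothing")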

#creating and defining a dictionary with quanteda, useful if we are interested in different groups of words in Frankenstein

frank_dict <- dictionary(list(terror = c("terror", "darkness", "horror", "power"),
                          affection = c("affection", "father", "light", "joy")))

frankdfmwdict <- dfm(frankencorpus, dictionary = frank_dict)
frankdfmwdict
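
Raw dictionary counts depend on document length. One way to put them on a relative scale (a sketch using quanteda's dfm_weight() and dfm_lookup(); the object name is only illustrative) is to weight the full dfm to proportions before applying the dictionary:

#weight token counts to proportions, then map them onto the dictionary categories
frank_dfm_prop <- dfm_weight(dfm(frankencorpus), scheme = "prop")
dfm_lookup(frank_dfm_prop, dictionary = frank_dict)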

#using quanteda dfm to get top features

jane_dfm <- dfm(janecorpus, remove = stopwords("english"), remove_punct = TRUE)
jane_dfm
topfeatures(jane_dfm, 20)

#create a wordcloud of top tokens
textplot_wordcloud(jane_dfm, min_count = 6, random_order = FALSE,
                   rotation = .25, 
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

Works Cited
Biber, Douglas. 2011. Corpus linguistics and the study of literature: Back to the future? Scientific Study of Literature.
Bird, Steven, Ewan Klein, and Edward Loper. 2019. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.
Blaette, Andreas. 2020. Introducing the ‘polmineR’-package.
Brezina, Vaclav. 2018. Statistics in Corpus Linguistics.
Brown, Simon. 2016. Tips for Computational Text Analysis.
Bussiere, Kirsten. 2018. Digital Humanities - A Primer.
Cohen Minnick, Lisa. 2004. Dialect and Dichotomy: Literary Representations of African American Speech.
Evert, Stefan and Andrew Hardie. 2011. Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. In Proceedings of the Corpus Linguistics 2011 conference, University of Birmingham, UK.
Evert, Stefan. 2003. The CQP Query Language Tutorial.
Evert, Stefan. 2007. Corpora and collocations.
Feinerer et al. 2008.
Feinerer, Ingo. 2020. Introduction to the tm Package: Text Mining in R.
Fischer-Starke, Bettina. 2010. Corpus Linguistics in Literary Analysis: Jane Austen and Her Contemporaries.
Firth, JR. 1957. Papers in Linguistics. London: OUP.
Han, Na-Rae. Python 3 tutorials.
HathiTrust. https://www.hathitrust.org/about.
Jockers, Matthew. 2020. Introduction to the Syuzhet Package.
Kuiper, Katie Ireland. 2021. Text Analysis Glossary. DigiLab.
Kretzschmar, William, C. Darwin, C. Brown, D. Rubin, D. Biber. Looking for the Smoking Gun: Principled Sampling in Creating the Tobacco Industry Documents Corpus. Journal of English Linguistics. 32:1.
Laudun, John. Text Analytics 101.
Loria, Steven. 2020. TextBlob: Simplified Text Processing.
2020. Modern Perl: Why Perl Rules for Text.
Millot, Thomas. Photo. Unsplash
O’Connor, Brendan, David Bamman, and Noah Smith. 2011. Computational Text Analysis for Social Science: Model Assumptions and Complexity.
Parlante, Nick. 2002. Essential Perl. http://cslibrary.stanford.edu/108/EssentialPerl.html.
Project Gutenberg. https://www.gutenberg.org
Sankoff, D. & Sankoff, G. Sample survey methods and computer-assisted analysis in the study of grammatical variation. In Darnell R. (ed.) Canadian Languages in their Social Context Edmonton: Linguistic Research Incorporated. 1973. 7–64.
Wiedemann, Gregor & Niekler, Andreas. 2017. Hands-on: A five day text mining course for humanists and social scientists in R. Proceedings of the 1st Workshop on Teaching NLP for Digital Humanities ( 2017), Berlin.
Witten, Ian. 2004. Text mining.