Text Analysis for Literature and Beyond!

Text analysis and literature make excellent companions, and R is a great choice for analyzing literary (or other types of) data! Today we will work with a few literary texts and learn how to get basic frequencies, concordance lines for analyzing keywords in context, dispersion, and sentiment.

Install and load the necessary libraries to get and analyze the data.
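
The setup code is not shown here, so the following is only a minimal sketch of what it might look like. The package list is an assumption, inferred from the functions called later in this walkthrough; adjust it to match your own installation.

```r
# Assumed package list, inferred from the functions used later in this tutorial
pkgs <- c("readtext", "tokenizers", "tidyverse", "tidytext",
          "quanteda", "quanteda.textplots", "syuzhet", "zoo")

# Install only the packages that are not already present
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

# Load everything; character.only = TRUE lets library() accept strings
invisible(lapply(pkgs, library, character.only = TRUE))
```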

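#Get your dataset. Today we will analyze Frankenstein by Mary Shelley and Jane Eyre by Charlotte Brontë.

#readtext() returns a data frame; its text column holds the whole novel as one string
#(texts() is deprecated in recent quanteda releases, so we read the column directly)
frank <- readtext("https://www.gutenberg.org/files/84/84-0.txt")$text
names(frank) <- "Frankenstein"
#tokenize the text into words (the tokenizers package can also split in other ways, e.g. tokenize_sentences() or tokenize_paragraphs())
frank_tokens <- tokenize_words(frank)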

Now we are ready to do some basic analyses! First we will use count() and other functions from the tidyverse to get word frequencies.

#turn into a tibble object to get at word frequencies
franktibble <- tibble(word = frank_tokens[[1]])
#count the words
frankcount <- franktibble %>%
  count(word, sort = TRUE) 

#get multiword units: bigrams and trigrams from the text
Multiword units are useful for identifying frequent, dominant content in the data and for analyzing how recurring patterns contribute to overall themes and discourses. They also help organize and structure discourse and introduce new content.
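
To make concrete what a bigram count involves, here is a small sketch in base R. The sentence is a made-up toy example, not text from either novel, and the whitespace split is a simplification of what a real tokenizer does.

```r
# Toy sentence (not from the novels), lowercased and split on spaces
toy_tokens <- strsplit(tolower("the old man and the old dog"), " ")[[1]]

# Pair each token with the one that follows it to form bigrams
bigrams <- paste(head(toy_tokens, -1), tail(toy_tokens, -1))

# Tabulate the bigrams and sort by frequency
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(bigram_counts)
```

Trigrams work the same way with a window of three tokens instead of two.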

frank_bigrams <- tibble(text = frank) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
#show by frequency
frank_bigrams %>%
  count(bigram, sort = TRUE)

frank_trigrams <- tibble(text = frank) %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
#show by frequency
frank_trigrams %>%
  count(trigram, sort = TRUE)

#applying a stopword list

#drop common function words using the stop_words list from tidytext
frankcount_cleaned <- frankcount %>% anti_join(stop_words, by = "word")
#frankcount was already sorted, so the cleaned counts stay in frequency order
frankcount_cleaned
#horizontal bar chart of the remaining words
frankcount_cleaned %>% filter(n > 3) %>% mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) + geom_col() + labs(y = NULL)
#vertical bar chart of the most frequent words
ggplot(frankcount_cleaned %>% filter(n > 35) %>%
  mutate(word = reorder(word, n)), aes(word, n)) + geom_bar(stat = "identity")

#the same chart with rotated axis labels and a title
ggplot(frankcount_cleaned %>% filter(n > 35) %>%
  mutate(word = reorder(word, n)), aes(word, n)) + geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Top Content Words in Frankenstein")

#now some cool methods from quanteda!

#first data prep: turn the frank text into a quanteda corpus object

frankencorpus <- corpus(frank)
#get info about the corpus (summary() reports types, tokens, and sentences per document)
summary(frankencorpus)

#quanteda offers many methods and options for analyzing your dataset.

#texts() is deprecated in recent quanteda releases, so we read the text column directly
jane <- readtext("https://www.gutenberg.org/files/1260/1260-0.txt")$text
names(jane) <- "Jane Eyre"

janecorpus <- corpus(jane)

#get info about the corpus (summary() reports types, tokens, and sentences per document)
summary(janecorpus)
#add two corpora together 
bothcorpora <- janecorpus + frankencorpus

#Investigating and Understanding the texts through the kwic function
Analyzing concordance lines has a long history in text analysis. Reading a keyword alongside its immediate context helps reveal linguistic patterns, rhetorical meanings, and the larger themes and discourses of the text(s).

Keyword-in-context function

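#recent quanteda releases expect a tokens object rather than a corpus, so we tokenize first
kwic(tokens(frankencorpus), "terror")
kwic(tokens(frankencorpus), "lady")
kwic(tokens(frankencorpus), "creature")
kwic(tokens(janecorpus), "lady")
#with valuetype = "regex" the pattern matches any token that contains "lady"
kwic(tokens(janecorpus), "lady", valuetype = "regex")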
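
To show what a concordance line actually contains, here is a rough base R sketch of the idea. `simple_kwic()` is a hypothetical helper written for illustration only, not part of quanteda, and the sentence is a toy example rather than text from the novels.

```r
# Hypothetical helper: for every match of `keyword`, return up to `window`
# tokens of context on each side
simple_kwic <- function(tokens, keyword, window = 2) {
  hits <- which(tolower(tokens) == tolower(keyword))
  idx <- seq_along(tokens)
  # Collect the tokens whose positions fall between `from` and `to`
  context <- function(from, to) {
    paste(tokens[idx >= from & idx <= to], collapse = " ")
  }
  data.frame(
    position = hits,
    pre = vapply(hits, function(i) context(i - window, i - 1), character(1)),
    keyword = tokens[hits],
    post = vapply(hits, function(i) context(i + 1, i + window), character(1)),
    stringsAsFactors = FALSE
  )
}

# Toy sentence split on spaces
toy <- strsplit("the lady saw another lady in the garden", " ")[[1]]
result <- simple_kwic(toy, "lady")
print(result)
```

The quanteda function does the same kind of windowing, with proper tokenization and pattern matching on top.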

We can also use quanteda to get the dispersion of specific token(s) over novel time:

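#lexical dispersion plot with quanteda (textplot_xray() lives in quanteda.textplots as of quanteda 3)
textplot_xray(
     kwic(tokens(janecorpus), pattern = "lady"),
     kwic(tokens(janecorpus), pattern = "sir")
)

#now with the frank corpus
textplot_xray(
     kwic(tokens(frankencorpus), pattern = "science"),
     kwic(tokens(frankencorpus), pattern = "terror")
)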

#testing Jockers's sentiment analysis package (syuzhet) for literary analysis
Jockers has written extensively about analyzing sentiment over novel time, using Vonnegut’s work as an example.


#get the sentences from frank to obtain raw sentiment values
frank_sentences_v <- get_sentences(frank)
frank_sentiments_v <- get_sentiment(frank_sentences_v)

#plot it
plot(
  frank_sentiments_v, type = "l",
  xlab = "Novel Time", ylab = "Sentiment",
  main = "Raw Sentiment Values in Frankenstein"
)
#testing different methods of smoothing; this one uses a rolling mean
#the window is 10% of the number of sentences
frank_window <- round(length(frank_sentiments_v)*.1)
#gets rolling mean values for sentiments using the window above (rollmean() is from the zoo package)
frank_rolled <- rollmean(frank_sentiments_v, k = frank_window)
#this creates a scaled vector of values from 0 to 1, making it easier to compare different plots
frank_scaled <- rescale_x_2(frank_rolled)

plot(frank_scaled$x, frank_scaled$z,
xlab="Narrative Time",
ylab="Emotional Valence",
main = "Frankenstein with Rolling Means"
)
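
To see what the rolling mean does to the raw values, here is a small base R sketch. The scores are made up for illustration, and `stats::filter()` stands in for the rolling-mean call used above; the two should agree on the values where the full window fits.

```r
# Made-up sentiment scores, one per "sentence"
scores <- c(1, -1, 2, 0, 3, -2, 1)
k <- 3  # window width

# Equal weights give a centred moving average; positions where the
# window does not fit come back as NA
rolled <- as.numeric(stats::filter(scores, rep(1 / k, k), sides = 2))

# Drop the NA ends, leaving length(scores) - k + 1 smoothed values
rolled_trimmed <- rolled[!is.na(rolled)]
print(round(rolled_trimmed, 3))
```

A wider window smooths more aggressively but leaves fewer values, which is why the window above is set as a proportion of the number of sentences.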

#creating and defining a dictionary using quanteda, useful if we are interested in different groups of words in Frankenstein

frank_dict <- dictionary(list(terror = c("terror", "darkness", "horror", "power"),
                          affection = c("affection", "father", "light", "joy")))

#recent quanteda releases look up the dictionary on a tokens object before building the dfm
frankdfmwdict <- dfm(tokens_lookup(tokens(frankencorpus), dictionary = frank_dict))
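
Conceptually, a dictionary lookup counts how many tokens fall into each named group of words. Here is a base R sketch of that idea; the shortened dictionary and the token vector are toy assumptions for illustration, not output from the novel.

```r
# Toy dictionary: each entry names a category and lists its words
toy_dict <- list(terror = c("terror", "horror"),
                 affection = c("joy", "father"))

# Toy tokens (made up, already lowercased)
toy_tokens <- c("the", "horror", "and", "the", "joy", "of", "terror")

# For each category, count the tokens that appear in its word list
category_counts <- vapply(toy_dict,
                          function(words) sum(toy_tokens %in% words),
                          numeric(1))
print(category_counts)
```

The resulting named vector plays the same role as one row of a dictionary-based document-feature matrix.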

#using quanteda dfm to get top features

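#tokenize first, then remove punctuation and stopwords before building the dfm
jane_dfm <- tokens(janecorpus, remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  dfm()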
topfeatures(jane_dfm, 20)

#create a wordcloud of top tokens
textplot_wordcloud(jane_dfm, min_count = 6, random_order = FALSE,
                   rotation = .25,
                   color = RColorBrewer::brewer.pal(8,"Dark2"))

