This is the last seminar in our Text Analysis 101 series, on applications for analyzing social media! Today we will learn how to run some basic corpus functions on two social media corpora and use different dictionaries for sentiment analysis. Many text analysis methods are useful for social media data; which ones you choose depends on your goals and research question(s)!
Today we will also learn how to create several visualizations that are common in text analysis, including frequency plots, as well as visualizations specific to social media data.
Last week, we learned how to utilize R packages to get data from Reddit, Twitter, and other types of social media. Check out the lecture and handout here on the DigiLab Resources page.

Install and initialize the packages we are using today.
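
The package list itself is not shown in this handout, so the following is only a sketch. It assumes the set of packages implied by the functions used below (quanteda and its plotting and statistics add-ons, the tidyverse pieces, tidytext, textdata for the AFINN lexicon, rtweet, igraph, and ggraph). Adjust the list to match your own setup.

```r
# Assumed package list, inferred from the functions used in this handout
pkgs <- c("quanteda", "quanteda.textplots", "quanteda.textstats",
          "dplyr", "tidyr", "stringr", "ggplot2",
          "tidytext", "textdata", "rtweet", "igraph", "ggraph")

# Install anything that is missing, then attach everything
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
invisible(lapply(pkgs, library, character.only = TRUE))
```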


The first dataset is from Goodreads, a social media website where members share and review books and connect with other members. Goodreads has book reviews, recommendations, and ratings that may help librarians and readers select relevant books. This dataset from Goodreads includes a variety of inspirational quotes, with helpful metadata including author and publication information (Verma 2021). You can download this data directly from here.

#upload the data; basic quanteda functions

#insert your file path
quotes <- read.csv("/Users/kikuiper/Documents/data_dh/quotes.csv")
#turn it into a quanteda corpus; use dplyr if you want to adjust what metadata is included in the corpus object
goodreads <- quanteda::corpus(quotes$quote)

#basic functions:
#search for a particular token, phrase
#get top words
goodreads_tokens <- tokens(goodreads)
kwic_love <- kwic(goodreads_tokens, pattern =  "love")
View(kwic_love) #this will open in viewer window
#Note: the pattern can be adjusted to include different options
#multiple keywords can be searched for, as below:
kwic_multiple <- kwic(goodreads_tokens, pattern = c("love", "life"))
#use window argument to adjust number of words on either side
kwic3 <- kwic(goodreads_tokens, pattern = "life", window = 4)
#use pattern = phrase("insert phrase*") to look for multi-word phrases; the * wildcard matches different endings
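
To make the phrase() option concrete, here is a minimal sketch on a made-up three-quote corpus. The object names (`toy_quotes`, `toy_tokens`, `kwic_phrase`) and the quotes themselves are hypothetical stand-ins for the Goodreads data.

```r
library(quanteda)

# Hypothetical three-quote corpus standing in for `goodreads`
toy_quotes <- corpus(c(
  q1 = "Love what you do every day.",
  q2 = "Only true love lasts a lifetime.",
  q3 = "True lovers of books never sleep."
))
toy_tokens <- tokens(toy_quotes)

# phrase() treats the pattern as a sequence of tokens;
# "love*" is a glob that matches "love", "lovers", and so on
kwic_phrase <- kwic(toy_tokens, pattern = phrase("true love*"), window = 3)
print(kwic_phrase)
```

Because matching is case-insensitive by default, both "true love" and "True lovers" are picked up, while the single word "Love" in the first quote is not.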

#visualize distribution of a particular token with xray plot
kwic(goodreads_tokens, pattern = "love") %>%
  textplot_xray()

More options with quanteda, including a word cloud visualization!

#create a dfm object to prepare for a wordcloud visualization
#(quanteda 3 and later: tokenize first, remove stopwords, then build the dfm)
dfm_quotes <- tokens(goodreads, remove_punct = TRUE) %>%
  tokens_remove(stopwords('english')) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 3, verbose = FALSE)

#word cloud time
#adjust settings, add in specific colors
textplot_wordcloud(dfm_quotes, min_count = 25,
                   color = c('red', 'pink', 'green', 'purple', 'orange', 'blue'))

#get the 100 most frequent tokens
features_dfm_goodreads <- textstat_frequency(dfm_quotes, n = 100)

# Sort by reverse frequency order
features_dfm_goodreads$feature <- with(features_dfm_goodreads, reorder(feature, -frequency))

#what do you notice about this plot?
ggplot(features_dfm_goodreads, aes(x = feature, y = frequency)) +
  geom_point() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
#make it smaller by creating another subset
topten <- textstat_frequency(dfm_quotes, n = 10)
#reorder by token frequency
topten$feature <- with(topten, reorder(feature, -frequency))
#plot it
ggplot(topten, aes(x = feature, y = frequency)) +
  geom_point() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
#other helpful functions include corpus_subset() which
#allows you to select based on metadata aspects (like author or location)
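
As a rough illustration of corpus_subset(), the sketch below builds a tiny corpus that keeps an author column as metadata and then selects one author. The data frame, its column names, and the author labels are all hypothetical; the real quotes.csv may use different column names.

```r
library(quanteda)

# Hypothetical data frame standing in for `quotes`
toy_df <- data.frame(
  quote  = c("Read more books.", "Plant a garden.", "Write every day."),
  author = c("Author A", "Author B", "Author A"),
  stringsAsFactors = FALSE
)

# Passing the whole data frame keeps the other columns as document variables
toy_corpus <- corpus(toy_df, text_field = "quote")

# Keep only the documents whose author metadata matches
author_a <- corpus_subset(toy_corpus, author == "Author A")
print(ndoc(author_a))
print(docvars(author_a, "author"))
```

Note that `quanteda::corpus(quotes$quote)` passes only the text vector, so no metadata is attached; passing the data frame with `text_field` is one way to keep it.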

Now, onto the Twitter data!

#first get the data
garden <- search_tweets("gardening", n = 35, include_rts = FALSE)

#use this syntax to get the number of different locations (or other metadata options)
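
The code for this step is not shown in the handout. One possible approach, sketched here on a made-up data frame (`toy_garden` is a hypothetical stand-in for `garden`, which needs Twitter API access to create), is to count distinct values with dplyr:

```r
library(dplyr)

# Hypothetical stand-in for the `garden` tweets data frame
toy_garden <- data.frame(
  screen_name = c("user1", "user2", "user3", "user4"),
  location    = c("Athens, GA", "Atlanta, GA", "Athens, GA", ""),
  stringsAsFactors = FALSE
)

# Number of different locations (an empty string counts as its own value)
n_locations <- n_distinct(toy_garden$location)

# Tweets per location, most common first
location_counts <- toy_garden %>% count(location, sort = TRUE)
print(n_locations)
print(location_counts)
```

The same two calls should work on `garden$location`, assuming the column is named `location` as in the plot below.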

#plot the different locations
#labs() follows the original aesthetics, so x is still location after coord_flip()
garden %>%
  ggplot(aes(location)) +
  geom_bar() + coord_flip() +
  labs(x = "Location",
       y = "Count",
       title = "Locations in Garden Tweets")

Other helpful options with tidytext

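#use the unnest function to get all the words separated out for frequency analysis
#note: token = "tweets" is only available in older tidytext releases (before 0.4.0, as far as we know);
#with newer versions, drop the token argument to use the default word tokenizer
tidy_tweets <- garden %>% unnest_tokens(word, text, token = "tweets")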
#check out the data!

#subset the data by screen name, counts, and word tokens
groups <- tidy_tweets %>% group_by(screen_name, word) %>% summarize(count=n())

#get frequency counts
frequency <- tidy_tweets %>% count(word, sort = T)

#a few more cleaning options: lowercase all, implement stopwords
withstopwords <- tidy_tweets %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]"))

withstopwords <- withstopwords %>% count(word, sort = T)
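
To see what the tokenizing, stopword filtering, and counting steps do, here is a small sketch on two made-up tweets. `toy_tweets` and `toy_counts` are hypothetical names, and the default word tokenizer is used instead of token = "tweets".

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical two-tweet data frame standing in for `garden`
toy_tweets <- tibble(
  status_id = c("1", "2"),
  text = c("The garden is full of tomatoes",
           "I planted 12 tomatoes in the garden")
)

toy_counts <- toy_tweets %>%
  unnest_tokens(word, text) %>%             # one lowercased word per row
  filter(!word %in% stop_words$word,        # drop stopwords such as "the"
         str_detect(word, "[a-z]")) %>%     # drop tokens with no letters, such as "12"
  count(word, sort = TRUE)
print(toy_counts)
```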

#now plot it
withstopwords %>% filter(n > 1) %>%  ggplot(aes(x = reorder(word, -n), y = n)) +
  geom_col() +
  labs(x = "word",
       y = "count",
       title = "Top words") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#get bigrams
#tidytext 0.2.7 and later keep each tweet separate by default, so collapse = F is no longer needed
bigram_tweets <- garden %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)

#prep for visualization
sep_bigrams <- bigram_tweets %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = T) %>%
  select(word1, word2, n)
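
The bigram table that feeds the network plot can be sketched on two made-up tweets as follows (`toy_tweets` and `toy_bigrams` are hypothetical names). Each row is a pair of adjacent words plus the number of times that pair occurs; the pairs become the edges of the network.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical two-tweet data frame standing in for `garden`
toy_tweets <- tibble(
  status_id = c("1", "2"),
  text = c("water the garden", "weed the garden")
)

toy_bigrams <- toy_tweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # adjacent word pairs
  separate(bigram, c("word1", "word2"), sep = " ") %>%      # one column per word
  count(word1, word2, sort = TRUE)                          # pair frequencies
print(toy_bigrams)
```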


#visualize it
sep_bigrams %>%
  filter(n > 1) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
  labs(title = "Word Network: Garden Tweets",
       subtitle = "Optional subtitle here",
       x = "", y = "")

Using quanteda and tidytext for sentiment analysis

This illustration uses two different sentiment dictionaries: the AFINN lexicon (Nielsen 2011) and the Lexicoder Sentiment Dictionary that ships with quanteda (data_dictionary_LSD2015), which is designed for evaluating sentiment in news coverage, legislative speech, and other text. If you want to read more about how it was tested, check out Young and Soroka 2012.

#now turn twitter data into quanteda corpus object
twittergarden <- quanteda::corpus(garden)

#test quanteda sentiment dictionary 

#create a dfm object with the dictionary
#quanteda 3 and later expect tokens first; tokens_lookup() applies the dictionary
tweetsent <- tokens(twittergarden) %>%
  tokens_lookup(dictionary = data_dictionary_LSD2015) %>%
  dfm()
#the package is currently being updated (you can install the development version from GitHub);
#once updated, you will be able to take text length into account
#instead of working with just a basic dfm object
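
Until then, one way to take text length into account by hand is to divide the dictionary counts by the number of tokens in each text. The sketch below is an assumption about how that could look, not the package's own method. It uses a tiny made-up dictionary (`toy_dict`) in place of data_dictionary_LSD2015 so that the counts are easy to verify, and all object names are hypothetical.

```r
library(quanteda)

# Hypothetical two-category dictionary standing in for data_dictionary_LSD2015
toy_dict <- dictionary(list(
  positive = c("love", "beautiful"),
  negative = c("hate", "weeds")
))

# Hypothetical tweets standing in for `twittergarden`
toy_tweets <- c(
  t1 = "I love my beautiful garden",
  t2 = "I hate pulling weeds every weekend"
)
toy_tokens <- tokens(toy_tweets)

# Count dictionary matches per text
toy_sent <- dfm(tokens_lookup(toy_tokens, dictionary = toy_dict))
sent_df <- convert(toy_sent, to = "data.frame")

# Net sentiment per token: (positive - negative) / text length
sent_df$n_tokens <- as.integer(ntoken(toy_tokens))
sent_df$net_sentiment <- (sent_df$positive - sent_df$negative) / sent_df$n_tokens
print(sent_df)
```

The first text has two positive matches in five tokens and the second has two negative matches in six, so the longer text gets a slightly smaller score in absolute terms.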

#now testing the afinn dictionary
#AFINN is a general purpose lexicon, based on single tokens. It scores words on a scale of -5 to 5; negative to positive sentiment.
withstopwords <- tidy_tweets %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, "[a-z]"))

afinn <- withstopwords %>% 
  inner_join(get_sentiments("afinn"), by = "word") %>% 
  group_by(status_id) %>% 
  summarise(sentiment = sum(value)/n()) %>% #this normalizes the score by dividing by the number of AFINN-matched words in each tweet
  mutate(method = "AFINN")
#plot it!
afinn %>% ggplot(aes(status_id, sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
#with a bigger dataset, you could compare by date, user, etc.

#plot it in color
afinn %>% ggplot(aes(status_id, sentiment, fill = status_id)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
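
The AFINN scoring step above can be checked by hand on a tiny example. The sketch below uses a made-up three-word lexicon (`toy_lexicon`) instead of get_sentiments("afinn"), which may prompt for a download; the words, values, and object names are illustrative assumptions only.

```r
library(dplyr)

# Hypothetical tokenized tweets (one word per row), standing in for `withstopwords`
toy_words <- tibble(
  status_id = c("a", "a", "a", "b", "b"),
  word      = c("love", "garden", "happy", "hate", "weeds")
)

# Hypothetical mini lexicon in the same word/value layout as AFINN
toy_lexicon <- tibble(
  word  = c("love", "happy", "hate"),
  value = c(3, 3, -3)
)

toy_scores <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%   # keeps only words found in the lexicon
  group_by(status_id) %>%
  summarise(sentiment = sum(value) / n(),    # mean score of the matched words
            matched   = n())
print(toy_scores)
```

Because inner_join() drops words that are not in the lexicon, n() counts matched words rather than every word in the tweet, so the score is an average over sentiment-bearing words.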

Getting top hashtags and top user networks

This uses functions from quanteda to extract the top hashtags and top users. The functions fcm() and textplot_network() use co-occurrences of hashtags or usernames to visualize the networks.
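
To show what a feature co-occurrence matrix holds, here is a minimal sketch on three made-up tweets. All object names and tweet texts are hypothetical. Each cell counts how often two hashtags appear in the same tweet, and those counts become the edges that textplot_network() draws.

```r
library(quanteda)

# Hypothetical tweets standing in for `twittergarden`
toy_tweets <- c(
  t1 = "Planting today #garden #tomatoes",
  t2 = "Harvest time #garden #tomatoes",
  t3 = "Everything is blooming #garden"
)

# Tokenize (hashtags are kept intact), build a dfm, keep only hashtags
toy_dfm  <- dfm(tokens(toy_tweets, remove_punct = TRUE))
toy_tags <- dfm_select(toy_dfm, pattern = "#*")

# Count how often each pair of hashtags shares a tweet
toy_fcm <- fcm(toy_tags)
cooc <- as.matrix(toy_fcm)["#garden", "#tomatoes"]
print(toy_fcm)
print(cooc)
```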

#create a document feature matrix
tweet_dfm <- tokens(twittergarden, remove_punct = TRUE) %>%
  dfm()
#get top hashtags 
hashtags_dfm <- dfm_select(tweet_dfm, pattern = "#*")
top <- names(topfeatures(hashtags_dfm, 50))

tag_fcm <- fcm(hashtags_dfm)

#plot it here!
topgat_fcm <- fcm_select(tag_fcm, pattern = top)
textplot_network(topgat_fcm, min_freq = 0.1, edge_alpha = 0.8, edge_size = 5)

#another option is to extract most frequently mentioned usernames
users_dfm <- dfm_select(tweet_dfm, pattern = "@*")
topusers <- names(topfeatures(users_dfm, 50))

#construct feature co-occurrence matrix of usernames
users_fcm <- fcm(users_dfm)

#plot it
users_fcm2 <- fcm_select(users_fcm, pattern = topusers)
textplot_network(users_fcm2, min_freq = 0.1, edge_color = "orange", edge_alpha = 0.8, edge_size = 5)

Works Cited

Batrinca, Bogdan & Philip Treleaven. 2014. Social media analytics: a survey of techniques, tools, and platforms. AI & Society.
Beckman, Matthew, Stéphane Guerrier, Justin Lee, Roberto Molinari, Samuel Orso & Iegor Rudnytskyi. 2020. An Introduction to Statistical Programming with R.
Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, Akitaka Matsuo & William Lowe. 2018. quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774. doi:10.21105/joss.00774
Bird, Steven, Ewan Klein, and Edward Loper. 2019. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.
Brezina, Vaclav. 2018. Statistics in Corpus Linguistics.
Brown, Simon. 2016. Tips for Computational Text Analysis.
Bussiere, Kirsten. 2018. Digital Humanities - A Primer.
Csardi, Gabor & Tamas Nepusz. 2006. The igraph software package for complex network research. InterJournal, Complex Systems, 1695.
Evert, Stefan. 2007. Corpora and collocations.
Feinerer, Ingo. 2020. Introduction to the tm Package: Text Mining in R.
Freelon, Deen.
Han, Na-Rae. Python 3 tutorials.
Kearney, Matthew. 2018. R: Collecting and Analyzing Twitter Data: featuring {rtweet}. NiCAR 2018.
Kearney, Matthew, Andrew Heiss, and Francois Briatte. 2020. Package ‘rtweet’.
Kross, Sean et al. 2020. swirl: Learn R, in R.
Kuiper, Katie Ireland. 2021. Text Analysis Glossary. DigiLab.
Laudun, John. Text Analytics 101.
Machlis, Sharon. 2020. How to search Twitter with rtweet and R.
2020. Modern Perl: Why Perl Rules for Text.
Millot, Thomas. Photo. Unsplash.
Morikawa, Rei. 2019. 12 Best Social Media Datasets for Machine Learning.
Mullen, Lincoln. 2018. Introduction to tokenizers.
Nielsen, F. 2011. AFINN lexicon.
O’Connor, Brendan, David Bamman, and Noah Smith. 2011. Computational Text Analysis for Social Science: Model Assumptions and Complexity.
Parlante, Nick. 2002. Essential Perl.
Pedersen, Thomas. 2021. ggraph: an implementation of grammar of graphics for graphs and networks.
Rivera, Ian. 2019. package RedditExtractoR.
Rüdiger, Sophia, and Daria Dayter. 2020. Corpus Approaches to Social Media. In Studies in Corpus Linguistics.
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach.
Tufekci, Zeynep. 2017. Twitter and Tear Gas: The Power and Fragility of Networked Protest. Yale University Press.
Verma, Abhishek. 2021. Inspirational Quotes from GoodReads website.
Wasser, Leah, and Carson Farmer. 2020. Twitter Data in R Using Rtweet: Analyze and Download Twitter Data. Earth Data Science.
Watanabe, Kohei. 2021. Example: social media analysis. quanteda package examples.
Wickham et al. 2019. Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686.
Wiedemann, Gregor & Niekler, Andreas. 2017. Hands-on: A five day text mining course for humanists and social scientists in R. Proceedings of the 1st Workshop on Teaching NLP for Digital Humanities (2017), Berlin.
Witten, Ian. 2004. Text mining.
Young, Lori & Stuart Soroka. 2012. Affective News: The Automated Coding of Sentiment in Political Texts. Political Communication, 29(2), 205-231.