Text Analysis 101

Welcome to our third seminar in this series; today we will learn how to create our own corpora using social media data. Please make sure you have R and RStudio installed before the workshop.

Load the libraries and packages. (Use install.packages("thepackagename") if you haven't installed these already.)
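For example, a one-time setup sketch that installs all of the packages loaded below (dplyr is installed as part of the tidyverse):

#only needed once per machine
install.packages(c("tidytext", "tidyverse", "RedditExtractoR", "readtext",
                   "quanteda", "tokenizers", "rtweet"))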

library(tidytext)
library(tidyverse)
## ── Attaching packages ────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ───────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(RedditExtractoR)
library(readtext)
library(quanteda)
## Package version: 2.1.2
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library(tokenizers)
library(rtweet)
## 
## Attaching package: 'rtweet'
## The following object is masked from 'package:purrr':
## 
##     flatten

Reddit data

We are going to use RedditExtractoR to create a Reddit corpus. Feel free to adjust the settings and search terms below according to your research interests. Note that this package is subject to Reddit API limits, including a maximum of 500 comments per thread.

First, we will obtain data by searching for specific terms within a specific subreddit.

GAgarden <- get_reddit(search_terms = "garden in Georgia", regex_filter = "",
                       subreddit = "gardening", cn_threshold = 1,
                       page_threshold = 1, sort_by = "comments", wait_time = 2)
#inspect the reddit data
View(GAgarden)
#excluding the post text column because every comment row repeats the original post text
GAgarden_cleaned <- GAgarden %>% select(-post_text)

#creating the quanteda corpus object
redditGAgardencorpus <- quanteda::corpus(GAgarden_cleaned, text_field="comment")
summary(redditGAgardencorpus, 5)

Keep in mind that this data extraction produces a data frame with a flat structure: it does not preserve the order or hierarchy of user comments. This next demonstration shows how to create a data object based on a specific Reddit URL, with the option to use this package’s network graph function.

#example network graph from a specific reddit page

cc_url <- "https://www.reddit.com/r/climatechange/comments/bjavvf/a_debate_on_climate_change/"
url_data <- reddit_content(cc_url)
graph_object <- construct_graph(url_data)

#turn this reddit data into a corpus object:
cccorpus <- quanteda::corpus(url_data, text_field="comment")
summary(cccorpus)

Twitter data

Twitter has recently updated its rules and regulations for researchers interested in using its data. Check out the application for becoming a developer and using the Twitter API here.

Using rtweet, users have two options for getting data. The first uses the search_tweets function to get tweets; a pop-up browser window will ask the user to authenticate the request. With this method, there are stricter limits on how many tweets and how much data the researcher can obtain. After the user authorizes the request in the browser, the authorization token is stored in the user’s .Renviron file.
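If you want to confirm that the browser-based authorization worked, rtweet can retrieve the token it saved (a quick, optional check):

#view the token rtweet stored after the browser authorization
get_token()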

The other method involves creating a Twitter developer account (see above), which is recommended for researchers and for obtaining more data.

#the function search_tweets returns 100 tweets by default; you can change this using the n argument below:
garden <- search_tweets("gardening", n = 10, include_rts = FALSE)
View(garden)

#Note: rtweet includes multiple options for search functions, including phrases

#search for a keyword
keyword <- search_tweets(q = "gardens")

# search for an exact phrase (wrap the phrase in embedded double quotes)
phrase <- search_tweets(q = '"Georgia red clay"')

#search for multiple keywords (spaces between terms act like a logical AND)
manykeywords <- search_tweets(q = "gardening clay")

#turn into a quanteda corpus object
twittergarden <- quanteda::corpus(garden)
summary(twittergarden)
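Because corpus() received the full rtweet data frame, the remaining tweet metadata (screen names, timestamps, and so on) is kept as document variables. A quick way to check (column names come from rtweet and may vary across versions):

#inspect the tweet metadata stored as docvars
head(quanteda::docvars(twittergarden))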

Once your Twitter developer application is approved, you will need to use your consumer key and consumer secret to create a token object that authorizes your requests for data. The code below includes the steps necessary to get your Twitter data with a developer account:

#First, store the name of the app
my_app_name <- "twitter_app"

#insert your consumer key and consumer secret
consumer_key <- "your_key_here"
consumer_secret <- "your_secret_here"

#using these values, you will create a token data object, which is the Twitter authorization token

#create token
token <- create_token(my_app_name, consumer_key, consumer_secret)

#print token
token

#To get more tweets with search_tweets (the rate limit is 18,000 tweets per 15 minutes), set n to a higher number.
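#for example, a sketch assuming the token object created above is valid: a larger
#request that also lets rtweet wait out rate limits via retryonratelimit
many_garden <- search_tweets("gardening", n = 5000, include_rts = FALSE,
                             token = token, retryonratelimit = TRUE)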

#you can also search for tweets by language


#search for tweets in English that are not retweets
English_tweets <- search_tweets("lang:en", include_rts = FALSE)


#search for English tweets about gardening by geolocation
athens_tweets <- search_tweets("gardening", lang="en", geocode = lookup_coords("Athens, GA"))

#you can also get tweets in real time with different options for filtering

#randomly sampled
random_tweets <- stream_tweets(q = "", timeout = 30)
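#a sketch of a filtered stream: instead of a random sample, track tweets that
#mention gardening for 30 seconds
garden_stream <- stream_tweets(q = "gardening", timeout = 30)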

Other types of social media datasets

Reddit and Twitter are of course not the only social media platforms available, and there are many helpful sites, such as kaggle.com, where precompiled datasets can be obtained. The last corpus we will make today comes from a Goodreads dataset; Goodreads is a social media website where members share and review books and connect with other readers. Goodreads hosts a wealth of book reviews, recommendations, and ratings that may help librarians and readers select relevant books. This dataset includes a variety of inspirational quotes, with helpful metadata including author and publication information (Verma 2021). You can download this data directly from kaggle.com here.

#insert your file path for the goodreads data below (Verma 2021)
#create data object
quotes <- read.csv("/Users/kikuiper/Documents/data_dh/quotes.csv")
View(quotes)

#note: these functions below are primarily from the tidytext package!

corp_quote_words <- quotes %>%
  unnest_tokens(output = word, 
                input = quote, 
                token = "words")


#there are plenty of options for analyzing your data in this format, or you can turn it into a quanteda corpus object, as shown below:

goodreads <- quanteda::corpus(quotes$quote)
summary(goodreads)
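#optionally, pass the whole data frame and name the text column so the remaining
#columns (author and other metadata) are kept as document variables; the column
#name "quote" follows the Kaggle file and may differ in your download
goodreads_meta <- quanteda::corpus(quotes, text_field = "quote")
summary(goodreads_meta, 5)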

More processing options: stemming, stopwords, etc.

#this is an example of stemming, which is an optional preprocessing step in text analysis
goodreadsstems <- tokenize_word_stems(quotes$quote)
print(goodreadsstems)

#removing stopwords using the stop_words lexicon from the tidytext package
goodreads_cleaned <- corp_quote_words %>% anti_join(stop_words)
View(goodreads_cleaned)

goodreads_cleaned %>% count(word, sort = TRUE)

#the tokenizers package also includes the helpful function tokenize_tweets, which cleans up Twitter data while preserving important elements like hashtags and usernames.
gardentokens <- tokenize_tweets(garden$text)
print(gardentokens)

Tune in next week to learn about options for analyzing your social media data!

Works Cited

Batrinca, Bogdan & Philip Treleaven. 2014. Social media analytics: a survey of techniques, tools, and platforms. AI & Society.
Beckman, Matthew, Stéphane Guerrier, Justin Lee, Roberto Molinari, Samuel Orso & Iegor Rudnytskyi. 2020. An Introduction to Statistical Programming with R.
Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, Akitaka Matsuo & William Lowe. 2018. “quanteda: An R package for the quantitative analysis of textual data.” Journal of Open Source Software, 3(30), 774. doi:10.21105/joss.00774
Bird, Steven, Ewan Klein, and Edward Loper. 2019. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.
Brezina, Vaclav. 2018. Statistics in Corpus Linguistics.
Brown, Simon. 2016. Tips for Computational Text Analysis.
Bussiere, Kirsten. 2018. Digital Humanities - A Primer.
Evert, Stefan. 2007. Corpora and collocations.
Feinerer, Ingo. 2020. Introduction to the tm Package: Text Mining in R.
Freelon, Deen. http://socialmediadata.wikidot.com/
Han, Na-Rae. Python 3 tutorials.
Kearney, Matthew. 2018. R: Collecting and Analyzing Twitter Data: featuring {rtweet}. NiCAR 2018.
Kearney, Matthew, Andrew Heiss, and Francois Briatte. 2020. Package ‘rtweet’.
Kuiper, Katie Ireland. 2021. Text Analysis Glossary. DigiLab.
Laudun, John. Text Analytics 101.
Machlis, Sharon. 2020. How to search Twitter with rtweet and R. infoworld.com
Millot, Thomas. Photo. Unsplash.
Modern Perl: Why Perl Rules for Text. 2020. https://somedudesays.com/2020/02/modern-perl-why-perl-rules-for-text/
MonkeyLearn. Text Analysis. https://monkeylearn.com/text-analysis/
Morikawa, Rei. 2019. 12 Best Social Media Datasets for Machine Learning.
Mullen, Lincoln. 2018. Introduction to tokenizers.
O’Connor, Brendan, David Bamman, and Noah Smith. 2011. Computational Text Analysis for Social Science: Model Assumptions and Complexity.
Parlante, Nick. 2002. Essential Perl.
Rivera, Ivan. 2019. Package ‘RedditExtractoR’.
Rüdiger, Sophia, and Daria Dayter. 2020. Corpus Approaches to Social Media. In Studies in Corpus Linguistics.
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach.
Verma, Abhishek. 2021. Inspirational Quotes from GoodReads website.
Wickham, Hadley, et al. 2019. Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wiedemann, Gregor & Andreas Niekler. 2017. Hands-on: A five day text mining course for humanists and social scientists in R. Proceedings of the 1st Workshop on Teaching NLP for Digital Humanities (2017), Berlin.
Witten, Ian. 2004. Text mining.