Agenda
- Regular expressions
- Basic text analysis
- corpora
- word clouds
- sentiment analysis
- Intermediate
- term frequency x inverse document frequency (tf-idf)
- Latent Dirichlet allocation (LDA; topic modeling)
Announcements
- Final Project update
- MDSR Ch 15 programming notebook assigned
- Beckman’s office hour cancelled on Thurs 4/4 (see syllabus for other options)
- MDSR Ch 15 Exercises due 4/14 before midnight
- Lots of errata/tips this chapter
MDSR Ch 15 Errata / Tips
- Some sections don’t require programming, but please still include the headers for navigation purposes
- p. 361: Note–
data("DataSciencePapers")
should appear BOTH in the front matter (per the style guide) AND as a code chunk in sequence with the other MDSR code. That way your section 15.2 results in the programming notebook will match the MDSR book, rather than reflecting the new data queried from arXiv, which is also loaded on p. 361 under the object name DataSciencePapers
- p. 365: your word cloud won’t be identical and may even exclude “data” if it doesn’t fit on the page
- p. 368: Wikipedia changed the tables all around. Use
Table[[4]]
when scraping; Title
is now Song
; the results won’t exactly match but you’ll be able to work it out.
- p. 369: change
25
to 15
(the result is quite a different message!)
- p. 370: you don’t need a Twitter account, the provided credentials work… here they are:
- consumer_key = “u2UthjbK6YHyQSp4sPk6yjsuV”
- consumer_secret = “sC4mjd2WME5nH1FoWeSTuSy7JCP5DHjNtTYU1X6BwQ1vPZ0j3v”
- access_token = “1365606414-7vPfPxStYNq6kWEATQlT8HZBd4G83BBcX4VoS9T”
- access_secret = “0hJq9KYC3eBRuZzJqSacmtJ4PNJ7tNLkGrQrVl00JHirs”
- p. 372: you’ll need to load the
RSQLite
library
- p. 373: you can skip the
geocode(...)
function from the ggmap
package and just assign lon
and lat
directly. You can even look up the coordinates for State College and substitute that if you like. Oddly enough, you might have to give a credit card number to Google if you want to use geocode()
so feel free to modify your code to avoid that step.
Project Gutenberg
- https://www.gutenberg.org/
- Project Gutenberg is a massive free online library
- full-text of almost 60,000 free ebooks (e.g., expired copyright)
- Jane Austen
- Charles Dickens
- Mark Twain
- Arthur Conan Doyle
- Oscar Wilde
- Bill Shakespeare
- (and loads more…)
- New on 3/29/2019–The American Bee Journal!
Macbeth Summary
Our “muse” for this first portion is the famous play Macbeth by William Shakespeare. If you aren’t familiar with the play, here is a quick summary:
Macbeth text data intake
- Pretty fast!
- How does it look?
- Q: Look closely. Can you read this?
- Q: How can we make the text more understandable?
macbeth_url <- "http://www.gutenberg.org/cache/epub/1129/pg1129.txt"
Macbeth_raw <- RCurl::getURL(macbeth_url)
# Macbeth_raw
Text as data
- Humans:
- excellent at understanding text
- not as good at storing text
- Computers:
- excellent at storing text
- not as good at understanding text
- Human + Computer??
- excellent at understanding & storing text?
- not so good at storing & understanding text?
- (depends on the human & the computer…?)
Back to Macbeth
- We need to split the one giant string of
Macbeth_raw
everywhere that we see the pattern \r\n
(marking end of line)
- Q: What class of object do we have now?
- Q: Is it better?
macbeth_tmp <- strsplit(Macbeth_raw, "\r\n")
str(macbeth_tmp)
List of 1
$ : chr [1:3193] "This Etext file is presented by Project Gutenberg, in" "cooperation with World Library, Inc., from their Library of the" "Future and Shakespeare CDROMS. Project Gutenberg often releases" "Etexts that are NOT placed in the Public Domain!!" ...
macbeth <- strsplit(Macbeth_raw, "\r\n")[[1]]
length(macbeth)
[1] 3193
head(macbeth)
[1] "This Etext file is presented by Project Gutenberg, in"
[2] "cooperation with World Library, Inc., from their Library of the"
[3] "Future and Shakespeare CDROMS. Project Gutenberg often releases"
[4] "Etexts that are NOT placed in the Public Domain!!"
[5] ""
[6] "*This Etext has certain copyright implications you should read!*"
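A minimal sketch of the same split on a toy string (the `raw_text` value below is made up for illustration, not taken from the Gutenberg file). It shows why the `[[1]]` is needed: `strsplit()` returns a list, even when it is given a single string.

```r
# Toy stand-in for Macbeth_raw: one long string with "\r\n" line endings
raw_text <- "ACT I. SCENE I.\r\nA desert place.\r\n\r\nThunder and lightning."

split_list <- strsplit(raw_text, "\r\n")  # a list with one element
text_lines <- split_list[[1]]             # the character vector inside it

class(split_list)   # "list"
class(text_lines)   # "character"
length(text_lines)  # 4 (the blank line is kept as "")
```

Working with the character vector rather than the list is what lets us index lines directly, as in `macbeth[295:310]`.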
The set up…
- Let’s say you enrolled in THEA 102 for a GA credit: Fundamentals of Acting
- The class will choose a play by Shakespeare and then everyone will volunteer for the various parts.
- There’s someone in the class you want to impress
- You want a part that is important,
- but you don’t want too many lines to memorize…
Manual text analysis
- How would you count up the lines if you just have a physical copy of the play?
- Q: For one character?
- Q: For all characters?
- Q: For all 37 of Shakespeare’s plays?
macbeth[295:310]
[1] ""
[2] "SCENE II."
[3] "A camp near Forres. Alarum within."
[4] ""
[5] "Enter Duncan, Malcolm, Donalbain, Lennox, with Attendants,"
[6] "meeting a bleeding Sergeant."
[7] ""
[8] " DUNCAN. What bloody man is that? He can report,"
[9] " As seemeth by his plight, of the revolt"
[10] " The newest state."
[11] " MALCOLM. This is the sergeant"
[12] " Who like a good and hardy soldier fought"
[13] " 'Gainst my captivity. Hail, brave friend!"
[14] " Say to the King the knowledge of the broil"
[15] " As thou didst leave it."
[16] " SERGEANT. Doubtful it stood,"
Manual text analysis
- If you had to do it manually, you’d probably flip through the pages and make a tally (on a separate sheet) for each line
- Now that you know you can find Shakespeare’s plays through Project Gutenberg, you can probably access ebooks for all of them
- Q: What’s the approach?
- Q: Scrolling instead of “flipping” but otherwise same?
- Q: Something else?
Regular Expressions (RegEx)
- Humans are good at understanding text, but bad at storing lots of text and slow at searching
- Computers are good at storing & searching, but not good at “understanding” it.
- Regular expressions
- Computer stores the text (good at this)
- Human identifies pattern (good at this)
- Human translates pattern so computer can search for it (takes practice…)
- Computer searches the text (good at this)
- RStudio Cheatsheets: https://www.rstudio.com/resources/cheatsheets/
grep( )
& grepl( )
- these functions look for the needle (pattern) in the haystack (character vector)
- needle/pattern:
"  MACBETH"
with two leading spaces
- haystack/text:
macbeth
character vector object
- Q: What’s returned in each case?
- Are these Macbeth’s lines?
- All mentions of Macbeth’s name?
- Are we done?
- Q: How are we doing?
- Can we improve it further?
- Compare with Ctrl + F on the website?
macbeth_lines <- grep("MACBETH", macbeth)
# macbeth_lines <- grep("MACBETH", macbeth, value = TRUE)
# macbeth_lines <- grepl("MACBETH", macbeth)
length(macbeth_lines)
[1] 208
head(macbeth_lines)
[1] 218 228 230 433 443 466
identical(c(1:3), c(1L, 2L, 3L)) # are these the same?
identical(c(1:3), c(1, 2, 3)) # `identical` is PICKY
# are these the same?
identical(macbeth[grep("MACBETH", macbeth)],
macbeth[grepl("MACBETH", macbeth)])
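To make the three return types concrete, here is a small sketch on a made-up three-line haystack (the `haystack` vector is an assumption for illustration only).

```r
# Toy haystack: three lines, two of which contain the needle
haystack <- c("  MACBETH. So foul and fair a day I have not seen.",
              "  BANQUO. How far is't call'd to Forres?",
              "Enter MACBETH and BANQUO.")

idx  <- grep("MACBETH", haystack)                # positions of matching lines
vals <- grep("MACBETH", haystack, value = TRUE)  # the matching lines themselves
lgl  <- grepl("MACBETH", haystack)               # one TRUE/FALSE per line

idx   # 1 3
lgl   # TRUE FALSE TRUE

# Indexing by position or by logical vector selects the same lines
identical(haystack[idx], haystack[lgl])  # TRUE
```

`grepl()` is usually the more convenient form inside `filter()` or `mutate()`, because it returns one value per row.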
Refining our RegEx
- How do you identify a speaker’s lines? (Humans are good at this, not computers)
- All CAPITAL LETTERS
- Two leading spaces
- ends with period
- Q: Close, but what’s wrong here?
macbeth_lines <- grep(" MACBETH.", macbeth, value = TRUE)
length(macbeth_lines)
[1] 147
head(macbeth_lines)
[1] " MACBETH, Thane of Glamis and Cawdor, a general in the King's"
[2] " MACBETH. So foul and fair a day I have not seen."
[3] " MACBETH. Speak, if you can. What are you?"
[4] " MACBETH. Stay, you imperfect speakers, tell me more."
[5] " MACBETH. Into the air, and what seem'd corporal melted"
[6] " MACBETH. Your children shall be kings."
Problem with period (.
)
- In RegEx, the period is a metacharacter that matches ANY character
\\.
is required to literally search for periods in our RegEx pattern
\\
is called an “escape”
- escape is required for any part of a RegEx that has special meaning, but you just want to search literally
# anything with "MAC" and then another character
grep("MAC.", macbeth, value = TRUE) %>%
head()
[1] "MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES"
[2] "MACHINE READABLE COPIES OF THIS ETEXT, SO LONG AS SUCH COPIES"
[3] "WITH PERMISSION. ELECTRONIC AND MACHINE READABLE COPIES MAY BE"
[4] "THE TRAGEDY OF MACBETH"
[5] " MACBETH, Thane of Glamis and Cawdor, a general in the King's"
[6] " LADY MACBETH, his wife"
# MACBETH.
grep("MACBETH\\.", macbeth, value = TRUE) %>% head(20)
[1] " MACBETH. So foul and fair a day I have not seen."
[2] " MACBETH. Speak, if you can. What are you?"
[3] " MACBETH. Stay, you imperfect speakers, tell me more."
[4] " MACBETH. Into the air, and what seem'd corporal melted"
[5] " MACBETH. Your children shall be kings."
[6] " MACBETH. And Thane of Cawdor too. Went it not so?"
[7] " MACBETH. The Thane of Cawdor lives. Why do you dress me"
[8] " MACBETH. [Aside.] Glamis, and Thane of Cawdor!"
[9] " MACBETH. [Aside.] Two truths are told,"
[10] " MACBETH. [Aside.] If chance will have me King, why, chance may"
[11] " MACBETH. [Aside.] Come what come may,"
[12] " MACBETH. Give me your favor; my dull brain was wrought"
[13] " MACBETH. Till then, enough. Come, friends. Exeunt."
[14] " MACBETH. The service and the loyalty lowe,"
[15] " MACBETH. The rest is labor, which is not used for you."
[16] " MACBETH. [Aside.] The Prince of Cumberland! That is a step"
[17] " LADY MACBETH. \"They met me in the day of success, and I have"
[18] " LADY MACBETH. Thou'rt mad to say it!"
[19] " LADY MACBETH. Give him tending;"
[20] " MACBETH. My dearest love,"
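A short sketch of the escape on toy input (the vector `x` is invented for illustration). It also shows two alternatives to `\\.` that give the same result here: putting the period in a character set, or turning RegEx off with `fixed = TRUE`.

```r
x <- c("  MACBETH. Speak, if you can.",
       "  MACBETH, Thane of Glamis",
       "THE MACBETHS")

grepl("MACBETH.", x)                # TRUE TRUE TRUE: `.` matches any character
grepl("MACBETH\\.", x)              # TRUE FALSE FALSE: escaped, literal period
grepl("MACBETH[.]", x)              # TRUE FALSE FALSE: period inside a character set
grepl("MACBETH.", x, fixed = TRUE)  # TRUE FALSE FALSE: pattern treated as plain text
```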
Simple RegEx Tools
|
: alternation–search for a few specific alternatives
"MAC(B|D)"
would match all strings that include “MACB” OR “MACD” (group the alternatives in parentheses; inside square brackets, | is just a literal character)
[ ]
: character sets–square brackets define sets of characters to match
"MAC[C-Z]"
would match “MAC” followed by any capital letter from “C” through “Z”
^
or $
: anchors–search for pattern strictly at the beginning (^
) or end ($
) of string
"^ACT"
would match all lines strictly beginning with “ACT”
"MACBETH$"
would match all lines strictly ending with “MACBETH” (no punctuation)
?
or *
or +
: repetitions
"^ ?MAC"
would match if string begins with zero or one leading spaces followed by “MAC”
"^ *MAC"
would match if string begins with zero or more leading spaces followed by “MAC”
"^ +MAC"
would match if string begins with one or more leading spaces followed by “MAC”
"^ {3}MAC"
would match if string begins with exactly 3 leading spaces followed by “MAC”
# alternation with `|` (parentheses group the alternatives)
grep("MAC(B|D)", macbeth, value = TRUE) %>% head()
[1] "THE TRAGEDY OF MACBETH"
[2] " MACBETH, Thane of Glamis and Cawdor, a general in the King's"
[3] " LADY MACBETH, his wife"
[4] " MACDUFF, Thane of Fife, a nobleman of Scotland"
[5] " LADY MACDUFF, his wife"
[6] " MACBETH. So foul and fair a day I have not seen."
# "MAC" followed by any capital letter from "C" through "Z"
grep("MAC[C-Z]", macbeth, value = TRUE) %>% head(10)
[1] "MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES"
[2] "MACHINE READABLE COPIES OF THIS ETEXT, SO LONG AS SUCH COPIES"
[3] "WITH PERMISSION. ELECTRONIC AND MACHINE READABLE COPIES MAY BE"
[4] " MACDUFF, Thane of Fife, a nobleman of Scotland"
[5] " LADY MACDUFF, his wife"
[6] "WITH PERMISSION. ELECTRONIC AND MACHINE READABLE COPIES MAY BE"
[7] "WITH PERMISSION. ELECTRONIC AND MACHINE READABLE COPIES MAY BE"
[8] " MACDUFF. Was it so late, friend, ere you went to bed,"
[9] " MACDUFF. What three things does drink especially provoke?"
[10] " MACDUFF. I believe drink gave thee the lie last night."
# search for beginning of each act in the play (`^` goes at beginning)
grep("ACT", macbeth, value = TRUE) %>% head(10)
[1] "BREACH OF WARRANTY OR CONTRACT, INCLUDING BUT NOT LIMITED TO"
[2] "ACT I. SCENE I."
[3] "ACT II. SCENE I."
[4] "ACT III. SCENE I."
[5] "ACT IV. SCENE I."
[6] "ACT V. SCENE I."
grep("^ACT", macbeth, value = TRUE) %>% head(10)
[1] "ACT I. SCENE I." "ACT II. SCENE I." "ACT III. SCENE I." "ACT IV. SCENE I." "ACT V. SCENE I."
# strings strictly ending in "MACBETH" (`$` goes at end)
grep("MACBETH$", macbeth, value = TRUE) %>% head(10)
[1] "THE TRAGEDY OF MACBETH"
# repetitions
grep("^ ?MAC", macbeth, value = TRUE) %>% head() # zero or one leading spaces
[1] "MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES"
[2] "MACHINE READABLE COPIES OF THIS ETEXT, SO LONG AS SUCH COPIES"
grep("^ *MAC", macbeth, value = TRUE) %>% head() # zero or more leading spaces
[1] "MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES"
[2] "MACHINE READABLE COPIES OF THIS ETEXT, SO LONG AS SUCH COPIES"
[3] " MACBETH, Thane of Glamis and Cawdor, a general in the King's"
[4] " MACDUFF, Thane of Fife, a nobleman of Scotland"
[5] " MACBETH. So foul and fair a day I have not seen."
[6] " MACBETH. Speak, if you can. What are you?"
grep("^ +MAC", macbeth, value = TRUE) %>% head() # one or more leading spaces
[1] " MACBETH, Thane of Glamis and Cawdor, a general in the King's"
[2] " MACDUFF, Thane of Fife, a nobleman of Scotland"
[3] " MACBETH. So foul and fair a day I have not seen."
[4] " MACBETH. Speak, if you can. What are you?"
[5] " MACBETH. Stay, you imperfect speakers, tell me more."
[6] " MACBETH. Into the air, and what seem'd corporal melted"
grep("^ {2}MAC", macbeth, value = TRUE) %>% head() # exactly two leading spaces
[1] " MACBETH, Thane of Glamis and Cawdor, a general in the King's"
[2] " MACDUFF, Thane of Fife, a nobleman of Scotland"
[3] " MACBETH. So foul and fair a day I have not seen."
[4] " MACBETH. Speak, if you can. What are you?"
[5] " MACBETH. Stay, you imperfect speakers, tell me more."
[6] " MACBETH. Into the air, and what seem'd corporal melted"
grep("^ {3}MAC", macbeth, value = TRUE) %>% head() # exactly three leading spaces
character(0)
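One detail worth checking by hand: `|` only means "or" outside square brackets. Inside a character set it is an ordinary character. The sketch below uses a made-up vector `x` to compare the three spellings.

```r
x <- c("MACB", "MACD", "MAC|", "MACX")

grepl("MAC(B|D)", x)  # TRUE TRUE FALSE FALSE: alternation inside a group
grepl("MAC[BD]", x)   # TRUE TRUE FALSE FALSE: character set, same result
grepl("MAC[B|D]", x)  # TRUE TRUE TRUE FALSE: the set is B, "|", or D
```

On the Macbeth text the three patterns most likely return the same lines, since a literal `|` is unlikely to appear there, but the difference matters on other data.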
Speaker frequency in Macbeth
- recall: you wanted an important character, but not a huge number of lines to memorize
- want a decent number of lines
- “important” could mean involved for most of the play (not killed immediately)
- Q: Who meets our criteria?
Macbeth <- grepl(" MACBETH\\.", macbeth)
Macduff <- grepl(" MACDUFF\\.", macbeth)
LadyMacbeth <- grepl(" LADY MACBETH\\.", macbeth)
LadyMacduff <- grepl(" LADY MACDUFF\\.", macbeth)
Banquo <- grepl(" BANQUO\\.", macbeth)
Duncan <- grepl(" DUNCAN\\.", macbeth)
speaker_freq <- data.frame(Macbeth, Macduff, LadyMacbeth, LadyMacduff, Banquo, Duncan) %>%
mutate(line = 1:length(macbeth)) %>%
gather(key = "character", value = "speak", -line) %>%
mutate(speak = as.numeric(speak)) %>%
filter(line > 218 & line < 3172)
glimpse(speaker_freq)
Observations: 17,718
Variables: 3
$ line <int> 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238,…
$ character <chr> "Macbeth", "Macbeth", "Macbeth", "Macbeth", "Macbeth", "Macbeth", "Macbeth", "Macbeth", "Macbeth", …
$ speak <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# find the acts that subdivide the play
acts_idx <- grep("^ACT ", macbeth)
acts_labels <- str_extract(macbeth[acts_idx], "^ACT [IV]+") # [IV]+ matches I, II, III, IV, V
acts <- data.frame(line = acts_idx, labels = acts_labels)
# plot speaker frequencies
speaker_freq %>%
ggplot(aes(x = line, y = speak)) +
geom_smooth(aes(color = character), method = "loess", se = 0) +
geom_vline(xintercept = acts_idx, color = "darkgray", lty = 3) +
geom_text(data = acts, aes(y = 0.085, label = labels),
hjust = "left", color = "darkgray") +
ylim(c(0, NA)) +
xlab("Line Number") +
ylab("Proportion of Speeches")
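Before plotting, a plain tally of how many speeches each character starts can be a useful sanity check. This is a sketch on a made-up five-line excerpt (`toy_play` and the `speaker_patterns` names are assumptions); the `^ {2}` anchor is my addition so that "LADY MACBETH." is not also counted for "MACBETH."

```r
# Toy excerpt: speaker labels are indented two spaces, continuation lines four
toy_play <- c("  MACBETH. So foul and fair a day I have not seen.",
              "  BANQUO. How far is't call'd to Forres?",
              "  MACBETH. Speak, if you can. What are you?",
              "  LADY MACBETH. Give him tending;",
              "    He brings great news.")

speaker_patterns <- c(Macbeth     = "^ {2}MACBETH\\.",
                      LadyMacbeth = "^ {2}LADY MACBETH\\.",
                      Banquo      = "^ {2}BANQUO\\.")

# Sum the TRUEs from grepl() to count lines that open a speech
speech_counts <- sapply(speaker_patterns, function(p) sum(grepl(p, toy_play)))
speech_counts
```

On the toy excerpt this gives 2 speeches for Macbeth and 1 each for Lady Macbeth and Banquo.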

Further Analysis
- Getting closer… but we should probably open up the search a little
- Problem: We really only grabbed the first part of each line…
- What if we accidentally choose a character prone to long speeches?
- Might be too much to memorize… we need better information
- Q: How do you suggest we access entire lines for each part in the play?
- How would you do it “by eye”?
- How can you translate that into RegEx?
Identifying lines & speakers
- Q: What does this RegEx pattern get for us?
"\r\n {2}[A-Z 0-9]*\\. "
- Q: What’s the difference between the result for
strsplit
stringr::str_extract
stringr::str_extract_all
- Q: Why
[-1]
at the end of strsplit
command?
# inside [ ], `|` is a literal character, so the set is just capitals, spaces, and digits
lines <- strsplit(Macbeth_raw, "\r\n {2}[A-Z 0-9]*\\. ")[[1]][-1]
speakers <- stringr::str_extract_all(Macbeth_raw, " {2}[A-Z 0-9]*\\. ")[[1]]
AllMacbeth <- data.frame(speakers, lines)
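The pairing of "split on the speaker label" with "extract the speaker label" can be tried on a toy string first. This sketch uses base R (`regmatches()` with `gregexpr()`) in place of `stringr::str_extract_all()`, purely so it runs without extra packages; `toy_raw` is invented for illustration.

```r
toy_raw <- paste0("THE TRAGEDY OF MACBETH",
                  "\r\n  DUNCAN. What bloody man is that?",
                  "\r\n    He can report.",
                  "\r\n  MALCOLM. This is the sergeant")

label <- " {2}[A-Z 0-9]*\\. "  # two spaces, capitals/digits/spaces, period, space

# Everything between speaker labels; [-1] drops the text before the first label
speeches <- strsplit(toy_raw, paste0("\r\n", label))[[1]][-1]

# The labels themselves, in order of appearance
speakers <- regmatches(toy_raw, gregexpr(label, toy_raw))[[1]]

data.frame(speakers = trimws(speakers), speeches)
```

The two vectors line up only when every label found by the extraction is also a split point, so it is worth confirming that `length(speeches) == length(speakers)` before building the data frame.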
Clean up the lines
- Q: What’s left to clean up?
head(AllMacbeth)
Clean up the lines
gsub
is handy for “find & replace”
- if you want to simply cut some pattern out of the text
- find it with the proper RegEx
- replace with “” (that’s a pattern with nothing in it!)
AllMacbeth <-
AllMacbeth %>%
mutate(lines = gsub(pattern = "[\r\n]+", replacement = "", x = lines))
head(AllMacbeth)
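A sketch of the same find-and-replace on a single toy speech (the `speech` string is adapted from the excerpt shown earlier). The second `gsub()` is an optional extra of mine, not part of the original cleanup: it squeezes the leftover indentation down to single spaces.

```r
speech <- paste0("What bloody man is that? He can report,",
                 "\r\n    As seemeth by his plight, of the revolt",
                 "\r\n    The newest state.")

# Cut out carriage returns and newlines by replacing them with nothing
one_line <- gsub(pattern = "[\r\n]+", replacement = "", x = speech)

# Optional: collapse runs of two or more spaces left behind by the indentation
squeezed <- gsub(pattern = " {2,}", replacement = " ", x = one_line)
squeezed
```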
More clean up
- The end of each Act seems to include something like the following text…
<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM
SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS
PROVIDED BY PROJECT GUTENBERG ETEXT OF CARNEGIE MELLON UNIVERSITY
WITH PERMISSION. ELECTRONIC AND MACHINE READABLE COPIES MAY BE
DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS
PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
COMMERCIALLY. PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>
- Q: Where do you think that “junk” landed due to our approach so far?
- Q: What should we do?
- Q: Is there more “junk” like this?
Another gsub
- we’re not going to clean up everything, but we’ll do this one more…
- Q: Consider RegEx pattern:
"<<.*>>"
- What’s it do?
- Do you think this will work?
- Do we need an escape for the special characters?
\\
junk <- grepl(pattern = "<<.*>>", x = AllMacbeth$lines)
# Before
AllMacbeth %>%
filter(junk) %>%
select(speakers, lines)
# Correction
AllMacbeth <-
AllMacbeth %>%
mutate(lines = gsub(pattern = "<<.*>>", replacement = "", x = lines))
# After
AllMacbeth %>%
filter(junk) %>%
select(speakers, lines)
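A quick check on toy strings (both invented for illustration) suggests that `<` and `>` need no escape. It also shows one thing to watch for: `.*` is greedy, so if a string held two separate `<<...>>` notices, everything between the first `<<` and the last `>>` would be removed. The lazy form `.*?` with `perl = TRUE` is one possible alternative.

```r
speech <- "Come, friends.  Exeunt.<<THIS ELECTRONIC VERSION IS COPYRIGHT>>"
gsub(pattern = "<<.*>>", replacement = "", x = speech)
# "Come, friends.  Exeunt."

two_notices <- "A <<one>> B <<two>> C"
greedy <- gsub("<<.*>>", "", two_notices)                # "A  C"
lazy   <- gsub("<<.*?>>", "", two_notices, perl = TRUE)  # "A  B  C"
```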
Time to take a look!
- Q: How might we use our result so far to count how many words are attributed to each character?
First attempt: Counting words
- Q: How’s it look?
- Q: How would you debug this?
countWords <- function(line) {
return(length(strsplit(x = line, split = "\\s+")))
}
AllMacbeth$nWords <- sapply(X = AllMacbeth$lines, FUN = countWords)
head(AllMacbeth)
Debugging our cleanup
- Q: How’s it look?
- Q: What type of object is this?
strsplit(x = AllMacbeth$lines[1], split = "\\s+")
[[1]]
[1] "When" "shall" "we" "three" "meet" "again?" "In" "thunder,"
[9] "lightning," "or" "in" "rain?"
Debugging our cleanup
countWords <- function(line) {
return(length(strsplit(x = line, split = "\\s+")[[1]]))
}
AllMacbeth$nWords <- sapply(X = AllMacbeth$lines, FUN = countWords)
head(AllMacbeth)
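The bug and its fix can be reproduced on two toy speeches (the `speeches` vector is made up). As a possible shortcut, base R's `lengths()` counts the elements of every list entry at once, which avoids the helper function and `sapply()` altogether.

```r
speeches <- c("When shall we three meet again?", "I come, Graymalkin.")

length(strsplit(speeches[1], "\\s+"))       # 1: length of the list, not the words
length(strsplit(speeches[1], "\\s+")[[1]])  # 6: length of the vector inside it

# Vectorized alternative: one word count per speech
n_words <- lengths(strsplit(speeches, "\\s+"))
n_words  # 6 3
```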
Choosing your character
- Recall: You want to
- impress someone in THEA 102
- choose an “important” character
- not too much to memorize
- Q: What are your priorities?
- Most lines?
- Longest lines?
- Who has the most total content to memorize?
- Something else?
- Q: How should we figure it out from here?
# Longest individual lines
AllMacbeth %>%
select(speakers, nWords, lines) %>%
arrange(desc(nWords))
Character analysis
- Most lines?
- Longest lines?
- Who has the most total content to memorize?
- Something else?
MacbethSummary <-
AllMacbeth %>%
group_by(speakers) %>%
summarise(lines = n(),
totalWords = sum(nWords),
wordsPerLine = totalWords/lines)
# Most words per line
MacbethSummary %>%
arrange(desc(wordsPerLine)) %>%
head(10)
# Most total content
MacbethSummary %>%
arrange(desc(totalWords)) %>%
head(10)
Who do you choose?
MyLines <-
AllMacbeth %>%
filter(grepl(" LADY MACBETH\\.", speakers))
head(MyLines)
Text Mining Macbeth
- I want to dig a little deeper…
- (translation: I found more 2 min Macbeth videos!)
- Q: What would be some interesting questions to explore when text mining Macbeth?
Text Mining Macbeth
- Shakespearean plays are believed to share a common literary structure
- Each ACT in the play has a different, but important role in the drama
- Problem: people talk weird in Macbeth (I’m kidding… sort of)
- Let’s do some text analysis of a “modern” translation of the play
- http://www.nosweatshakespeare.com/shakespeares-plays/modern-macbeth/
- Q: How would you solve this so we could process the text?
Text Analysis
- Now, I’ve got a whole directory full of text files.
- I was at least smart enough to name them in a structured way:
- dedicated directory location
- useful naming convention: “act1_scene1.txt”
readtext
can read all the “*.txt” files in the requested directory
docvarsfrom
indicates that the file name has some variables I want
dvsep
indicates that they are separated by an underscore
- result is a data frame
require(readtext)
ModernMacbeth <-
readtext::readtext(file = "/Users/mattbeckman/Documents/GitHub/Teaching/STAT-380/2019 Spring/ClassNotes/15-mdsr/plain_Macbeth/*.txt", docvarsfrom = "filenames", dvsep = "_") %>%
as_tibble() # as.tibble() is deprecated in recent versions of tibble
head(ModernMacbeth)
Corpus
- we’re often interested in many documents when text mining
- a corpus is a collection of many documents
- Q: In our example:
- what is the corpus?
- what are the documents?
- Q: Describe a “corpus” and the associated “documents” for
- a different analysis of Shakespeare?
- some other author?
- other examples?
Modern Macbeth Corpus
- before we can really dig into our study of the differences among acts in Macbeth, we need to build up a couple text analysis tools
- we’ll use the full text of Modern Macbeth to get started
Tidy Text Format
- Recall: tidy data has a specific structure
- each case is a row
- each variable is a column
- Tidy text could similarly be described as a table with one token-per-row
- a token is a meaningful unit of text useful for analysis
- word (very common)
- n-gram (sequence of words)
- sentence
- paragraph
- tokenization is the process of splitting text into tokens
Tokenization step
- let’s tokenize Modern Macbeth
- we’ll use some tools from the
tidytext
package
unnest_tokens
breaks the text into individual tokens… automatically
- it keeps the other columns–act & scene here
- strips punctuation
- converts to lowercase (default arg:
to_lower = TRUE
)
require(tidytext)
ModernMacbeth %>%
select(text, act = docvar1, scene = docvar2) %>%
unnest_tokens(output = word, input = text) # single word tokenization
# unnest_tokens(output = word, input = text, token = "ngrams", n = 3) # n-gram tokenization
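To see roughly what single-word tokenization involves, here is a hand-rolled sketch in base R. It is not how `unnest_tokens()` is implemented; `tokenize_words()` and the `docs` data frame are hypothetical names made up for this illustration.

```r
docs <- data.frame(act  = c("Act1", "Act2"),
                   text = c("So foul and fair a day!", "Is this a dagger?"),
                   stringsAsFactors = FALSE)

# Lowercase, replace anything that is not a letter, apostrophe, or space,
# then split on whitespace
tokenize_words <- function(txt) {
  txt <- tolower(txt)
  txt <- gsub("[^a-z' ]+", " ", txt)
  strsplit(trimws(txt), "\\s+")[[1]]
}

tokens <- lapply(docs$text, tokenize_words)

# One token per row, keeping the act each token came from
tidy_tokens <- data.frame(act  = rep(docs$act, lengths(tokens)),
                          word = unlist(tokens),
                          stringsAsFactors = FALSE)
tidy_tokens
```

The result has one row per word (10 rows here), which is the one-token-per-row shape described above.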
Bag of words
- We’re using a simple approach to text analysis called bag of words
- Bag of words preserves term frequencies, but disregards word order, grammar, etc
ModernMacbeth_tidy <-
ModernMacbeth %>%
select(text, act = docvar1, scene = docvar2) %>%
unnest_tokens(output = word, input = text) # single word tokenization
Token frequency
- Q: What have we learned about Act 1?
- Q: What should we do next?
ModernMacbeth_tidy %>%
count(word, sort = TRUE)
Stop words
- stop words are words that are not useful for an analysis
- extremely common words like “the”, “and”, “to”, as shown
- words that aren’t useful in analysis like “accordingly”
- The contents of the stop word list matter a lot
tidytext
package includes a data set called stop_words
to get you started
- The stop words are different for each language
- You might add/modify the stop words based on specialized expertise with the context
- Q: What if we had used the original Shakespeare?
# native text
ModernMacbeth_tidy %>%
count(word, sort = TRUE)
# load stop word list (English)
data("stop_words") # from `tidytext` package
head(stop_words)
tail(stop_words)
Removing stop words
- filter out stop words
- equivalently, we could use an
anti_join
here
- it turns out Macbeth is a big deal and he’s all over the play
- we might consider removing “macbeth” from the bag of words
- Q: how about another suggestion?
- …maybe something that seems like it clearly should have been removed already??
- is our list of stop words really that bad??
# # `stop_words` has two columns: "word" and "lexicon"
# stop_words <-
# rbind(stop_words,
# c("macbeth", "custom"))
#
# stop_words %>%
# filter(word == "it's")
#
# # we'll need to use RegEx to clean these up
# grep(pattern = "it.s", x = ModernMacbeth_tidy$word, value = TRUE)
# grep(pattern = "’", x = ModernMacbeth_tidy$word, value = TRUE) %>% head(30)
ModernMacbeth_tidy <-
ModernMacbeth_tidy %>%
# mutate(word = gsub(pattern = , replacement = , x = word)) %>%
filter(!(word %in% stop_words$word))
ModernMacbeth_tidy %>%
count(word, sort = TRUE)
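One likely reason a word such as “it’s” survives the filter is the apostrophe: the text may use a curly apostrophe while the stop word list uses a straight one, so the two strings are not equal. A sketch with a made-up word vector and a made-up three-word stop list (`toy_stop`):

```r
words    <- c("the", "king", "it\u2019s", "macbeth", "and", "dagger")  # \u2019 is the curly apostrophe
toy_stop <- c("the", "and", "it's")                                    # straight apostrophe

# Without cleanup, the curly-apostrophe word is not recognized as a stop word
words[!(words %in% toy_stop)]

# Normalize the apostrophe first, then filter
words_clean <- gsub(pattern = "\u2019", replacement = "'", x = words)
kept <- words_clean[!(words_clean %in% toy_stop)]
kept  # "king" "macbeth" "dagger"
```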
Tidy text pipes to ggplot2
- Let’s make a histogram of word frequencies
- Q: What’s going on in Act 1 of Macbeth?
ModernMacbeth_tidy %>%
mutate(word = gsub(pattern = "’", replacement = "'", x = word)) %>%
count(word, sort = TRUE) %>%
filter(n > 20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()

Word clouds
- word clouds are another popular visualization of token frequency (analogous to our histogram)
- the
wordcloud
package uses base R graphics, but the with( )
function helps
- Q: What’s Macbeth about?
- What have we learned from this?
- What should we do next?
require(wordcloud) # provides wordcloud()
ModernMacbeth_tidy %>%
filter(!(word %in% stop_words$word)) %>%
count(word) %>%
with(., wordcloud(word, n, max.words = 45))

word clouds: Modern vs original
- I’ve made a helper function to prep the corpus
- I also added “said” to our stop word list
c(stopwords("english"), "said")
- should “macbeth” be added to the list?
AllMacbeth_tidy <-
AllMacbeth %>%
select(speakers, lines) %>%
unnest_tokens(output = word, input = lines) # single word tokenization
head(AllMacbeth_tidy)
# helper function for making wordclouds
macbeth_wordcloud <- function(Corpus_tidy, maxWords = 45) {
Corpus_tidy %>%
filter(!(word %in% stop_words$word)) %>%
count(word) %>%
with(., wordcloud(word, n, max.words = maxWords))
}
macbeth_wordcloud(Corpus_tidy = AllMacbeth_tidy)

macbeth_wordcloud(Corpus_tidy = ModernMacbeth_tidy)

Act by Act word clouds
- Q: Can we tell anything about the storyline?
- Q: How reliably can word clouds convey the meaning of each Act?
macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act1"))

macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act2"))

macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act3"))

macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act4"))

macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act5"))

Recall
- Humans:
- excellent at understanding text
- not as good at storing text
- Computers:
- excellent at storing text
- not as good at understanding text
- Human + Computer??
- excellent at understanding & storing text?
- not so good at storing & understanding text?
- (depends on the human & the computer…?)
Sentiment
- token frequencies are OK, but words evoke emotion
- humans are (usually) excellent at understanding emotions
- word choice can indicate
- general positive/negative emotion
- specific emotions like surprise or disgust
- good news!
- people have compiled detailed word lists for this
tidytext
package includes several in the sentiments
data set
require(tidytext)
data("sentiments")
head(sentiments, 10)
tail(sentiments, 10)
Sentiment lexicons
sentiments
contains three general purpose lexicons
- AFINN–rating from -5 (very negative) to +5 (very positive)
- bing–positive/negative classification
- nrc–classification into categories
- positive
- negative
- fear
- sadness
tidytext
provides function get_sentiments( )
to choose a lexicon
- lots of words are quite neutral, so they can be excluded from sentiment lexicons
- the sentiments in these lexicons might be constructed by crowdsourcing or research, then validated using restaurant, movie, or Amazon reviews
- Q: why might we hesitate to apply these sentiment lexicons to Shakespeare’s literature?
- Q: Any other concerns that might make it hard to capture sentiment?
- one-word tokens
- extremely large chunks of text (all of Macbeth)
get_sentiments("nrc") %>%
group_by(sentiment) %>%
summarise(N = n()) %>%
arrange(desc(N))
Sentiment analysis of Macbeth
- since we’re working with tidy data, we can use an
inner_join
for sentiment analysis
- this converts several “text mining” tasks into simple tidy data analysis tasks
- let’s investigate some common terms in Macbeth that suggest “anticipation”
nrc_anticipation <- get_sentiments("nrc") %>%
filter(sentiment == "anticipation")
ModernMacbeth_tidy %>%
inner_join(nrc_anticipation) %>%
count(word, sort = TRUE)
Joining, by = "word"
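The join logic can be sketched in base R with `merge()`, which keeps only the words present in both tables, much like an inner join. The tokens and the three-word lexicon below are invented for illustration; they are not taken from the nrc lexicon.

```r
tokens <- data.frame(word = c("hope", "dagger", "hope", "time", "king"),
                     stringsAsFactors = FALSE)
toy_lexicon <- data.frame(word = c("hope", "time", "murder"),
                          sentiment = "anticipation",
                          stringsAsFactors = FALSE)

# Inner join: rows of `tokens` whose word appears in the lexicon
matched <- merge(tokens, toy_lexicon, by = "word")

# Count the matched words, most frequent first
counts <- sort(table(matched$word), decreasing = TRUE)
counts
```

Words that are missing from the lexicon ("dagger", "king") simply drop out, which is why the choice of lexicon shapes the result so strongly.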
Changes in Sentiment
- small sections of text may not have enough words to communicate sentiment
- extremely large sections might average out the sentiment we want to capture
- for Shakespeare’s original text, we might choose blocks of 80 lines or so
- for Modern Macbeth, we’ll choose whole scenes (perhaps a bit large/variable)
- we’ll calculate a metric to assess each scene: the mean AFINN score of its words
- from -5 (very negative) to +5 (very positive)
- values above 0 mean the scene leans positive
ModernMacbeth_sentiment <-
ModernMacbeth_tidy %>%
mutate(act_scene = paste(act, scene, sep = "_")) %>%
inner_join(get_sentiments("afinn")) %>%
group_by(act, scene) %>%
summarise(sentiment = mean(score, na.rm = TRUE))
Joining, by = "word"
head(ModernMacbeth_sentiment)
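The group-and-average step can be sketched in base R with `aggregate()`. The scores below are made-up values on an AFINN-like scale, assumed to be already joined to their words.

```r
scored <- data.frame(act   = c("Act1", "Act1", "Act1", "Act2"),
                     scene = c("Scene1", "Scene1", "Scene2", "Scene1"),
                     score = c(2, -3, 1, -2))

# Mean score for every act/scene combination
scene_sentiment <- aggregate(score ~ act + scene, data = scored, FUN = mean)
scene_sentiment
```

Here Act1/Scene1 averages to -0.5 because its two scored words (2 and -3) partly cancel, which illustrates how larger blocks of text can average out the sentiment.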
Plot changes in sentiment by Act
ModernMacbeth_sentiment %>%
rownames_to_column() %>%
mutate(rowname = parse_number(rowname)) %>%
ggplot(aes(x = rowname, y = sentiment, fill = act)) +
geom_bar(stat = "identity", show.legend = FALSE) +
ggtitle("Sentiment Analysis of each Act & Scene in Modern Macbeth")

Common positive and negative words
- we have a tidy data frame with both sentiment and word
- we can analyze word counts that contribute to each sentiment
bing_word_counts <-
ModernMacbeth_tidy %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
Joining, by = "word"
bing_word_counts
Plotting word frequency by sentiment
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
Selecting by n

Wordclouds with feeling
- originally, we may have had trouble discerning what the text was about because so many of the common words don’t evoke sentiment (e.g., Macduff, Banquo, Malcolm)
- how about we revisit our wordcloud based on positive/negative sentiment!
- we’ll use
wordcloud::comparison.cloud( )
- note: we’ll need
acast()
to turn the data frame into a matrix
- Q: Anything strange about this word cloud?
ModernMacbeth_tidy %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 45))
require(reshape2)
ModernMacbeth_tidy %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"), max.words = 45)
Joining, by = "word"

Positive scenes in Macbeth
- we want to find the most positive scene in each act
- the ratios aren’t good
- in the actual text, things go pretty south pretty fast…
- Act 1, Scene 6: Duncan & Banquo on a leisurely ride through the countryside (genuinely pleasant)
- Act 2, Scene 1: Macbeth trying to act natural for Banquo (while planning to kill Duncan)
- Act 3, Scene 2: Macbeth wishing he was dead like Duncan
- Act 4, Scene 2: super long, I didn’t read it all
- Act 5, Scene 6: Malcolm & Macduff psych themselves up to kill Macbeth
bingpositive <- get_sentiments("bing") %>%
filter(sentiment == "positive")
wordcounts <- ModernMacbeth_tidy %>%
group_by(act, scene) %>%
summarize(words = n())
ModernMacbeth_tidy %>%
semi_join(bingpositive) %>%
group_by(act, scene) %>%
summarize(positivewords = n()) %>%
left_join(wordcounts, by = c("act", "scene")) %>%
mutate(ratio = positivewords / words) %>% # ratio of positive words
filter(scene != 0) %>%
top_n(1) %>%
ungroup()
Joining, by = "word"
Selecting by ratio
Analysis of word & document frequency
- a central question in natural language processing is how to quantify what a document is about
- one approach as we have discussed is simply term frequency
- this may draw attention to common words regardless of importance
- stop word removal attempts to adjust term frequency for commonly used words
- this is a fairly crude approach; perhaps we can do better
Document term matrices
- alternative approach for assessing term frequency is the document-term matrix
- sparse matrix describing a collection (corpus) of documents
- row for each document
- column for each term
- values are typically word count or “tf-idf” (see below)
- term frequency–\(tf(t, d)\)–is simply the frequency of term \(t\) in document \(d\)
- inverse document frequency–\(idf\)–measures how rare term \(t\) is across a set of documents \(D\)
- words that are common across many documents like “the” get a low weight
- numerator is the number of documents in \(D\)
- denominator is the number of documents in \(D\) containing the term
- \(tf-idf\) multiplies the two
- frequency of a term adjusted for how rarely it is used
- roughly measures how important a word is to a document in a collection (or corpus) of documents
\[idf(t, D) = \text{ln}\frac{|D|}{|\{d \in D : t \in d\}|}\]
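To make the definitions above concrete, here is a minimal base R sketch that computes \(tf\), \(idf\), and their product by hand. The three tiny "documents" and the helper names (`tf`, `idf`, `tf_idf`) are hypothetical, made up for illustration; this is not the `tidytext` implementation.

```r
# Toy corpus (hypothetical): three tiny "documents" as tokenized word vectors
docs <- list(
  d1 = c("the", "witch", "the", "cauldron"),
  d2 = c("the", "king", "the", "crown"),
  d3 = c("the", "witch", "king")
)

# tf(t, d): share of the tokens in document d that are term t
tf <- function(term, doc) sum(doc == term) / length(doc)

# idf(t, D): natural log of (number of documents / documents containing t)
idf <- function(term, docs) {
  n_containing <- sum(sapply(docs, function(d) term %in% d))
  log(length(docs) / n_containing)
}

# tf-idf is the product of the two
tf_idf <- function(term, doc, docs) tf(term, doc) * idf(term, docs)

tf_idf("the", docs$d1, docs)       # "the" is in every document: idf = ln(3/3) = 0
tf_idf("cauldron", docs$d1, docs)  # only in d1: (1/4) * ln(3/1)
tf_idf("witch", docs$d1, docs)     # in two documents: (1/4) * ln(3/2)
```

Note that "the" is the most frequent word in `d1` yet scores zero, which is the adjustment stop word removal was only approximating.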
Term frequency in Acts of Macbeth
- What are the most commonly used words in each Act?
- term frequency
- \(tf-idf\)
- we will “start over” from the source (without removing stop words)
- each row in `act_words` is a unique word-act combination
- `n` is the number of times that word is used in that act
- `total` is the total words in the act
# tf within each act
act_words <-
ModernMacbeth %>%
select(text, act = docvar1, scene = docvar2) %>%
unnest_tokens(output = word, input = text) %>%
mutate(word = gsub(pattern = "’", replacement = "'", x = word)) %>%
count(act, word, sort = TRUE)
# total words in the act
total_words <-
act_words %>%
group_by(act) %>%
summarize(total = sum(n))
act_words <- left_join(act_words, total_words)
Joining, by = "act"
act_words
Term frequency in Macbeth
- let’s look at the distribution of `n / total` for each act
- this is the term frequency, \(tf\)
- very long right tails (those extremely common words)
- These plots exhibit similar distributions for all the acts
- many words that occur rarely
- fewer words that occur frequently
act_words %>%
ggplot(aes(n / total, fill = act)) +
geom_histogram(show.legend = FALSE) +
facet_wrap(~ act, ncol = 2, scales = "free_y") +
ggtitle("Term frequency in Acts of Macbeth")

Zipf’s law
- long-tailed distributions are common in any given corpus of natural language (like a book, or a lot of text from a website, or spoken words)
- a classic version of this relationship is called Zipf’s law, after George Zipf, a 20th century American linguist.
- Zipf’s law states that the frequency with which a word appears is inversely proportional to its rank.
freq_by_rank <-
act_words %>%
group_by(act) %>%
mutate(rank = row_number(), # since already ordered by `n`
`term frequency` = n / total)
freq_by_rank
Visualization of Zipf’s law
- Zipf’s law visualized by plotting (on logarithmic scales)
- x-axis: `rank` of each word within the frequency table
- y-axis: `term frequency`
- result: inversely proportional relationship has a constant, negative slope
- this type of result is known as a power law https://en.wikipedia.org/wiki/Power_law
- relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities
- one quantity varies as a power of another
- area of a square vs length of side; 2 times length >> \(2^2\) times the area
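Written out, the power-law relationship looks roughly like this. The constant \(C\) and exponent \(\alpha\) are notation introduced here for illustration, not symbols from the slides:

\[\text{frequency}(r) \approx \frac{C}{r^{\alpha}} \quad\Longrightarrow\quad \log_{10}\text{frequency}(r) \approx \log_{10} C - \alpha \log_{10} r\]

Zipf’s law in its classic form is the case \(\alpha = 1\), so on log-log axes the curve becomes a straight line with slope \(-1\). In the square example, area \(= \text{side}^2\) is a power law with exponent 2.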
freq_by_rank %>%
ggplot(aes(rank, `term frequency`, color = act)) +
geom_line(size = 1.1, alpha = 0.8, show.legend = TRUE) +
scale_x_log10() +
scale_y_log10() +
ggtitle("Zipf's law for acts of Macbeth")

Investigating term frequency in Macbeth
- result wasn’t quite linear
- deviations at high rank are not uncommon for many kinds of language;
- a corpus often contains fewer rare words than predicted by a single power law.
- deviations at low rank are more unusual
- Modern Macbeth used a lower percentage of the most common words than many collections of language.
- analysis could be extended to compare authors, or other collections of text (sonnets vs tragedies)
rank_subset <- freq_by_rank %>%
filter(rank < 200,
rank > 10)
lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)
Call:
lm(formula = log10(`term frequency`) ~ log10(rank), data = rank_subset)
Coefficients:
(Intercept) log10(rank)
-0.946 -0.909
freq_by_rank %>%
ggplot(aes(rank, `term frequency`, color = act)) +
geom_abline(intercept = -0.946, slope = -0.909, color = "gray50", linetype = 2) +
geom_line(size = 1.1, alpha = 0.8, show.legend = TRUE) +
scale_x_log10() +
scale_y_log10() +
ggtitle("Zipf's law for Modern Macbeth")

tf-idf
- idea of tf-idf: find important words for the content of each document by
- decreasing the weight for commonly used words
- increasing the weight for words that are not used very much in the corpus of documents
- Calculating \(tf-idf\) attempts to find the words that are important (i.e., common) in a text, but not too common across all texts.
- the `bind_tf_idf( )` function in the `tidytext` package takes a tidy text dataset as input
- only need one row per token (term), per document
- column for the words, another identifying source document (act)
- we calculated `total` for each act previously, but don’t need it
- \(idf\)
- zero for words that are common across all documents (\(tf-idf\) is then zero too)
- higher for words that appear in fewer documents
act_words <-
act_words %>%
bind_tf_idf(word, act, n)
act_words
High tf-idf in Macbeth
- common to see proper nouns (people/places) from each document with the highest tf-idf
- interesting here to see other featured tokens (cauldron, hail, etc)!
- note: discreteness in the \(idf\) here because we have only 5 documents
- term that appears in only one of the five acts: \(idf = ln(5/1) = 1.6094\)
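The discreteness is easy to see by listing every value \(idf\) can take with five documents. A quick base R sketch (the variable names are just for illustration):

```r
n_docs <- 5                    # five acts, treated as five documents
docs_containing <- 1:n_docs    # a term can appear in 1, 2, ..., 5 acts
idf_values <- log(n_docs / docs_containing)
round(idf_values, 4)
# 1.6094 0.9163 0.5108 0.2231 0.0000
```

Every word in the play gets one of these five \(idf\) values, so differences in \(tf-idf\) within a group come entirely from \(tf\).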
act_words %>%
select(-total) %>%
arrange(desc(tf_idf))
Visualizing high tf-idf words
act_words %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(act) %>%
top_n(10) %>%
ungroup() %>%
ggplot(aes(word, tf_idf, fill = act)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~act, ncol = 2, scales = "free") +
coord_flip()
Selecting by tf_idf

Document-term matrix
- DTMs are usually very sparse since lots of term-document pairs don’t occur
- DTMs don’t play nice with tidy tools & data frames don’t play nice with most text mining packages
- the `tidytext` package has tools to easily convert between object types
- `tidy()` turns a DTM into a tidy data frame
- `cast_dtm()` (with a “T”) turns a tidy, one-term-per-row, data frame into a DTM (for tools in the `tm` pkg)
- `cast_dfm()` (with an “F”) turns a tidy, one-term-per-row, data frame into a DFM (for tools in the `quanteda` pkg)
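To see the shape change without installing anything, here is a base R stand-in: `xtabs()` plays the role of a cast from tidy counts to a document-by-term matrix. This is only a sketch of the idea on made-up data, not the sparse `DocumentTermMatrix` object the `tm` tools expect.

```r
# Hypothetical tidy counts: one row per (document, term) pair that actually occurs
tidy_counts <- data.frame(
  document = c("act1", "act1", "act2", "act2", "act3"),
  term     = c("witch", "king", "king", "dagger", "ghost"),
  n        = c(3, 1, 2, 4, 5)
)

# Rows = documents, columns = terms, cells = counts (0 where a pair never occurs)
dtm <- xtabs(n ~ document + term, data = tidy_counts)
dtm

# Share of document-term cells that are zero (the "sparsity")
mean(dtm == 0)
```

Even in this toy case most cells are zero (7 of 12), which is why real DTMs are stored in a sparse format.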
Topic Modeling
- sometimes useful to find natural groups among documents of some collection/corpus (blog posts, wikipedia pages, etc)
- Topic modeling–method for unsupervised classification of documents
- find natural groups even if we’re not totally sure what to look for
- similar to clustering on numeric data
- Latent Dirichlet allocation (LDA) is a particularly popular method
Latent Dirichlet allocation (LDA)
- Latent Dirichlet allocation (LDA)
- every document is a mixture of topics
- every topic is a mixture of words
- LDA basically estimates the composition of both mixtures at the same time
- result allows content of documents to “overlap” rather than enforcing discrete groups…just like we do in natural language use.
- the `topicmodels` package will help us on our way
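The "two mixtures" idea can be sketched with two small matrices. All of the numbers, topic labels, and words below are hypothetical and chosen by hand; nothing here is estimated by LDA.

```r
# Every topic is a mixture of words: one row per topic, each row sums to 1
beta <- rbind(
  topic1 = c(president = 0.6, budget = 0.3, movie = 0.1),
  topic2 = c(president = 0.1, budget = 0.2, movie = 0.7)
)

# Every document is a mixture of topics: one row per document, each row sums to 1
gamma <- rbind(
  doc1 = c(topic1 = 0.9, topic2 = 0.1),
  doc2 = c(topic1 = 0.5, topic2 = 0.5)
)

# Implied word probabilities for each document: a blend of the topic rows
word_probs <- gamma %*% beta
word_probs
rowSums(word_probs)   # each document's word probabilities still sum to 1
```

LDA works in the other direction: it only sees the word counts and tries to estimate both `beta` and `gamma` at once.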
Preparing for LDA
- We’ll borrow a data set of articles from the Associated Press to illustrate some principles first.
- Don’t worry, back to Macbeth in a bit
- Note: We need the data as a DTM (not tidy form)
- Q: can you explain what’s happening at each line?
- Q: how many rows & columns are in our `AssociatedPress` matrix?
data("AssociatedPress")
AssociatedPress
<<DocumentTermMatrix (documents: 2246, terms: 10473)>>
Non-/sparse entries: 302031/23220327
Sparsity : 99%
Maximal term length: 18
Weighting : term frequency (tf)
Two-topic Latent Dirichlet allocation (LDA) model
- fitting the LDA model is the easy part with help of the `LDA( )` function
- note: we set an arbitrary seed in this case so we have same result each time… not necessary in general
- There are almost certainly more than two topics, but this is a start
- For a corpus of news articles like the AP data, we might expect the topics to be something like “politics” and “entertainment”… recall:
- each document could include a mix of topics (politics & entertainment)
- each topic has a mix of words
- politics topic might include words like ‘president’, ‘congress’, and ‘government’
- entertainment topic might include ‘movies’, ‘television’, and ‘actor’
- both topics might frequently include a word like ‘budget’
require(topicmodels)
ap_lda <-
AssociatedPress %>%
LDA(k = 2, control = list(seed = 1234))
ap_lda
A LDA_VEM topic model with 2 topics.
Per-topic-per-word probabilities
- want to extract the per-topic-per-word probabilities, \(\beta\) (“beta”), from the model
- `tidytext::tidy( )` provides a method for this
- one-topic-per-term-per-row format
- For each combination, model computes probability of term being generated from that topic
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
Top 10 per-topic-per-word probabilities (\(\beta\))
- Q: how might you characterize each topic now?
- note words characteristic of each topic
- note words common to both topics
ap_top_terms <- ap_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
ap_top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()

Words with greatest difference in \(\beta\) between topics
- estimated based on the log ratio of the two: \(\text{log}_2(\frac{\beta_2}{\beta_1})\)
- a log ratio is useful because it makes the difference symmetrical
- \(\beta_2\) being twice as large leads to a log ratio of 1
- \(\beta_1\) being twice as large results in -1
- we filter for relatively common words with \(\beta > 1/1000\) in at least one topic
- Q: what can we learn by inspecting the extremes?
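A tiny base R check of the symmetry claim, using made-up \(\beta\) values for three terms:

```r
# Hypothetical betas for three terms
beta_topic1 <- c(0.002, 0.004, 0.003)
beta_topic2 <- c(0.004, 0.002, 0.003)

log_ratio <- log2(beta_topic2 / beta_topic1)
log_ratio
# 1 -1 0

# The raw ratio is lopsided by comparison: 2, 0.5, 1
beta_topic2 / beta_topic1
```

On the log scale "twice as likely in topic 2" and "twice as likely in topic 1" sit the same distance from zero, which makes the extremes easy to compare.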
beta_spread <- ap_topics %>%
mutate(topic = paste0("topic", topic)) %>%
spread(topic, beta) %>%
filter(topic1 > .001 | topic2 > .001) %>%
mutate(log_ratio = log2(topic2 / topic1)) %>%
arrange(desc(log_ratio))
beta_spread
Per-document classification
- LDA also models each document as a mixture of topics
- We examine the per-document-per-topic probabilities, \(\gamma\) (“gamma”)
- Gamma is proportion of words from that document, generated from that topic. For example,
- about 25% of the words in document 1 come from topic 1
- about 82% of the words in document 5 come from topic 2
ap_documents <- tidy(ap_lda, matrix = "gamma")
ap_documents %>%
arrange(document, gamma)
Let’s investigate a few interesting documents
- Document 3 is almost 50-50 between our two topics
- Document 6 is almost entirely topic 2
- Q: what does each document seem to be about?
- Q: does \(\gamma\) attributed to each topic make sense in these cases?
# investigate document 3 (change the filter to 6 for the other)
tidy(AssociatedPress) %>%
filter(document == 3) %>%
arrange(desc(count))
Modeling Topics as Acts in Macbeth…?
- Recall: our motivating question was to try and learn if we could use the content of Macbeth to expose structure among the 5 acts
- Results so far:
- wordclouds: “meh”
- sentiment analysis: super negative…no wonder Macbeth is called a “tragedy”
- tf-idf: hard to tell much… all the documents were written by the same person for the same purpose, so not surprised
- LDA: ???
- Goal: use LDA to model `k = 5` topics
- common to try a few different values of `k` (number of topics)
- start with 5 here because we know there are 5 acts in the play
# convert to Document Term Matrix
ModernMacbeth_DTM <-
ModernMacbeth %>%
mutate(act_scene = gsub(pattern = "\\.txt", replacement = "", x = doc_id)) %>%
unnest_tokens(output = word, input = text) %>%
mutate(word = gsub(pattern = "’", replacement = "'", x = word)) %>%
anti_join(stop_words, by = c("word" = "word")) %>%
count(act_scene, word, sort = TRUE) %>%
cast_dtm(document = act_scene, term = word, value = n)
# LDA topic model (5 topics)
scenes_lda <-
ModernMacbeth_DTM %>%
LDA(k = 5, control = list(seed = 380))
# Per-topic-per-word probabilities
scene_topics <- tidy(scenes_lda, matrix = "beta")
scene_topics
# Visualize top per-topic-per-word probabilities
top_terms <-
scene_topics %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()

# Per-document classification
scenes_gamma <- tidy(scenes_lda, matrix = "gamma")
scenes_gamma %>%
arrange(document, gamma)
# Topics vs Acts?
scenes_gamma <-
scenes_gamma %>%
separate(col = document, into = c("act", "scene"), sep = "_", convert = TRUE)
scenes_gamma %>%
ggplot(aes(factor(topic), gamma)) +
geom_boxplot() +
facet_wrap(~ act)

Scene alignment to topics
- Okay, the topics aren’t the acts
- turns out this enduring Shakespearean masterpiece needs a more nuanced interpretation
- Literary themes could align better to the Topics identified by our LDA…
- here’s one attempt I found, with my wife’s description of each color:
- ambition (lavender)
- fate (poppy)
- violence (Tiffany blue)
- nature & unnatural (sand)
- masculinity (Kendall gray)
---
title: "Text as Data"
subtitle: "MDSR Ch 15 <br> Text Mining with R (Silge & Robinson, 2019)"
output: 
  slidy_presentation: default
  html_notebook: default  
---


```{r Front Matter, echo=TRUE, message=FALSE, warning=FALSE, include=FALSE}
# clean up R environment
rm(list = ls())

# global options
knitr::opts_chunk$set(eval=TRUE, include=TRUE)
options(digits=4)

# load all packages here
library(mdsr)
library(tidyverse)
library(stringr)
library(readtext)
library(tidytext)
library(wordcloud)
library(reshape2)
library(topicmodels)


# inputs summary
# data("Macbeth_raw")   # MDSR backup version of Macbeth
data("stop_words")    # from `tidytext`
data("sentiments")    # from `tidytext`
data("AssociatedPress")  # from 'topicmodels'

```


# Agenda

- Regular expressions 
- Basic text analysis
    - corpora
    - word clouds
    - sentiment analysis
- Intermediate
    - term frequency x inverse document frequency (*tf-idf*)
    - Latent Dirichlet allocation (LDA; topic modeling)

#### Announcements

- Final Project update
- MDSR Ch 15 programming notebook assigned
- Beckman's office hour cancelled on Thurs 4/4 (see syllabus for other options)
- MDSR Ch 15 Exercises due 4/14 before midnight
- Lots of errata/tips this chapter

#### MDSR Ch 15 Errata / Tips

- Some sections don't require programming, but please still include the headers for navigation purposes
- p. 361: Note--`data("DataSciencePapers")` should appear BOTH in front matter (per style guide) AND as a code chunk in sequence with the other MDSR code so your section 15.2 results in the programming notebook will match the MDSR book rather than using new data queried from arXiv also loaded on p. 361 under the object name `DataSciencePapers`
- p. 365: your word cloud won't be identical and may even exclude "data" if it doesn't fit on the page
- p. 368: Wikipedia changed the tables all around. Use `Table[[4]]` when scraping; `Title` is now `Song`; the results won't exactly match but you'll be able to work it out.
- p. 369: change `25` to `15` (the result is quite a different message!)
- p. 370: you don't need a Twitter account, the provided credentials work... here they are:
    - consumer_key = "u2UthjbK6YHyQSp4sPk6yjsuV" 
    - consumer_secret = "sC4mjd2WME5nH1FoWeSTuSy7JCP5DHjNtTYU1X6BwQ1vPZ0j3v" 
    - access_token = "1365606414-7vPfPxStYNq6kWEATQlT8HZBd4G83BBcX4VoS9T" 
    - access_secret = "0hJq9KYC3eBRuZzJqSacmtJ4PNJ7tNLkGrQrVl00JHirs"
- p. 372: you'll need to load the `RSQLite` library
- p. 373: you can skip the `geocode(...)` function from the `ggmap` package and just assign `lon` and `lat` directly.  You can even look up the coordinates for State College and substitute that if you like.  Oddly enough, you might have to give a credit card number to Google if you want to use `geocode()` so feel free to modify your code to avoid that step.


# Road map

<center>

![https://www.tidytextmining.com/topicmodeling.html](tidytextFlow.png){width=95%}

</center>




# Project Gutenberg 

- <https://www.gutenberg.org/>
- Project Gutenberg is a massive free online library
- full-text of almost 60,000 free ebooks (e.g., expired copyright)
    - Jane Austen
    - Charles Dickens
    - Mark Twain
    - Arthur Conan Doyle
    - Oscar Wilde
    - **Bill Shakespeare**
    - (and loads more...)
    - New on 3/29/2019--The American Bee Journal!


# Macbeth Summary

Our "muse" for this first portion is the famous play **Macbeth** by William Shakespeare.  If you aren't familiar with the play, here is a quick summary:

- 9 minutes: <https://youtu.be/uzAujyWpK_s>
- 2 minutes (1): <https://youtu.be/wDpmZbfDTJI>
- 2 minutes (2): <https://vimeo.com/124082577>
- 2 minutes (3): <https://youtu.be/F5nlx2XzP-4?t=3>
- 2 minutes (4): <https://youtu.be/TKYrZSJ2rUo>


# Text as data 

- Macbeth text: <http://www.gutenberg.org/cache/epub/1129/pg1129.txt>
- Q: What are some of the challenges you would expect when working with text as data?


# Macbeth text data intake

- Pretty fast!
- How does it look?
    - Q: Look closely. Can you read this?
    - Q: How can we make the text more understandable?

```{r}
macbeth_url <- "http://www.gutenberg.org/cache/epub/1129/pg1129.txt"
Macbeth_raw <- RCurl::getURL(macbeth_url)
# Macbeth_raw
```


# Text as data

- Humans: 
    - excellent at *understanding* text
    - not as good at *storing* text
- Computers: 
    - excellent at *storing* text
    - not as good at *understanding* text 
- Human + Computer??
    - excellent at understanding & storing text?
    - not so good at storing & understanding text?
    - (depends on the human & the computer...?)


# Back to Macbeth

- We need to split the one giant string of `Macbeth_raw` everywhere that we see the pattern `\r\n` (marking end of line)
- Q: What class of object do we have now?
- Q: Is it better?


```{r}
macbeth_tmp <- strsplit(Macbeth_raw, "\r\n")
str(macbeth_tmp)


macbeth <- strsplit(Macbeth_raw, "\r\n")[[1]]
length(macbeth)

head(macbeth)
```


# The set up...

- Let's say you enrolled in THEA 102 for a GA credit: Fundamentals of Acting  
- The class will choose a play by Shakespeare and then everyone will volunteer for the various parts.  
- There's someone in the class you want to impress 
    - You want a part that is important, 
    - but you don't want *too* many lines to memorize...


# Manual text analysis

- How would you count up the lines if you just have a physical copy of the play?
    - Q: For one character?
    - Q: For all characters?
    - Q: For all 37 of Shakespeare's plays?

```{r}
macbeth[295:310]

```


# Manual text analysis

- If you **had** to do it manually, you'd probably flip through the pages and make a tally (on a separate sheet) for each line
- Now that you know you can find Shakespeare's plays through Project Gutenberg, you can probably access ebooks for all of them
    - Q: What's the approach?
    - Q: Scrolling instead of "flipping" but otherwise same?
    - Q: Something else?


# Regular Expressions (RegEx)

- Humans are good at understanding text, but bad at storing lots of text and slow at searching
- Computers are good at storing & searching, but not good at "understanding" it.
- **Regular expressions**
    - Computer stores the text (good at this)
    - Human identifies pattern (good at this)
    - Human translates pattern so computer can search for it (takes practice...)
    - Computer searches the text (good at this)
- RStudio Cheatsheets: <https://www.rstudio.com/resources/cheatsheets/>


# `grep( )` & `grepl( )`

- these functions look for the needle (pattern) in the haystack (character vector)
    - needle/pattern: `"MACBETH"` (no leading spaces yet)
    - haystack/text:  `macbeth` character vector object
- Q: What's returned in each case?
    - Are these Macbeth's lines?
    - All mentions of Macbeth's name?
    - Are we done?
- Q: How are we doing?
    - Can we improve it further?
    - Compare with Ctrl + F on the website?


```{r}
macbeth_lines <- grep("MACBETH", macbeth)
# macbeth_lines <- grep("MACBETH", macbeth, value = TRUE)
# macbeth_lines <- grepl("MACBETH", macbeth)

length(macbeth_lines)
head(macbeth_lines)
```


```{r eval=FALSE}
identical(c(1:3), c(1L, 2L, 3L))  # are these the same?
identical(c(1:3), c(1, 2, 3))     # `identical` is PICKY

# are these the same?
identical(macbeth[grep("MACBETH", macbeth)],   
          macbeth[grepl("MACBETH", macbeth)])
```
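If the three variants blur together, a toy haystack makes the return types easy to compare. The strings below are made up for illustration, not lines from the play:

```r
haystack <- c("  MACBETH. Hello", "  BANQUO. Hi", "Enter MACBETH")

hits_index   <- grep("MACBETH", haystack)                # integer positions: 1 3
hits_value   <- grep("MACBETH", haystack, value = TRUE)  # the matching strings
hits_logical <- grepl("MACBETH", haystack)               # TRUE FALSE TRUE

hits_index
hits_value
hits_logical
```

`grep()` is handy when you want positions or the text itself; `grepl()` always returns one `TRUE`/`FALSE` per element, which is what `filter()` and `mutate()` need.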

<!-- Day2 -->


# Refining our RegEx

- How do you identify a speaker's lines? (Humans are good at this, not computers)
    - All CAPITAL LETTERS
    - Two leading spaces
    - ends with period
- Q: Close, but what's wrong here?

```{r}
macbeth_lines <- grep("  MACBETH.", macbeth, value = TRUE)

length(macbeth_lines)
head(macbeth_lines)
```


# Problem with period (`.`)

- In RegEx, the period is a metacharacter that matches ANY character
- `\\.` is required to literally search for periods in our RegEx pattern
    - `\\` is called an "escape" 
    - escape is required for any part of a RegEx that has special meaning, but you just want to search literally


```{r}
# anything with "MAC" and then another character
grep("MAC.", macbeth, value = TRUE) %>%
  head()

# MACBETH.
grep("MACBETH\\.", macbeth, value = TRUE) %>% head(20)

```
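A three-string sketch (made-up strings) shows the difference between the unescaped and escaped period. `fixed = TRUE` is included as an alternative when the whole pattern should be read literally:

```r
x <- c("MACBETH.", "MACBETHS", "MACBETH")

any_char <- grepl("MACBETH.", x)                # TRUE  TRUE  FALSE: `.` matches any one character
literal  <- grepl("MACBETH\\.", x)              # TRUE  FALSE FALSE: `\\.` matches a literal period
fixed    <- grepl("MACBETH.", x, fixed = TRUE)  # TRUE  FALSE FALSE: no metacharacters at all

rbind(any_char, literal, fixed)
```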

# Simple RegEx Tools

- `|`: alternation--search for a few specific alternatives
    - `"MAC(B|D)"` would match all strings that include "MACB" OR "MACD"
- `[ ]`: character sets--square brackets define sets of characters to match
    - `"MAC[BD]"` would match "MAC" followed by "B" or "D" (inside brackets, `|` is a literal character, not alternation)
    - `"MAC[C-Z]"` would match "MAC" followed by any capital letter from "C" through "Z"
- `^` or `$`: anchors--search for pattern strictly at the beginning (`^`) or end (`$`) of string
    - `"^ACT"` would match all lines strictly beginning with "ACT"
    - `"MACBETH$"` would match all lines strictly ending with "MACBETH" (no punctuation)
- `?` or `*` or `+` or `{n}`: repetitions
    - `"^ ?MAC"` would match if string begins with zero or one leading spaces followed by "MAC"
    - `"^ *MAC"` would match if string begins with zero or more leading spaces followed by "MAC"
    - `"^ +MAC"` would match if string begins with one or more leading spaces followed by "MAC"
    - `"^ {3}MAC"` would match if string begins with exactly 3 leading spaces followed by "MAC"

```{r}
# alternation with `|` (grouped in parentheses)
grep("MAC(B|D)", macbeth, value = TRUE) %>% head()

# "MAC" followed by any capital letter from "C" through "Z"
grep("MAC[C-Z]", macbeth, value = TRUE) %>% head(10)

# search for beginning of each act in the play (`^` goes at beginning)
grep("ACT", macbeth, value = TRUE) %>% head(10)
grep("^ACT", macbeth, value = TRUE) %>% head(10)

# strings strictly ending in "MACBETH" (`$` goes at end)
grep("MACBETH$", macbeth, value = TRUE) %>% head(10)

# repetitions
grep("^ ?MAC", macbeth, value = TRUE) %>% head()  # zero or one leading spaces
grep("^ *MAC", macbeth, value = TRUE) %>% head()  # zero or more leading spaces
grep("^ +MAC", macbeth, value = TRUE) %>% head()  # one or more leading spaces
grep("^ {2}MAC", macbeth, value = TRUE) %>% head()  # exactly two leading spaces
grep("^ {3}MAC", macbeth, value = TRUE) %>% head()  # exactly three leading spaces

```
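One detail worth checking by hand: inside square brackets, `|` is just another character in the set, so a bracketed `|` does not mean "or". A small sketch on made-up strings compares three ways of writing "B or D":

```r
x <- c("MACBETH", "MACDUFF", "MAC|X", "MACE")

bracket_pipe <- grepl("MAC[B|D]", x)   # TRUE TRUE TRUE  FALSE: also matches a literal "|"
char_set     <- grepl("MAC[BD]", x)    # TRUE TRUE FALSE FALSE: character set, B or D
group_alt    <- grepl("MAC(B|D)", x)   # TRUE TRUE FALSE FALSE: alternation inside a group

rbind(bracket_pipe, char_set, group_alt)
```

For single characters the character set is the shortest form; the grouped alternation earns its keep when the alternatives are longer, e.g. `"(MACBETH|MACDUFF)"`.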

# Speaker frequency in Macbeth 

- recall: you wanted an important character, but not a huge number of lines to memorize
    - want a decent number of lines
    - "important" could mean involved for most of the play (not killed immediately)
    - Q: Who meets our criteria?
    

```{r}
Macbeth <- grepl("  MACBETH\\.", macbeth)
Macduff <- grepl("  MACDUFF\\.", macbeth)
LadyMacbeth <- grepl("  LADY MACBETH\\.", macbeth)
LadyMacduff <- grepl("  LADY MACDUFF\\.", macbeth)
Banquo <- grepl("  BANQUO\\.", macbeth)
Duncan <- grepl("  DUNCAN\\.", macbeth)


speaker_freq <- data.frame(Macbeth, Macduff, LadyMacbeth, LadyMacduff, Banquo, Duncan) %>%
  mutate(line = 1:length(macbeth)) %>%
  gather(key = "character", value = "speak", -line) %>%
  mutate(speak = as.numeric(speak)) %>%
  filter(line > 218 & line < 3172)

glimpse(speaker_freq)
```



```{r}
# find the acts that subdivide the play
acts_idx <- grep("^ACT ", macbeth)
acts_labels <- str_extract(macbeth[acts_idx], "^ACT [IV]+")  # [IV]+ matches I, II, III, IV, V
acts <- data.frame(line = acts_idx, labels = acts_labels)

# plot speaker frequencies
speaker_freq %>%
  ggplot(aes(x = line, y = speak)) + 
  geom_smooth(aes(color = character), method = "loess", se = 0) + 
  geom_vline(xintercept = acts_idx, color = "darkgray", lty = 3) + 
  geom_text(data = acts, aes(y = 0.085, label = labels), 
            hjust = "left", color = "darkgray") + 
  ylim(c(0, NA)) + 
  xlab("Line Number") + 
  ylab("Proportion of Speeches")
```



# Further Analysis

- Getting closer... but we should probably open up the search a little
- **Problem**: We really only grabbed the first part of each line... 
    - What if we accidentally choose a character prone to long speeches? 
    - Might be too much to memorize... we need better information
- Q: How do you suggest we access *entire lines* for each part in the play?
    - How would you do it "by eye"?
    - How can you translate that into RegEx?

# Identifying lines & speakers

- Q: What does this RegEx pattern get for us?  `"\r\n {2}[A-Z 0-9]*\\. "`
- Q: What's the difference between the result for
    - `strsplit` 
    - `stringr::str_extract`
    - `stringr::str_extract_all`
- Q: Why `[-1]` at the end of `strsplit` command?


```{r}
lines <- strsplit(Macbeth_raw, "\r\n {2}[A-Z 0-9]*\\. ")[[1]][-1]
speakers <- stringr::str_extract_all(Macbeth_raw, " {2}[A-Z 0-9]*\\. ")[[1]]

AllMacbeth <- data.frame(speakers, lines)

```

# Clean up the lines

- Q: What's left to clean up?
    - Suggestions?

```{r}
head(AllMacbeth)
```


# Clean up the lines

- `gsub` is handy for "find & replace"
- if you want to simply cut some pattern out of the text
    - find it with the proper RegEx
    - replace with "" (that's a string with nothing in it!)

```{r}
AllMacbeth <- 
  AllMacbeth %>%
  mutate(lines = gsub(pattern = "[\r\n]+", replacement = "", x = lines))

head(AllMacbeth)
```
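A toy string (made up) shows what this kind of find & replace does, and one thing to watch for: replacing a line break with nothing glues the neighboring words together, while replacing it with a space keeps them apart.

```r
x <- "First line\r\nSecond line\r\n"

cut_out   <- gsub(pattern = "[\r\n]+", replacement = "",  x = x)
with_space <- gsub(pattern = "[\r\n]+", replacement = " ", x = x)

cut_out      # "First lineSecond line"
with_space   # "First line Second line "
```

Which replacement is right depends on the text; if the next line always starts with spaces, cutting the break out entirely does no harm.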

# More clean up

- The end of each Act seems to include something like the following text...

`<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM`
`SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS`
`PROVIDED BY PROJECT GUTENBERG ETEXT OF CARNEGIE MELLON UNIVERSITY`
`WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE`
`DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS`
`PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED`
`COMMERCIALLY.  PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY`
`SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>`

- Q: Where do you think that "junk" landed due to our approach so far? 
- Q: What should we do?
- Q: Is there more "junk" like this?


# Another `gsub`

- we're not going to clean up *everything*, but we'll do this one more...
- Q: Consider RegEx pattern: `"<<.*>>"`
    - What's it do?
    - Do you think this will work?
    - Do we need an escape for the special characters? `\\`

```{r}
junk <- grepl(pattern = "<<.*>>", x = AllMacbeth$lines)

# Before
AllMacbeth %>%
  filter(junk) %>%
  select(speakers, lines)

# Correction
AllMacbeth <- 
  AllMacbeth %>%
  mutate(lines = gsub(pattern = "<<.*>>", replacement = "", x = lines))


# After
AllMacbeth %>%
  filter(junk) %>%
  select(speakers, lines)
```
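One thing to keep in mind with `.*` is that it is greedy: it grabs as much as it can between the first `<<` and the last `>>` in a string. A sketch on a made-up string, with the lazy form `.*?` for comparison (`perl = TRUE` is used here so the lazy quantifier is handled by the PCRE engine):

```r
x <- "keep <<junk one>> keep too <<junk two>> end"

greedy <- gsub("<<.*>>", "", x)
lazy   <- gsub("<<.*?>>", "", x, perl = TRUE)

greedy   # "keep  end"            (everything between the outermost markers is gone)
lazy     # "keep  keep too  end"  (only the two marked spans are gone)
```

If a single element of `lines` ever held two notices with real dialogue in between, the greedy pattern would delete the dialogue as well, so it is worth checking the "Before" output for that case.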

<!-- Day3 -->

# Time to take a look!

- Q: How might we use our result so far to count how many words are attributed to each character?


# First attempt: Counting words

- Q: How's it look? 
>- Q: How would you debug this?

```{r}
countWords <- function(line) {
  return(length(strsplit(x = line, split = "\\s+")))
}

AllMacbeth$nWords <- sapply(X = AllMacbeth$lines, FUN = countWords)

head(AllMacbeth)

```

# Debugging our cleanup

- Q: How's it look?
>- Q: What type of object is this?

```{r}
strsplit(x = AllMacbeth$lines[1], split = "\\s+")

```


# Debugging our cleanup

- Q: How's it look?

```{r}
countWords <- function(line) {
  return(length(strsplit(x = line, split = "\\s+")[[1]]))
}

AllMacbeth$nWords <- sapply(X = AllMacbeth$lines, FUN = countWords)

head(AllMacbeth)
```
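The root of the bug is that `strsplit()` always returns a list with one element per input string. A sketch on made-up strings, which also shows `lengths()` as one possible vectorized alternative to the `sapply()` approach:

```r
lines <- c("one two  three", "four five")

pieces <- strsplit(lines, split = "\\s+")

class(pieces)     # "list": one element per input string
length(pieces)    # 2: the number of strings, not the number of words
lengths(pieces)   # 3 2: the number of words in each string
```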



# Choosing your character

- Recall: You want to 
    - impress someone in THEA 102
    - choose an "important" character
    - not too much to memorize
- Q: What are your priorities?
    - Most lines?
    - Longest lines?
    - Who has the most total content to memorize?
    - Something else?
- Q: How should we figure it out from here?

```{r}
# Longest individual lines 
AllMacbeth %>%
  select(speakers, nWords, lines) %>%
  arrange(desc(nWords)) 

```

# Character analysis

- Most lines?
- Longest lines?
- Who has the most total content to memorize?
- Something else?

```{r}
MacbethSummary <- 
  AllMacbeth %>% 
  group_by(speakers) %>%
  summarise(lines = n(), 
            totalWords = sum(nWords), 
            wordsPerLine = totalWords/lines)


# Most words per line
MacbethSummary %>%
  arrange(desc(wordsPerLine)) %>%
  head(10)


# Most total content
MacbethSummary %>%
  arrange(desc(totalWords)) %>%
  head(10)

```


# Who do you choose?

- Let's get your lines!

```{r}
MyLines <- 
  AllMacbeth %>% 
  filter(grepl("  LADY MACBETH\\.", speakers))

head(MyLines)
```




# Text Mining Macbeth

- I want to dig a little deeper... 
- (translation: I found more 2 min Macbeth videos!)
- **Q: What would be some interesting questions to explore when text mining Macbeth?**


# Text Mining Macbeth

- Shakespearean plays are believed to share a common literary structure
- Each ACT in the play has a different, but important role in the drama
- **Problem**: people talk weird in Macbeth (I'm kidding... sort of)
- Let's do some text analysis of a "modern" translation of the play
- <http://www.nosweatshakespeare.com/shakespeares-plays/modern-macbeth/>
- Q: How would you solve this so we could process the text?


# Sidebar "The General Problem"

- I punted.  
- I was busy with DataFest, so I took the sloppy way out
- I made a directory with every scene in the play as a text file
    - I thought about solving the "general problem"
    - I decided to "pass the salt" and show a favorite comic about it instead
- Q: Why am I claiming that my approach is "sloppy"?
- Q: What would be a more elegant solution? 

![https://m.xkcd.com/974/](generalProblem_xkcd954.png)


# Text Analysis

- Now, I've got a whole directory full of text files.
- I was at least smart enough to name them in a structured way: 
    - dedicated directory location
    - useful naming convention: "act1_scene1.txt"
- `readtext` can read all the "*.txt" files in the requested directory
    - `docvarsfrom` indicates that the file name has some variables I want
    - `dvsep` indicates that they are separated by an underscore
- result is a data frame 


```{r}
require(readtext)

ModernMacbeth <- 
  readtext::readtext(file = "/Users/mattbeckman/Documents/GitHub/Teaching/STAT-380/2019 Spring/ClassNotes/15-mdsr/plain_Macbeth/*.txt", docvarsfrom = "filenames", dvsep = "_") %>%
  as_tibble()

head(ModernMacbeth)

```

<!-- Day4 -->

# Corpus

- we're often interested in *many documents* when text mining 
- a **corpus** is a collection of many documents
- Q: In our example: 
    - what is the corpus?
    - what are the documents?
- Q: Describe a "corpus" and the associated "documents" for
    - a different analysis of Shakespeare?
    - some other author?
    - other examples? 


# Modern Macbeth Corpus

- before we can really dig into our study of the differences among acts in Macbeth, we need to build up a couple text analysis tools
- we'll use the full text of Modern Macbeth to get started



# Tidy Text Format

- Recall: tidy data has a specific structure
    - each case is a row
    - each variable is a column
- **Tidy text** could similarly be described as a table with **one token-per-row**
- a **token** is a meaningful unit of text useful for analysis
    - word (very common)
    - n-gram (sequence of words)
    - sentence
    - paragraph
- **tokenization** is the process of splitting text into tokens

# Tokenization step

- let's tokenize Modern Macbeth
- we'll use some tools from the `tidytext` package
- `unnest_tokens` breaks the text into individual tokens... automatically
    - it keeps the other columns--act & scene here
    - strips punctuation
    - converts to lowercase (default arg: `to_lower = TRUE`)


```{r}
require(tidytext)

ModernMacbeth %>%
  select(text, act = docvar1, scene = docvar2) %>%
  unnest_tokens(output = word, input = text)         # single word tokenization
  # unnest_tokens(output = word, input = text, token = "ngrams", n = 3)   # n-gram tokenization
```
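For intuition, the tokenization step can be roughly imitated in base R. This is only a sketch on a made-up sentence; `unnest_tokens` relies on a more careful tokenizer (for example, in how it treats apostrophes), so results on real text will differ.

```r
# Toy input; the sentence is made up for illustration
text <- "Fair is foul, and foul is fair: hover through the fog!"

# Word tokens: lowercase, replace punctuation with spaces, split on whitespace
clean <- gsub("[[:punct:]]+", " ", tolower(text))
words <- strsplit(trimws(clean), "[[:space:]]+")[[1]]
words

# Trigram tokens: slide a window of n consecutive words across the vector
n <- 3
starts <- seq_len(length(words) - n + 1)
trigrams <- vapply(starts,
                   function(i) paste(words[i:(i + n - 1)], collapse = " "),
                   character(1))
head(trigrams)
```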

# Bag of words

- We're using a simple approach to text analysis called bag of words
- **Bag of words** preserves term frequencies, but disregards word order, grammar, etc

```{r}
ModernMacbeth_tidy <- 
  ModernMacbeth %>%
  select(text, act = docvar1, scene = docvar2) %>%
  unnest_tokens(output = word, input = text)         # single word tokenization
```
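A quick way to see what "disregards word order" means: two made-up sentences with opposite meanings produce exactly the same bag of words. The helper `bag()` is a hypothetical name used only for this sketch.

```r
# Two made-up sentences with the same words in a different order
s1 <- "macbeth kills duncan"
s2 <- "duncan kills macbeth"

# A bag of words is just the term counts, here as a named integer vector
bag <- function(x) {
  counts <- table(strsplit(x, " ", fixed = TRUE)[[1]])
  setNames(as.integer(counts), names(counts))
}

bag(s1)
identical(bag(s1), bag(s2))   # word order is lost, so the bags match
```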


# Token frequency

- Q: What have we learned about Act 1?
- Q: What should we do next?

```{r}
ModernMacbeth_tidy %>%
  count(word, sort = TRUE)
```


# Stop words

- **stop words** are words that are not useful for an analysis
    - extremely common words like "the", "and", "to", as shown
    - words that aren't useful in analysis like "accordingly"
- The contents of the stop word list matter a lot
    - `tidytext` package includes a data set called `stop_words` to get you started
    - The stop words are different for each language
    - You might add/modify the stop words based on specialized expertise with the context 
- Q: What if we had used the original Shakespeare?

```{r}
# native text
ModernMacbeth_tidy %>%
  count(word, sort = TRUE)

# load stop word list (English)
data("stop_words")   # from `tidytext` package

head(stop_words)
tail(stop_words)
```

# Removing stop words

- filter out stop words
- equivalently, we could use an `anti_join` here
- it turns out Macbeth is a big deal and he's all over the play 
    - we might consider removing "macbeth" from the bag of words
    - Q: how about another suggestion?
        - ...maybe something that seems like it clearly should have been removed already??
        - is our list of stop words really that bad??


```{r}
# # `stop_words` has two columns: "word" and "lexicon"
# stop_words <-
#   rbind(stop_words,           
#         c("macbeth", "custom"))
# 
# stop_words %>%
#   filter(word == "it's")
# 
# # we'll need to use RegEx to clean these up
# grep(pattern = "it.s", x = ModernMacbeth_tidy$word, value = TRUE)
# grep(pattern = "’", x = ModernMacbeth_tidy$word, value = TRUE) %>% head(30)


ModernMacbeth_tidy <- 
  ModernMacbeth_tidy %>%
  # mutate(word = gsub(pattern = , replacement = , x = word)) %>%
  filter(!(word %in% stop_words$word))

ModernMacbeth_tidy %>%
  count(word, sort = TRUE)
```


# Tidy text pipes to `ggplot2`

- Let's make a bar chart of word frequencies
- Q: What's going on in Act 1 of Macbeth?


```{r}
ModernMacbeth_tidy %>%
  mutate(word = gsub(pattern = "’", replacement = "'", x = word)) %>%
  count(word, sort = TRUE) %>%
  filter(n > 20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
```



# Word clouds

- **word clouds** are another popular visualization of token frequency (analogous to our bar chart)
- the `wordcloud` package uses base R graphics, but the `with( )` function helps
- Q: What's Macbeth about?
    - What have we learned from this?
    - What should we do next?

```{r}
require(wordcloud)

ModernMacbeth_tidy %>%
  filter(!(word %in% stop_words$word)) %>%
  count(word) %>%
  with(., wordcloud(word, n, max.words = 45))

```



# word clouds: Modern vs original

- I've made a helper function to prep the corpus 
- I also added "said" to our stop word list
    - `c(stopwords("english"), "said")`
    - should "macbeth" be added to the list?

```{r}
AllMacbeth_tidy <- 
  AllMacbeth %>%
  select(speakers, lines) %>%
  unnest_tokens(output = word, input = lines)     # single word tokenization

head(AllMacbeth_tidy)
```


```{r}
# helper function for making wordclouds
macbeth_wordcloud <- function(Corpus_tidy, maxWords = 45) {
  Corpus_tidy %>%
    filter(!(word %in% stop_words$word)) %>%
    count(word) %>%
    with(., wordcloud(word, n, max.words = maxWords))

}

macbeth_wordcloud(Corpus_tidy = AllMacbeth_tidy)
macbeth_wordcloud(Corpus_tidy = ModernMacbeth_tidy)


```


# Act by Act word clouds

- Q: Can we tell anything about the storyline?
- Q: How reliably can word clouds convey the meaning of each Act?

```{r}

macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act1"))
macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act2"))
macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act3"))
macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act4"))
macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act5"))


```

<!-- Day5 -->

# Recall 

- Humans: 
    - excellent at *understanding* text
    - not as good at *storing* text
- Computers: 
    - excellent at *storing* text
    - not as good at *understanding* text 
- Human + Computer??
    - excellent at understanding & storing text?
    - not so good at storing & understanding text?
    - (depends on the human & the computer...?)

# Sentiment

- token frequencies are OK, but words evoke emotion
- humans are (usually) excellent at understanding emotions 
- word choice can indicate 
    - general positive/negative emotion
    - specific emotions like surprise or disgust
- good news!
    - people have compiled detailed word lists for this
    - `tidytext` package includes several in the `sentiments` data set

```{r}
require(tidytext)
data("sentiments")

head(sentiments, 10)
tail(sentiments, 10)
```


# Sentiment lexicons

- `sentiments` contains three general purpose lexicons
    - AFINN--rating from -5 (very negative) to +5 (very positive)
    - bing--positive/negative classification
    - nrc--classification into categories
        - positive
        - negative
        - fear
        - sadness
        - (also anger, anticipation, disgust, joy, surprise, trust)
- `tidytext` provides function `get_sentiments( )` to choose a lexicon
- lots of words are quite neutral, so they can be excluded from sentiment lexicons
- the sentiments in these lexicons might be constructed by crowdsourcing or research, then validated using restaurant, movie, or Amazon reviews
- Q: why might we hesitate to apply these sentiment lexicons to Shakespeare's literature?
- Q: Any other concerns that might make it hard to capture sentiment?
    - one-word tokens
    - extremely large chunks of text (all of Macbeth)

```{r}
get_sentiments("nrc") %>%
  group_by(sentiment) %>%
  summarise(N = n()) %>%
  arrange(desc(N))
```



# Sentiment analysis of Macbeth

- since we're working with tidy data, we can use an `inner_join` for sentiment analysis
- this converts several "text mining" tasks into simple tidy data analysis tasks
- let's investigate some common terms in Macbeth that suggest "anticipation"

```{r}
nrc_anticipation <- get_sentiments("nrc") %>%
  filter(sentiment == "anticipation")

ModernMacbeth_tidy %>%
  inner_join(nrc_anticipation) %>%
  count(word, sort = TRUE)
  
```

# Changes in Sentiment

- small sections of text may not have enough words to communicate sentiment
- extremely large sections might average out the sentiment we want to capture
- for Shakespeare's original text, we might choose blocks of 80 lines or so
- for Modern Macbeth, we'll choose whole scenes (perhaps a bit large/variable)
- we'll calculate a metric to assess each scene
    - the mean AFINN score of the scene's matched words
    - ranges from -5 (very negative) to +5 (very positive)

```{r}
ModernMacbeth_sentiment <- 
  ModernMacbeth_tidy %>%
  mutate(act_scene = paste(act, scene, sep = "_")) %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(act, scene) %>%
  summarise(sentiment = mean(score, na.rm = TRUE))   # newer tidytext releases name this AFINN column `value`

head(ModernMacbeth_sentiment)
```
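The join-then-summarise pattern above can be sketched in base R with `merge()` standing in for `inner_join()`. The tokens are toy data and the lexicon scores are made up for illustration; they are not real AFINN values.

```r
# Tidy tokens: one word per row (toy data)
tokens <- data.frame(
  scene = c("s1", "s1", "s1", "s2", "s2", "s2"),
  word  = c("happy", "murder", "castle", "good", "happy", "king"),
  stringsAsFactors = FALSE
)

# Hypothetical lexicon; these scores are made up, not real AFINN values
lexicon <- data.frame(
  word  = c("happy", "good", "murder"),
  score = c(3, 2, -4),
  stringsAsFactors = FALSE
)

# merge() keeps only rows whose word appears in both tables (an inner join),
# so neutral words such as "castle" and "king" drop out
scored <- merge(tokens, lexicon, by = "word")

# Mean score per scene
scene_sentiment <- aggregate(score ~ scene, data = scored, FUN = mean)
scene_sentiment
```

Note how the result depends only on words found in the lexicon; a scene with few matches is summarized by very little data.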


# Plot changes in sentiment by Act

```{r}
ModernMacbeth_sentiment %>%
  rownames_to_column() %>%
  mutate(rowname = parse_number(rowname)) %>%
  ggplot(aes(x = rowname, y = sentiment, fill = act)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  ggtitle("Sentiment Analysis of each Act & Scene in Modern Macbeth")
  
```


# Common positive and negative words

- we have a tidy data frame with both sentiment and word
- we can analyze word counts that contribute to each sentiment

```{r}
bing_word_counts <- 
  ModernMacbeth_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts
```

# Plotting word frequency by sentiment

```{r}
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
```


# Wordclouds with *feeling*

- originally, we may have had trouble discerning what the text was about because so many of the common words don't evoke sentiment (e.g., Macduff, Banquo, Malcolm)
- how about we revisit our wordcloud based on positive/negative sentiment!
    - we'll use `wordcloud::comparison.cloud( )`
    - note: we'll need `acast()` to turn the data frame into a matrix
    - Q: Anything strange about this word cloud?
        

```{r eval=FALSE}
ModernMacbeth_tidy %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 45))
```



```{r}
require(reshape2)

ModernMacbeth_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 45)
```



<!-- # Sentence tokenization (Aside) -->


<!-- ```{r} -->
<!-- ModernMacbeth_tidy_sent <-  -->
<!--   ModernMacbeth %>% -->
<!--   select(text, act = docvar1, scene = docvar2) %>% -->
<!--   unnest_tokens(output = sentence, input = text, token = "sentences") %>% # sentence tokenization -->
<!--   mutate(sentence = gsub(pattern = "’|‘", replacement = "'", x = sentence))  # fix quotes -->

<!-- ModernMacbeth_tidy_sent$sentence %>% head() -->
<!-- ``` -->


# Positive scenes in Macbeth

- we want to find the most positive scene in each act
- the ratios aren't good
- in the actual text, things go pretty south pretty fast...
    - Act 1, Scene 6: Duncan & Banquo on a leisurely ride through the countryside (genuinely pleasant)
    - Act 2, Scene 1: Macbeth trying to act natural for Banquo (while planning to kill Duncan)
    - Act 3, Scene 2: Macbeth wishing he was dead like Duncan 
    - Act 4, Scene 2: super long, I didn't read it all
    - Act 5, Scene 6: Malcolm & Macduff psych themselves up to kill Macbeth

```{r}
bingpositive <- get_sentiments("bing") %>% 
  filter(sentiment == "positive")

wordcounts <- ModernMacbeth_tidy %>%
  group_by(act, scene) %>%
  summarize(words = n())

ModernMacbeth_tidy %>%
  semi_join(bingpositive) %>%
  group_by(act, scene) %>%
  summarize(positivewords = n()) %>%
  left_join(wordcounts, by = c("act", "scene")) %>%
  mutate(ratio = positivewords / words) %>%         # ratio of positive words
  filter(scene != 0) %>%
  top_n(1, ratio) %>%
  ungroup()
```


# Analysis of word & document frequency

- a central question in natural language processing is how to quantify what a document is about
- one approach as we have discussed is simply term frequency 
    - this may draw attention to common words regardless of importance
    - stop word removal attempts to adjust term frequency for commonly used words
    - this is a fairly crude approach; perhaps we can do better

# Document term matrices

- alternative approach for assessing term frequency is the **document-term matrix**
    - sparse matrix describing a collection (corpus) of documents
    - row for each document
    - column for each term
    - values are typically word count or "tf-idf" (see below)
- **term frequency**--$tf(t, d)$--is simply the frequency of term $t$ in document $d$
- **inverse document frequency**--$idf$--measures how rare term $t$ is across a set of documents $D$
    - words that are common across many documents like "the" get low weight
    - numerator is the number of documents in $D$
    - denominator is the number of documents in $D$ containing the term
- $tf-idf$ multiplies the two
    - frequency of a term adjusted for how rarely it is used
    - roughly measures how important a word is to a document in a collection (or corpus) of documents

\[idf(t, D) = \text{ln}\frac{|D|}{|\{d \in D : t \in d\}|}\]
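A small worked example of the definitions above, in base R. The three "documents" are made up for illustration, and $tf$ is taken as the share of a document's words (`n / total`).

```r
# Toy corpus: three "documents" (made up for illustration)
docs <- list(
  act1 = c("the", "witches", "hail", "the", "king"),
  act2 = c("the", "dagger", "the", "king"),
  act3 = c("the", "ghost", "the", "feast")
)

# Term counts per document, arranged as a document-term matrix
terms <- sort(unique(unlist(docs)))
counts <- t(vapply(docs,
                   function(d) as.numeric(table(factor(d, levels = terms))),
                   numeric(length(terms))))
colnames(counts) <- terms

tf  <- counts / rowSums(counts)                 # tf(t, d): share of document d
idf <- log(nrow(counts) / colSums(counts > 0))  # ln(|D| / number of docs containing t)
tf_idf <- sweep(tf, 2, idf, `*`)                # multiply each term's column by its idf

round(idf, 3)
round(tf_idf, 3)
```

"the" appears in every document, so its $idf$ is $\ln(3/3) = 0$ and its tf-idf is zero everywhere, even though it is the most frequent word.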


# Term frequency in Acts of Macbeth

- What are the most commonly used words in each Act? 
    - term frequency
    - $tf-idf$
- we will "start over" from the source (without removing stop words)
- each row in `act_words` is a unique word-act combination
    - `n` is the number of times that word is used in that act 
    - `total` is the total words in the act

```{r}
# tf within each act
act_words <- 
  ModernMacbeth %>%
  select(text, act = docvar1, scene = docvar2) %>%
  unnest_tokens(output = word, input = text) %>%
  mutate(word = gsub(pattern = "’", replacement = "'", x = word)) %>%
  count(act, word, sort = TRUE)

# total words in the act
total_words <- 
  act_words %>% 
  group_by(act) %>% 
  summarize(total = sum(n))

act_words <- left_join(act_words, total_words, by = "act")
act_words
```


# Term frequency in Macbeth

- let’s look at the distribution of `n / total` for each act
- this is the term frequency, $tf$ 
- very long right tails (those extremely common words) 
- These plots exhibit similar distributions for all the acts 
    - many words that occur rarely
    - fewer words that occur frequently


```{r}
act_words %>% 
  ggplot(aes(n / total, fill = act)) +
  geom_histogram(show.legend = FALSE) +
  facet_wrap(~ act, ncol = 2, scales = "free_y") + 
  ggtitle("Term frequency in Acts of Macbeth")
```



# Zipf's law

- long-tailed distributions are common in any given corpus of natural language (like a book, or a lot of text from a website, or spoken words) 
- a classic version of this relationship is called Zipf’s law, after George Zipf, a 20th century American linguist.
- **Zipf’s law** states that the frequency with which a word appears is inversely proportional to its rank.

```{r}
freq_by_rank <- 
  act_words %>% 
  group_by(act) %>% 
  mutate(rank = row_number(),   # since already ordered by `n`
         `term frequency` = n / total)

freq_by_rank
```



# Visualization of Zipf's law

- Zipf’s law visualized by plotting (on logarithmic scales)
    - x-axis: `rank` of each word within the frequency table
    - y-axis: `term frequency` of each word
    - result: inversely proportional relationship has a constant, negative slope
- this type of result is known as a **power law** <https://en.wikipedia.org/wiki/Power_law>
    - relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities
    - one quantity varies as a power of another
    - area of a square vs length of side; 2 times length >> $2^2$ times the area
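Why a straight line? If frequency is $c / rank$, then $\log(frequency) = \log(c) - \log(rank)$, which is linear in $\log(rank)$ with slope $-1$. A minimal sketch with idealized (simulated, not Macbeth) frequencies:

```r
# Idealized Zipf frequencies: frequency proportional to 1 / rank
rank <- 1:500
freq <- (1 / rank) / sum(1 / rank)   # normalize so the frequencies sum to 1

# On log-log axes the relationship is a straight line
fit <- lm(log10(freq) ~ log10(rank))
coef(fit)   # slope of -1; intercept is log10 of the top-ranked word's frequency
```

Real text will only approximate this, which is why the fitted slope for a corpus is usually near, but not exactly, $-1$.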


```{r}
freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = act)) + 
  geom_line(size = 1.1, alpha = 0.8, show.legend = TRUE) + 
  scale_x_log10() +
  scale_y_log10() + 
  ggtitle("Zipf's law for acts of Macbeth")
```


# Investigating term frequency in Macbeth

- result wasn't *quite* linear
- deviations at high rank are not uncommon for many kinds of language
    - a corpus often contains fewer rare words than predicted by a single power law. 
- deviations at low rank are more unusual 
    - Modern Macbeth used a lower percentage of the most common words than many collections of language.
    - analysis could be extended to compare authors, or other collections of text (sonnets vs tragedies)


```{r}
rank_subset <- freq_by_rank %>% 
  filter(rank < 200,
         rank > 10)

lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)

freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = act)) + 
  # intercept & slope should match the lm() coefficients above; update them if your fit differs
  geom_abline(intercept = -0.946, slope = -0.909, color = "gray50", linetype = 2) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = TRUE) + 
  scale_x_log10() +
  scale_y_log10() + 
  ggtitle("Zipf's law for Modern Macbeth")
```

<!-- Day6 -->

# *tf-idf*

-  **idea of *tf-idf***: find important words for the content of each document by
    - *decreasing* the weight for commonly used words 
    - *increasing* the weight for words that are not used very much in the corpus of documents
- Calculating $tf-idf$ attempts to find the words that are important (i.e., common) in a text, but not too common across all texts.
- `bind_tf_idf( )` function in the `tidytext` package takes a tidy text dataset as input
    - only need one row per token (term), per document
    - column for the words, another identifying source document (act)
    - we calculated `total` for each act previously, but don't need it
- $idf$
    - zero for words that are common across all documents ($tf-idf$ is then zero too)
    - higher for words that appear in fewer documents


```{r}
act_words <- 
  act_words %>%
  bind_tf_idf(word, act, n)

act_words
```

# High *tf-idf* in Macbeth

- common to see proper nouns (people/places) from each document with the highest *tf-idf*
- interesting here to see other featured tokens (cauldron, hail, etc)! 
- note: discreteness in the $idf$ here because we have only 5 documents
    - term that appears in only one of the five acts: $idf = ln(5/1) = 1.6094$

```{r}
act_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))
```
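The discreteness is easy to enumerate: with 5 documents, a term can appear in 1, 2, 3, 4, or 5 of them, so $idf$ can take only five values.

```r
n_docs <- 5          # five acts
k <- 1:n_docs        # number of acts containing a term
idf <- log(n_docs / k)

round(setNames(idf, paste0("in_", k, "_acts")), 4)
```

A term found in all five acts gets $idf = \ln(5/5) = 0$, so its tf-idf is zero no matter how often it is used.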


# Visualizing high *tf-idf* words

```{r}
act_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(act) %>% 
  top_n(10) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = act)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~act, ncol = 2, scales = "free") +
  coord_flip()
```




# Note about converting to/from other non-tidy formats

- there are lots of packages for text analysis, e.g., 
    - `quanteda`
    - `tm`
- these often rely on a common (non-tidy) data structure called a "document-term matrix"
- **Document-term matrix (DTM)** is a matrix such that
    - each *row* represents one document (book, article, tweet, etc)
    - each *column* represents one term
    - each value (typically) contains frequency of the term in the document


# Document-term matrix

- DTMs are usually very sparse since lots of term-document pairs don't occur
- DTMs don't play nice with tidy tools & data frames don't play nice with most text mining packages
    - `tidytext` package has tools to easily convert between object types
    - `tidy()` turns a DTM into a tidy data frame
    - `cast_dtm()` (with a "T") turns a tidy, one-term-per-row, data frame into a DTM (for tools in `tm` pkg)
    - `cast_dfm()` (with an "F") turns a tidy, one-term-per-row, data frame into a DFM (for tools in `quanteda` pkg)
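To make the two shapes concrete, here is a base R sketch of the round trip on toy counts. It uses `xtabs()` to build a small dense matrix as a stand-in; the real `cast_dtm()` and `tidy()` work with sparse matrix classes instead.

```r
# Tidy counts: one row per document-term pair that actually occurs (toy data)
tidy_counts <- data.frame(
  document = c("act1", "act1", "act2", "act2", "act3"),
  term     = c("witches", "king", "dagger", "king", "ghost"),
  n        = c(2, 1, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Cast to a (dense) document-term matrix; pairs that never occur become 0
dtm <- xtabs(n ~ document + term, data = tidy_counts)
dtm

# Back to tidy form: keep only the non-zero cells
back <- as.data.frame(dtm, responseName = "n", stringsAsFactors = FALSE)
back <- back[back$n > 0, ]
back

# Share of empty cells, i.e. how sparse the matrix is
mean(dtm == 0)
```

Even in this tiny example more than half the cells are zero; with thousands of terms the share of zeros is far higher, which is why sparse storage matters.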



# Topic Modeling

- sometimes useful to find natural groups among documents of some collection/corpus (blog posts, Wikipedia pages, etc) 
- **Topic modeling**--method for unsupervised classification of documents
    - find natural groups even if we're not totally sure what to look for
    - similar to clustering on numeric data 
- Latent Dirichlet allocation (LDA) is a particularly popular method


# Latent Dirichlet allocation (LDA)

- **Latent Dirichlet allocation (LDA)** 
    - every document is a mixture of *topics*
    - every topic is a mixture of *words*
    - LDA basically estimates the composition of both mixtures at the same time
- result allows content of documents to "overlap" rather than enforcing discrete groups... much as topics overlap in natural language use.
- the `topicmodels` package will help us on our way
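The "two mixtures" idea can be sketched with a little matrix arithmetic. All numbers below are hypothetical, and this only shows the forward direction (from mixtures to word probabilities); LDA's job is the reverse, estimating both mixtures from observed word counts.

```r
# Hypothetical per-topic word probabilities; each row sums to 1
beta <- rbind(
  politics      = c(president = 0.40, congress = 0.30, budget = 0.20, movie = 0.05, actor = 0.05),
  entertainment = c(president = 0.05, congress = 0.05, budget = 0.20, movie = 0.40, actor = 0.30)
)

# Hypothetical per-document topic proportions; each row sums to 1
gamma <- rbind(
  doc1 = c(politics = 0.9, entertainment = 0.1),
  doc2 = c(politics = 0.5, entertainment = 0.5)
)

# Word probabilities implied for each document: mix the topics by gamma
word_probs <- gamma %*% beta
round(word_probs, 3)
rowSums(word_probs)   # each document's word probabilities sum to 1
```

A word like "budget" that is equally likely under both topics has the same probability in every document, so it tells us nothing about which topic a document leans toward.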



# Preparing for LDA

- We'll borrow a data set of articles from the Associated Press to illustrate some principles first.  
    - Don't worry, back to Macbeth in a bit
- Note: We need the data as a DTM (not tidy form)
- Q: can you explain what's happening at each line?
- Q: how many rows & columns are in our `AssociatedPress` matrix?

```{r}
data("AssociatedPress", package = "topicmodels")   # data set ships with `topicmodels`
AssociatedPress
```

# Two-topic Latent Dirichlet allocation (LDA) model

- fitting the LDA model is the easy part with help of the `LDA( )` function
- note: we set an arbitrary seed in this case so we get the same result each time... not necessary in general
- There are almost certainly more than two topics, but this is a start
- For a corpus of news articles like the AP data, we might expect the topics to be something like "politics" and "entertainment"... recall: 
    - each document could include a mix of topics (politics & entertainment)
    - each topic has a mix of words
        - *politics* topic might include words like 'president', 'congress', and 'government'
        - *entertainment* topic might include 'movies', 'television', and 'actor'
        - **both** topics might frequently include a word like 'budget'

```{r}
require(topicmodels)

ap_lda <- 
  AssociatedPress %>%
  LDA(k = 2, control = list(seed = 1234))  

ap_lda
```

# Per-topic-per-word probabilities 

- want to extract the per-topic-per-word probabilities, $\beta$ ("beta"), from the model
- `tidytext::tidy( )` provides method 
    - one-topic-per-term-per-row format
    - For each combination, model computes probability of term being generated from that topic

```{r}
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
```


# Top 10 per-topic-per-word probabilities ($\beta$) 

- Q: how might you characterize each topic now?
    - note words characteristic of each topic
    - note words common to both topics

```{r}
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
```


# Words with greatest difference in $\beta$ between topics

- estimated based on the log ratio of the two: $\text{log}_2(\frac{\beta_2}{\beta_1})$
- a log ratio is useful because it makes the difference symmetrical
    - $\beta_2$ being twice as large leads to a log ratio of 1
    - $\beta_1$ being twice as large results in -1
- we filter for relatively common words with $\beta > 1/1000$ in at least one topic
- Q: what can we learn by inspecting the extremes?

```{r}
beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1)) %>%
  arrange(desc(log_ratio))

beta_spread
```
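A tiny check of the symmetry claim, using hypothetical $\beta$ values (the term names are placeholders):

```r
# Hypothetical beta values for three terms in two topics
beta1 <- c(termA = 0.002, termB = 0.004, termC = 0.003)
beta2 <- c(termA = 0.004, termB = 0.002, termC = 0.003)

beta2 / beta1          # raw ratios: 2, 0.5, 1 (not symmetric around 1)
log2(beta2 / beta1)    # log ratios: 1, -1, 0 (symmetric around 0)
```

On the raw scale "twice as likely in topic 2" (2) and "twice as likely in topic 1" (0.5) sit at different distances from 1; on the log scale they are the same distance from 0.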



# Per-document classification

- LDA also models each document as a mixture of topics
- We examine the per-document-per-topic probabilities, $\gamma$ ("gamma")
- Gamma is the estimated proportion of words in a document that are generated from a given topic.  For example,
    - about 25% of the words in document 1 come from topic 1
    - about 82% of the words in document 5 come from topic 2

```{r}
ap_documents <- tidy(ap_lda, matrix = "gamma")   # AP articles, not Macbeth scenes
ap_documents %>% 
  arrange(document, gamma)
```

# Let's investigate a few interesting documents

- Document 3 is almost 50-50 between our two topics
- Document 6 is almost entirely topic 2
- Q: what does each document seem to be about?
- Q: does $\gamma$ attributed to each topic make sense in these cases?

```{r}
# investigate document 3 (change the filter to 6 to inspect the other)
tidy(AssociatedPress) %>%
  filter(document == 3) %>%
  arrange(desc(count))
```


# Modeling Topics as Acts in Macbeth...?

- Recall: our motivating question was to try and learn if we could use the content of Macbeth to expose structure among the 5 acts
- Results so far: 
    - wordclouds: "meh"
    - sentiment analysis: super negative...no wonder Macbeth is called a "tragedy"
    - *tf-idf*: hard to tell much... all the documents were written by the same person for the same purpose, so not surprised
    - **LDA**: ???
- Goal: use LDA to model `k = 5` topics 
    - common to try a few different values of `k` (number of topics)
    - start with 5 here because we know there are 5 acts in the play


```{r}
# convert to Document Term Matrix

ModernMacbeth_DTM <- 
  ModernMacbeth %>%
  mutate(act_scene = gsub(pattern = "\\.txt", replacement = "", x = doc_id)) %>%
  unnest_tokens(output = word, input = text) %>%
  mutate(word = gsub(pattern = "’", replacement = "'", x = word)) %>%
  anti_join(stop_words, by = c("word" = "word")) %>%     
  count(act_scene, word, sort = TRUE) %>%   
  cast_dtm(document = act_scene, term = word, value = n)


```

```{r}
# LDA topic model (5 topics)
scenes_lda <- 
  ModernMacbeth_DTM %>%
  LDA(k = 5, control = list(seed = 380))  


# Per-topic-per-word probabilities 
scene_topics <- tidy(scenes_lda, matrix = "beta")
scene_topics


```


```{r}
# Visualize top per-topic-per-word probabilities
top_terms <- 
  scene_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()



```


```{r}
# Per-document classification
scenes_gamma <- tidy(scenes_lda, matrix = "gamma")
scenes_gamma %>% 
  arrange(document, gamma)

```


```{r}
# Topics vs Acts?
scenes_gamma <- 
  scenes_gamma %>%
  separate(col = document, into = c("act", "scene"), sep = "_", convert = TRUE)

scenes_gamma %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ act)

```



# Scene alignment to topics

- Okay, the topics aren't the acts
- turns out this enduring Shakespearean masterpiece needs a more nuanced interpretation
- Literary themes could align better to the Topics identified by our LDA...
- here's one attempt I found, with my wife's description of each color:
    - ambition (lavender)
    - fate (poppy)
    - violence (Tiffany blue)
    - nature & unnatural (sand)
    - masculinity (Kendall gray)

![https://www.litcharts.com/lit/macbeth/chart-board-visualization](MacbethLiteraryThemes.png){ width=90% }



# Road map (again)

<center>

![https://www.tidytextmining.com/topicmodeling.html](tidytextFlow.png){width=95%}

</center>





