This exercise follows the example in Chapter 1 of the book “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson.
The exercise illustrates how to analyze and compare word usage frequency across books by different authors.
The code is modified from the book's sample code in order to clarify the steps, correct errors caused by package version changes, and add some minor enhancements.
The authors included in the analysis are:
Jane Austen
H.G. Wells
Bronte Sisters
The learning outcomes are:
how to convert text into the tidy text format (one token per row)
how to remove stop words
how to count word frequency
how to compare the word frequency between different authors’ books using visual plots
how to compare the word frequency between different authors’ books using a correlation coefficient
The following are the libraries required for this exercise.
library(rmarkdown)
library(janeaustenr)
library(gutenbergr)
library(dplyr)
library(tidytext)
library(stringr)
library(tidyr)
library(scales)
library(ggplot2)
library(ggthemes)
jane_austen_book <- austen_books()
str(jane_austen_book)
## tibble [73,422 x 2] (S3: tbl_df/tbl/data.frame)
## $ text: chr [1:73422] "SENSE AND SENSIBILITY" "" "by Jane Austen" "" ...
## $ book: Factor w/ 6 levels "Sense & Sensibility",..: 1 1 1 1 1 1 1 1 1 1 ...
Check the list of books using the levels() function, which returns the categories of a factor column:
levels(jane_austen_book$book)
## [1] "Sense & Sensibility" "Pride & Prejudice" "Mansfield Park"
## [4] "Emma" "Northanger Abbey" "Persuasion"
Another approach is using unique():
unique(jane_austen_book$book)
## [1] Sense & Sensibility Pride & Prejudice Mansfield Park
## [4] Emma Northanger Abbey Persuasion
## 6 Levels: Sense & Sensibility Pride & Prejudice Mansfield Park ... Persuasion
Or pipe the data into the distinct() function:
jane_austen_book %>% distinct(book)
jane_austen_book <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case=TRUE)))) %>%
ungroup()
jane_austen_book
Explanation: (reference: https://regex101.com/r/3kzOvH/1)
^: asserts position at start of a line
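chapter : matches the characters “chapter ” literally, including the trailing space (case-insensitively, because ignore_case=TRUE)
[\divxlc]: matches a single character that is either a digit or one of the listed letters (\d is written as \\d inside an R string)
\d: matches a digit (equivalent to [0-9])
ivxlc: matches a single character in the list “ivxlc” (Roman numeral characters)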
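As a minimal sketch of how this pattern drives the chapter counter, the toy vector below (hypothetical lines, not taken from the books) is tested against the same regular expression. cumsum() then turns the TRUE/FALSE heading flags into a running chapter number.

```r
library(stringr)

# Hypothetical lines standing in for the `text` column of one book
toy_lines <- c("SENSE AND SENSIBILITY", "CHAPTER 1", "The family of Dashwood",
               "Chapter II", "the chapter ended", "Chapter One")

# TRUE only where a line starts with "chapter " followed by a digit
# or one of the letters i, v, x, l, c (case-insensitive)
is_heading <- str_detect(toy_lines,
                         regex("^chapter [\\divxlc]", ignore_case = TRUE))

# Running count of headings seen so far, i.e. the chapter each line belongs to
chapter <- cumsum(is_heading)

data.frame(line = toy_lines, is_heading, chapter)
```

Lines before the first heading get chapter 0, and a heading such as “Chapter One” is not detected because “o” is not in the character list.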
Convert the text to the tidy text format, which has one token (by default, one word) per row.
jane_austen_tidy_book <- jane_austen_book %>%
unnest_tokens(word, text)
jane_austen_tidy_book
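To see what unnest_tokens() does on a small scale, here is a sketch on a two-line toy tibble (the `toy_text` object and its `line` column are made up for illustration). With the default settings, the words are lower-cased, punctuation is dropped, and the other columns are kept.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line input with an extra column to show it is retained
toy_text <- tibble(
  line = 1:2,
  text = c("It is a truth universally acknowledged,",
           "that a single man in possession of a good fortune")
)

# One row per word; the `text` column is replaced by `word`
toy_tidy <- toy_text %>%
  unnest_tokens(word, text)

toy_tidy
```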
data("stop_words")
The stop_words dataset draws its stop words from three lexicons:
“onix”
“SMART”
“snowball”
jane_austen_tidy_book <- jane_austen_tidy_book %>%
anti_join(stop_words)
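The effect of anti_join() can be checked on a few words. This is a minimal sketch with a hypothetical `toy_words` tibble; it also passes `by = "word"` explicitly, which is optional but avoids the message about the join column being guessed.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens: three common stop words and two content words
toy_words <- tibble(word = c("the", "and", "of", "dashwood", "estate"))

# How many stop words each lexicon contributes
count(stop_words, lexicon)

# Keep only the rows of toy_words that have no match in stop_words
toy_kept <- toy_words %>%
  anti_join(stop_words, by = "word")

toy_kept
```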
jane_austen_tidy_book %>%
count(word, sort=TRUE) %>%
filter(n>500)
The words “miss” and “sir” are titles for women and men in the text. They carry little meaning on their own, so they are added to the stop words as well, together with Roman numerals and a few other uninformative tokens.
custom_stop_words <- bind_rows(tibble(word=c("miss", "sir", "ii", "iii", "iv",
"v", "vi", "vii", "viii","ix",
"xi", "xii", "xiii", "xiv",
"xv", "xvi", "nil", "NA"),
lexicon = c("custom")),
stop_words)
custom_stop_words
Then remove the words in the customized stop word list from the book.
jane_austen_tidy_book <- jane_austen_tidy_book %>%
anti_join(custom_stop_words)
jane_austen_tidy_book %>%
count(word, sort=TRUE) %>%
filter(n>500)
jane_austen_tidy_book %>%
count(word, sort=TRUE) %>%
filter(n>500) %>%
mutate(word=reorder(word,n)) %>%
ggplot(aes(word,n)) +
geom_col() +
xlab(NULL) +
ylab("Word Count") +
coord_flip() +
ggtitle("Jane Austen Books Most Commonly Used Words") +
theme_classic()
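The reorder() step in the plot code is what sorts the bars. As a small sketch on made-up counts (the `toy_counts` object is hypothetical), it converts `word` into a factor whose levels are ordered by `n`, lowest first, so that after coord_flip() the most frequent word ends up at the top.

```r
library(dplyr)

# Hypothetical tokens: "time" x3, "dear" x2, "lady" x1
toy_counts <- tibble(word = c("time", "dear", "lady", "time", "time", "dear")) %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n))   # factor levels ordered by increasing n

toy_counts
levels(toy_counts$word)
```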
hgwells_booklist <- gutenberg_works(author == "Wells, H. G. (Herbert George)")
hgwells_list <- tibble(id= hgwells_booklist$gutenberg_id, book = hgwells_booklist$title)
hgwells_list
The following H.G. Wells books are included (Gutenberg ID in brackets):
The Time Machine [35]
The War of the Worlds [36]
The Invisible Man [5230]
The Island of Doctor Moreau [159]
In the Days of the Comet [3797]
The Food of the Gods and How It Came to Earth [11696]
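The War in the Air [778] (note: the title check further below reports ID 778 as “Five Children and It”, so this ID does not appear to match the intended book)
When the Sleeper Wakes [775]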
The First Men in the Moon [1013]
The World Set Free [1059]
The Country of the Blind, and Other Stories [11870]
hgwells_list <- c(35, 36, 5230, 159, 3797, 11696, 778, 775, 1013, 1059, 11870)
hgwells_book <- gutenberg_download(hgwells_list,
mirror = "http://mirrors.xmission.com/gutenberg/")
Note: the mirror option is needed because the default mirror was not working.
str(hgwells_book)
## tibble [85,596 x 2] (S3: tbl_df/tbl/data.frame)
## $ gutenberg_id: int [1:85596] 35 35 35 35 35 35 35 35 35 35 ...
## $ text : chr [1:85596] "The Time Machine" "" "An Invention" "" ...
for (book in hgwells_list) {
book <- gutenberg_works(gutenberg_id == book)
print(book$title)
}
## [1] "The Time Machine"
## [1] "The War of the Worlds"
## [1] "The Invisible Man: A Grotesque Romance"
## [1] "The Island of Doctor Moreau"
## [1] "In the Days of the Comet"
## [1] "The Food of the Gods and How It Came to Earth"
## [1] "Five Children and It"
## [1] "When the Sleeper Wakes"
## [1] "The First Men in the Moon"
## [1] "The World Set Free"
## [1] "The Country of the Blind, and Other Stories"
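Since the downloaded data only carries `gutenberg_id`, one possible way to keep titles next to the text is to join a small lookup table onto it. The sketch below uses hypothetical stand-ins (`toy_titles`, `toy_book`) with two of the IDs and titles printed above; it is not part of the original workflow.

```r
library(dplyr)

# Hypothetical lookup table of IDs and titles
toy_titles <- tibble(
  gutenberg_id = c(35L, 36L),
  title = c("The Time Machine", "The War of the Worlds")
)

# Hypothetical stand-in for the downloaded text (one row per line)
toy_book <- tibble(
  gutenberg_id = c(35L, 35L, 36L),
  text = c("The Time Machine", "An Invention", "The War of the Worlds")
)

# Attach the title to every line of text by matching on the ID
toy_labelled <- toy_book %>%
  left_join(toy_titles, by = "gutenberg_id")

toy_labelled
```

A table like this also makes a mismatched ID easy to spot, because the unexpected title shows up on every line of that book.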
tidy_hgwells <- hgwells_book %>%
unnest_tokens(word, text) %>%
mutate(word = str_extract(word, "[a-z']+")) %>%
anti_join(custom_stop_words)
tidy_hgwells <- na.omit(tidy_hgwells)
tidy_hgwells
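The str_extract() and na.omit() steps work together. Project Gutenberg texts sometimes wrap words in underscores to mark emphasis, and some tokens contain no letters at all. A minimal sketch on a hypothetical token vector shows what the pattern "[a-z']+" keeps:

```r
library(stringr)

# Hypothetical tokens: an emphasised word, a plain word, a number, a contraction
toy_tokens <- c("_any_", "any", "1895", "don't")

# Keep the first run of lower-case letters or apostrophes; NA if there is none
cleaned <- str_extract(toy_tokens, "[a-z']+")

cleaned
```

Tokens without any letters become NA, which is why the rows are passed through na.omit() afterwards.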
tidy_hgwells %>%
count(word, sort=TRUE) %>%
filter(n>500) %>%
mutate(word=reorder(word,n)) %>%
ggplot(aes(word,n)) +
geom_col() +
xlab(NULL) +
ylab("Word Count") +
coord_flip() +
ggtitle("H.G.Wells Books Most Commonly Used Words") +
theme_classic()
The following are the Bronte books included:
Jane Eyre
Wuthering Heights
The Tenant of Wildfell Hall
Villette
Agnes Grey
The following code downloads the Bronte books from Project Gutenberg.
bronte_book <- gutenberg_download(c(1260, 768, 969, 9182, 767),
mirror = "http://mirrors.xmission.com/gutenberg/")
Note: the mirror option is needed because the default mirror was not working.
str(bronte_book)
## tibble [80,089 x 2] (S3: tbl_df/tbl/data.frame)
## $ gutenberg_id: int [1:80089] 767 767 767 767 767 767 767 767 767 767 ...
## $ text : chr [1:80089] "Agnes Grey" "A NOVEL," "" "by ACTON BELL." ...
book_list <- c(1260, 768, 969, 9182, 767)
for (book in book_list) {
book <- gutenberg_works(gutenberg_id == book)
print(book$title)
}
## [1] "Jane Eyre: An Autobiography"
## [1] "Wuthering Heights"
## [1] "The Tenant of Wildfell Hall"
## [1] "Villette"
## [1] "Agnes Grey"
tidy_bronte <- bronte_book %>%
unnest_tokens(word, text) %>%
mutate(word = str_extract(word, "[a-z']+")) %>%
anti_join(custom_stop_words)
tidy_bronte <- na.omit(tidy_bronte)
tidy_bronte
tidy_bronte %>%
count(word, sort=TRUE) %>%
filter(n>500) %>%
mutate(word=reorder(word,n)) %>%
ggplot(aes(word,n)) +
geom_col() +
xlab(NULL) +
ylab("Word Count") +
coord_flip() +
ggtitle("Bronte Books Most Commonly Used Words") +
theme_classic()
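The correlation tests that follow use a `frequency` table with the columns `word`, `Jane Austen`, `author` and `proportion`. Its construction is not shown here, so the sketch below is only an assumption about how such a table could be built, modelled on the approach in the book and run on tiny made-up word tables (`toy_austen`, `toy_wells`, `toy_bronte`) instead of the real ones.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for the three tidy word tables
toy_austen <- tibble(word = c("time", "time", "dear", "lady"))
toy_wells  <- tibble(word = c("time", "door", "door", "people"))
toy_bronte <- tibble(word = c("time", "dear", "door", "heart"))

frequency <- bind_rows(
  mutate(toy_bronte, author = "Bronte Sisters"),
  mutate(toy_wells,  author = "H.G.Wells"),
  mutate(toy_austen, author = "Jane Austen")
) %>%
  count(author, word) %>%                  # word counts per author
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%      # share of each author's words
  ungroup() %>%
  select(-n) %>%
  # one column per author, then fold the two comparison authors back
  # into rows so that Jane Austen stays as the reference column
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer(c(`Bronte Sisters`, `H.G.Wells`),
               names_to = "author", values_to = "proportion")

frequency
```

Words that an author never uses appear as NA in this layout, both in the `Jane Austen` column and in `proportion`.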
cor.test(data = frequency[frequency$author == "Bronte Sisters",],
~ proportion + `Jane Austen`)
##
## Pearson's product-moment correlation
##
## data: proportion and Jane Austen
## t = 116.29, df = 10211, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7463606 0.7630526
## sample estimates:
## cor
## 0.7548288
cor.test(data = frequency[frequency$author == "H.G.Wells",],
~ proportion + `Jane Austen`)
##
## Pearson's product-moment correlation
##
## data: proportion and Jane Austen
## t = 61.692, df = 9287, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5245661 0.5534195
## sample estimates:
## cor
## 0.539151
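In the output above, the estimated correlation with Jane Austen is higher for the Bronte sisters (about 0.75) than for H.G. Wells (about 0.54), which suggests that Austen's word frequencies are closer to the Brontes' than to Wells's. As a minimal sketch of the same formula interface, the toy data frame below (made-up proportions, hypothetical `toy` object) shows how the backticked column name is used and how the pieces of the result can be read. A row with a missing value is included on the assumption that the default na.action (na.omit) drops it.

```r
# Hypothetical proportions; the last row is incomplete on purpose
toy <- data.frame(
  proportion    = c(0.010, 0.020, 0.030, 0.050, 0.080, 0.040),
  `Jane Austen` = c(0.012, 0.018, 0.035, 0.045, 0.090, NA),
  check.names = FALSE                 # keep the space in the column name
)

# Pearson correlation between the two columns; the backticks are
# needed because the column name contains a space
result <- cor.test(~ proportion + `Jane Austen`, data = toy)

result$estimate    # correlation coefficient
result$parameter   # degrees of freedom (complete pairs minus 2)
result$conf.int    # 95 percent confidence interval
```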