Introduction

This exercise follows the example in Chapter 1 of the book “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson.

The exercise illustrates how to analyze and compare word usage frequency across books by different authors.

The code is adapted from the book's sample code. The changes were made to:

  • clarify the steps for easier understanding,

  • fix coding errors caused by package version changes,

  • add some minor enhancements.

The authors included in the analysis are:

  • Jane Austen

  • H.G. Wells

  • The Bronte sisters

Learning Outcome

The following are the learning outcomes:

Import library

The following are the libraries required for this exercise.

library(rmarkdown)
library(janeaustenr)
library(gutenbergr)
library(dplyr)
library(tidytext)
library(stringr)
library(tidyr)
library(scales)
library(ggplot2)
library(ggthemes)

Jane Austen Books

a. Import Jane Austen Books

jane_austen_book <- austen_books()

i. Check structure of data loaded

str(jane_austen_book)
## tibble [73,422 x 2] (S3: tbl_df/tbl/data.frame)
##  $ text: chr [1:73422] "SENSE AND SENSIBILITY" "" "by Jane Austen" "" ...
##  $ book: Factor w/ 6 levels "Sense & Sensibility",..: 1 1 1 1 1 1 1 1 1 1 ...

ii. Check books loaded

Check the list of books with the levels() function, which returns the categories of a factor column:

levels(jane_austen_book$book)
## [1] "Sense & Sensibility" "Pride & Prejudice"   "Mansfield Park"     
## [4] "Emma"                "Northanger Abbey"    "Persuasion"

Another approach is using unique():

unique(jane_austen_book$book)
## [1] Sense & Sensibility Pride & Prejudice   Mansfield Park     
## [4] Emma                Northanger Abbey    Persuasion         
## 6 Levels: Sense & Sensibility Pride & Prejudice Mansfield Park ... Persuasion

Or, using a pipe with the distinct() function:

jane_austen_book %>% distinct(book)

b. Extract line number and chapter number

jane_austen_book <- austen_books() %>% 
  group_by(book) %>%
    mutate(linenumber = row_number(),
    chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                            ignore_case=TRUE)))) %>%
  ungroup()
jane_austen_book

Explanation: (reference: https://regex101.com/r/3kzOvH/1)

  • ^: asserts position at start of a line

  • chapter: matches the characters chapter literally

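  • [\divxlc]: a character class that matches exactly one character from the following

  • \d: matches a digit (equivalent to [0-9]); it is written as \\d inside an R string

  • ivxlc: matches a single character in the list “ivxlc” (the letters used in Roman numerals)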
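
As a small illustration of how this pattern drives the chapter counter, the sketch below applies the same regex to a few sample lines (the `lines` vector is a hypothetical stand-in, not the real data). str_detect() flags each heading line, and cumsum() turns those flags into a running chapter number, so lines before the first heading get chapter 0.

```r
library(stringr)

# Hypothetical sample lines, used only for illustration
lines <- c("SENSE AND SENSIBILITY",
           "CHAPTER 1",
           "The family of Dashwood had long been settled in Sussex.",
           "Chapter II",
           "Some more text.",
           "chapters are not headings")

# TRUE where a line starts with "chapter " followed by a digit
# or a Roman numeral letter (case-insensitive)
is_heading <- str_detect(lines, regex("^chapter [\\divxlc]", ignore_case = TRUE))

# Running count of headings seen so far = chapter number of each line
chapter <- cumsum(is_heading)

data.frame(chapter, is_heading, lines)
```

The last sample line is not counted because the pattern requires a space directly after "chapter".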

c. Convert Text to tidytext

Convert the text to the tidy text format, which has one token per row. The default token is a single word.

jane_austen_tidy_book <- jane_austen_book %>% 
  unnest_tokens(word, text)
jane_austen_tidy_book
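
To see what the conversion does on a small scale, here is a minimal sketch on a two-line tibble (`toy_text` is a hypothetical example, not part of the exercise data). It shows that unnest_tokens() splits each line into words, lowercases them, strips punctuation, and keeps the other columns.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line input
toy_text <- tibble(line = 1:2,
                   text = c("The family of Dashwood",
                            "had long been settled in Sussex."))

# One row per word; the `line` column is carried along
toy_tidy <- toy_text %>%
  unnest_tokens(word, text)
toy_tidy
```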

d. Remove Stop Words

i. Load Stop Words

data("stop_words")

The stop_words data set combines stop words from three lexicons:

  • “onix”

  • “SMART”

  • “snowball”
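
A quick way to check how the lexicons contribute is to count rows per lexicon. This is only a sketch; `lexicon_counts` is a name introduced here for illustration.

```r
library(dplyr)
library(tidytext)

data("stop_words")

# Number of stop words contributed by each lexicon
lexicon_counts <- stop_words %>%
  count(lexicon)
lexicon_counts

# The same word can appear in more than one lexicon,
# so the number of distinct words can be smaller than the number of rows
n_distinct(stop_words$word)
```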

ii. Remove stop words

# keep only the rows whose word does not appear in stop_words
jane_austen_tidy_book <- jane_austen_tidy_book %>% 
  anti_join(stop_words, by = "word")


jane_austen_tidy_book %>% 
  count(word, sort=TRUE) %>% 
  filter(n>500)  

The words “miss” and “sir” are titles for women and men in the text. They carry little meaning for this analysis, so they are added to the stop words, together with the Roman numerals used in chapter headings.

custom_stop_words <- bind_rows(tibble(word=c("miss", "sir", "ii", "iii", "iv", 
                                             "v", "vi", "vii", "viii","ix", 
                                             "xi", "xii", "xiii", "xiv", 
                                             "xv", "xvi", "nil", "NA"), 
                                      lexicon = c("custom")), 
                                      stop_words)
custom_stop_words     

Then remove the words in the customized stop word list from the books.


jane_austen_tidy_book <- jane_austen_tidy_book %>% 
  anti_join(custom_stop_words, by = "word")
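
The two steps above (extend the stop word table, then filter with it) can be sketched on toy data. The tibbles `toy_words` and `toy_stop_words` are hypothetical; the second one only stands in for the real stop_words table.

```r
library(dplyr)

# Hypothetical tidy words, one per row
toy_words <- tibble(word = c("miss", "emma", "the", "sir", "time"))

# Custom entries stacked on top of a stand-in for the standard list
toy_stop_words <- bind_rows(
  tibble(word = c("miss", "sir"), lexicon = "custom"),
  tibble(word = c("the", "of"), lexicon = "SMART")
)

# anti_join() keeps the rows of toy_words with no match in toy_stop_words
toy_kept <- toy_words %>%
  anti_join(toy_stop_words, by = "word")
toy_kept
```

Naming the join column with by = "word" also avoids the "Joining, by" message that dplyr prints when it has to guess the key.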


jane_austen_tidy_book %>% 
  count(word, sort=TRUE) %>% 
  filter(n>500)  

e. Count, Sort and Plot Most Common Words

jane_austen_tidy_book %>% 
  count(word, sort=TRUE) %>% 
  filter(n>500) %>%
  mutate(word=reorder(word,n)) %>%
  ggplot(aes(word,n)) + 
  geom_col() + 
  xlab(NULL) +
  ylab("Word Count") +
  coord_flip() + 
  ggtitle("Jane Austen Books Most Commonly Used Words") +
  theme_classic()
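
The plot depends on reorder() to sort the bars by count instead of alphabetically. A minimal sketch with made-up counts (the `word` and `n` vectors are hypothetical) shows the effect on the factor levels:

```r
# Hypothetical word counts
word <- c("time", "heart", "lady")
n <- c(30, 10, 20)

# reorder() returns a factor whose levels are sorted by n (ascending)
word_sorted <- reorder(word, n)
levels(word_sorted)
```

With coord_flip(), the first level is drawn at the bottom, so the most frequent word ends up at the top of the chart.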

H.G. Wells Books

a. Download H.G. Wells Books

i. Check all the books by Wells, H. G. (Herbert George)

hgwells_booklist <- gutenberg_works(author == "Wells, H. G. (Herbert George)")
hgwells_list <- tibble(id= hgwells_booklist$gutenberg_id, book = hgwells_booklist$title)
hgwells_list

ii. Download Selected H.G. Wells books from Gutenberg

The following H.G. Wells books are included (Project Gutenberg ID in brackets):

  • The Time Machine [35]

  • The War of the Worlds [36]

  • The Invisible Man [5230]

  • The Island of Doctor Moreau [159]

  • In the Days of the Comet [3797]

  • The Food of the Gods and How It Came to Earth [11696]

  • The War in the Air [778]

  • When the Sleeper Wakes [775]

  • The First Men in the Moon [1013]

  • The World Set Free [1059]

  • The Country of the Blind, and Other Stories [11870]

hgwells_list <- c(35, 36, 5230, 159, 3797, 11696, 778, 775, 1013, 1059, 11870)
hgwells_book <- gutenberg_download(hgwells_list, 
                                   mirror = "http://mirrors.xmission.com/gutenberg/")

Note: The mirror argument is set explicitly because the default mirror was not working when this exercise was run.

iii. Check structure of data loaded

str(hgwells_book)
## tibble [85,596 x 2] (S3: tbl_df/tbl/data.frame)
##  $ gutenberg_id: int [1:85596] 35 35 35 35 35 35 35 35 35 35 ...
##  $ text        : chr [1:85596] "The Time Machine" "" "An Invention" "" ...

iv. Confirm the books downloaded

# look up each downloaded ID in the Gutenberg metadata and print its title
for (book_id in hgwells_list) {
  work <- gutenberg_works(gutenberg_id == book_id)
  print(work$title)
}
## [1] "The Time Machine"
## [1] "The War of the Worlds"
## [1] "The Invisible Man: A Grotesque Romance"
## [1] "The Island of Doctor Moreau"
## [1] "In the Days of the Comet"
## [1] "The Food of the Gods and How It Came to Earth"
## [1] "Five Children and It"
## [1] "When the Sleeper Wakes"
## [1] "The First Men in the Moon"
## [1] "The World Set Free"
## [1] "The Country of the Blind, and Other Stories"

Note: ID 778 resolves to “Five Children and It” rather than “The War in the Air”, so one of the downloaded texts is not the intended book. The ID for “The War in the Air” should be looked up in the gutenberg_works() listing from step i and corrected before relying on the H.G. Wells results.

b. Convert to tidy format and remove stopwords

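tidy_hgwells <- hgwells_book %>%
  unnest_tokens(word, text) %>%
  # keep only letters and apostrophes; tokens without any become NA
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(custom_stop_words, by = "word")
# drop the rows whose word is NA
tidy_hgwells <- na.omit(tidy_hgwells)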
tidy_hgwells
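
The str_extract() step is there because Project Gutenberg texts mark emphasis with underscores, and the tokenizer keeps them, so “_any_” and “any” would otherwise be counted as different words. A minimal sketch on hypothetical tokens (the `tokens` vector is invented for illustration):

```r
library(stringr)

# Hypothetical tokens as they might come out of unnest_tokens()
tokens <- c("time", "_any_", "don't", "1895", "_machine")

# Keep the first run of lowercase letters or apostrophes in each token;
# tokens with no such run (pure numbers) become NA
cleaned <- str_extract(tokens, "[a-z']+")
cleaned

# Dropping the NA entries mirrors what na.omit() does to the tidy table
cleaned[!is.na(cleaned)]
```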

c. Count, Sort and Plot Most Common Words

tidy_hgwells %>% 
  count(word, sort=TRUE) %>% 
  filter(n>500) %>%
  mutate(word=reorder(word,n)) %>%
  ggplot(aes(word,n)) + 
  geom_col() + 
  xlab(NULL) +
  ylab("Word Count") +
  coord_flip() + 
  ggtitle("H.G.Wells Books Most Commonly Used Words") +
  theme_classic()

Bronte Sisters Books

a. Download Bronte Sisters Books

The following are the Bronte books included:

  • Jane Eyre

  • Wuthering Heights

  • The Tenant of Wildfell Hall

  • Villette

  • Agnes Grey

The following code downloads the Bronte books from Project Gutenberg.

bronte_book <- gutenberg_download(c(1260, 768, 969, 9182, 767), 
                                   mirror = "http://mirrors.xmission.com/gutenberg/")

Note: The mirror argument is set explicitly because the default mirror was not working when this exercise was run.

i. Check structure of data loaded

str(bronte_book)
## tibble [80,089 x 2] (S3: tbl_df/tbl/data.frame)
##  $ gutenberg_id: int [1:80089] 767 767 767 767 767 767 767 767 767 767 ...
##  $ text        : chr [1:80089] "Agnes Grey" "A NOVEL," "" "by ACTON BELL." ...

ii. Check books loaded

book_list <- c(1260, 768, 969, 9182, 767)
# look up each downloaded ID in the Gutenberg metadata and print its title
for (book_id in book_list) {
  work <- gutenberg_works(gutenberg_id == book_id)
  print(work$title)
}
## [1] "Jane Eyre: An Autobiography"
## [1] "Wuthering Heights"
## [1] "The Tenant of Wildfell Hall"
## [1] "Villette"
## [1] "Agnes Grey"

b. Convert to tidy format and remove stopwords

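tidy_bronte <- bronte_book %>%
  unnest_tokens(word, text) %>%
  # keep only letters and apostrophes; tokens without any become NA
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(custom_stop_words, by = "word")
# drop the rows whose word is NA
tidy_bronte <- na.omit(tidy_bronte)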
tidy_bronte

c. Count, Sort and Plot Most Common Words

tidy_bronte %>% 
  count(word, sort=TRUE) %>% 
  filter(n>500) %>%
  mutate(word=reorder(word,n)) %>%
  ggplot(aes(word,n)) + 
  geom_col() + 
  xlab(NULL) +
  ylab("Word Count") +
  coord_flip() + 
  ggtitle("Bronte Books Most Commonly Used Words") +
  theme_classic()

Comparison

Comparison by plots

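Create a combined frequency table with a new column, “author”. Jane Austen's proportions stay in their own column so that each of the other two authors can be compared against them.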

frequency <- bind_rows(mutate(tidy_bronte, author = "Bronte Sisters"),
                       mutate(tidy_hgwells, author = "H.G.Wells"), 
                       mutate(jane_austen_tidy_book, author = "Jane Austen")) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  # count word by author
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer("Bronte Sisters":"H.G.Wells",
               names_to = "author", values_to = "proportion")
  
frequency
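
The reshaping is easier to follow on a tiny data set. The sketch below runs the same count, proportion, pivot_wider() and pivot_longer() steps on ten hypothetical words (the `toy` tibble is invented; ungroup() is added here only to make the grouping explicit, and the columns are selected by name instead of by range).

```r
library(dplyr)
library(tidyr)

# Hypothetical tidy words for the three author groups
toy <- tibble(
  author = c(rep("Bronte Sisters", 4), rep("H.G.Wells", 2), rep("Jane Austen", 4)),
  word   = c("time", "time", "heart", "moor",
             "time", "machine",
             "time", "heart", "heart", "lady")
)

toy_frequency <- toy %>%
  count(author, word) %>%                      # word count per author
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%          # share of that author's words
  ungroup() %>%
  select(-n) %>%
  # one column per author, one row per word (NA if the author never uses it)
  pivot_wider(names_from = author, values_from = proportion) %>%
  # stack the two comparison authors again, keeping `Jane Austen` as a column
  pivot_longer(c(`Bronte Sisters`, `H.G.Wells`),
               names_to = "author", values_to = "proportion")
toy_frequency
```

Words that an author never uses end up with NA in `proportion` or in `Jane Austen`. In the full data set, rows like these cannot be placed on the comparison plot.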


ggplot(frequency, aes(x = proportion, y = `Jane Austen`, 
                      color = abs(`Jane Austen` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), 
                       low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Jane Austen", x = NULL)
## Warning: Removed 51916 rows containing missing values (geom_point).
## Warning: Removed 51916 rows containing missing values (geom_text).

Comparison by Correlation Coefficient

a. Between Jane Austen and Bronte sisters

cor.test(data = frequency[frequency$author == "Bronte Sisters",],
         ~ proportion + `Jane Austen`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Jane Austen
## t = 116.29, df = 10211, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7463606 0.7630526
## sample estimates:
##       cor 
## 0.7548288

b. Between Jane Austen and H.G. Wells

cor.test(data = frequency[frequency$author == "H.G.Wells",], 
         ~ proportion + `Jane Austen`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Jane Austen
## t = 61.692, df = 9287, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5245661 0.5534195
## sample estimates:
##      cor 
## 0.539151
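
For readers unfamiliar with the formula interface of cor.test(), here is a minimal sketch on hypothetical proportions (the `toy` data frame and its values are invented). It shows that rows with a missing value are dropped before the correlation is computed, which is why the degrees of freedom reported above are smaller than the number of rows in `frequency`.

```r
# Hypothetical proportions for six words; NA marks a word one author never uses
toy <- data.frame(
  proportion    = c(0.010, 0.020, 0.005, NA,    0.030, 0.015),
  `Jane Austen` = c(0.012, 0.018, 0.004, 0.020, 0.025, 0.010),
  check.names = FALSE
)

# Same call shape as above: a one-sided formula naming the two columns
result <- cor.test(~ proportion + `Jane Austen`, data = toy)

result$estimate   # Pearson correlation on the complete pairs only
result$parameter  # degrees of freedom = complete pairs - 2
```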
