Introduction

This exercise follows the example in chapter 1 of the book “Text Mining with R: A Tidy Approach” by Julia Silge and David Robinson.

The exercise illustrates how to analyze and compare word usage frequency across books by different authors.

The code is adapted from the sample code in the book. The modifications are for:

clarification of understanding
rectification of coding errors due to version changes
some minor enhancements

The authors included in the analysis are:

Jane Austen
H.G. Wells
Bronte Sisters

Learning Outcome

The learning outcomes are:

how to convert text into the tidytext format (one token per row)
how to remove stop words
how to count word frequency
how to compare word frequency between different authors’ books using visual plots
how to compare word frequency between different authors’ books using the correlation coefficient

Import library

The following are the libraries required for this exercise.

library(rmarkdown)
library(janeaustenr)
library(gutenbergr)
library(dplyr)
library(tidytext)
library(stringr)
library(tidyr)
library(scales)
library(ggplot2)
library(ggthemes)

Jane Austen Books

a. Import Jane Austen Books

jane_austen_book <- austen_books()

i. Check structure of data loaded

str(jane_austen_book)

## tibble [73,422 x 2] (S3: tbl_df/tbl/data.frame)
##  $ text: chr [1:73422] "SENSE AND SENSIBILITY" "" "by Jane Austen" "" ...
##  $ book: Factor w/ 6 levels "Sense & Sensibility",..: 1 1 1 1 1 1 1 1 1 1 ...

ii. Check books loaded

Check the list of books using the levels() function, since the book column is a factor (categorical data).

levels(jane_austen_book$book)

## [1] "Sense & Sensibility" "Pride & Prejudice"   "Mansfield Park"     
## [4] "Emma"                "Northanger Abbey"    "Persuasion"

Another approach is using unique():

unique(jane_austen_book$book)

## [1] Sense & Sensibility Pride & Prejudice   Mansfield Park     
## [4] Emma                Northanger Abbey    Persuasion         
## 6 Levels: Sense & Sensibility Pride & Prejudice Mansfield Park ... Persuasion

Or, using piping with the distinct() function:

jane_austen_book %>% distinct(book)

b. Extract line number and chapter number

jane_austen_book <- austen_books() %>% 
  group_by(book) %>%
  # number the lines within each book and keep a running chapter count
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup()
jane_austen_book

Explanation: (reference: https://regex101.com/r/3kzOvH/1)

^: asserts position at start of a line
chapter: matches the characters chapter literally
\d: matches a digit (equivalent to [0-9])
ivxlc: matches a single character in the list “ivxlc” (Roman numeral characters)
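
As a minimal sketch of how this pattern drives the chapter counter, the toy lines here (made-up data, not taken from the books) show which lines are detected as chapter headings and how cumsum() turns those detections into a running chapter number. The names toy_text, is_heading and chapter_no are illustrative only.

```r
library(stringr)

# Made-up lines standing in for the text column of one book
toy_text <- c("SENSE AND SENSIBILITY", "CHAPTER 1", "The family of Dashwood",
              "Chapter II", "chapters of her life", "chapter x")

# TRUE where a line starts with "chapter " followed by a digit
# or one of the letters i, v, x, l, c (case is ignored)
is_heading <- str_detect(toy_text,
                         regex("^chapter [\\divxlc]", ignore_case = TRUE))

# cumsum() counts TRUE as 1, so the running total is the chapter number
chapter_no <- cumsum(is_heading)

print(data.frame(toy_text, is_heading, chapter_no))
```

Lines that come before the first detected heading get chapter number 0, and "chapters of her life" is not counted because the pattern requires a space right after "chapter".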

c. Convert Text to tidytext

Convert the text to the tidytext format, which has one token (a word, by default) per row.

jane_austen_tidy_book <- jane_austen_book %>% 
  unnest_tokens(word, text)
jane_austen_tidy_book
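
To make the one-token-per-row idea concrete, here is a small sketch on a two-line toy table. The table toy and its columns line and text are assumptions made for illustration; only unnest_tokens() itself comes from the exercise.

```r
library(dplyr)
library(tidytext)

# Two made-up input rows: one row per line of text
toy <- tibble(line = 1:2,
              text = c("It is a truth universally acknowledged,",
                       "that a single Man in possession of a good Fortune,"))

# unnest_tokens(output column, input column) splits each line into words,
# lowercases them and strips punctuation; other columns (line) are kept
toy_tidy <- toy %>%
  unnest_tokens(word, text)

print(toy_tidy)
```

Each word now sits in its own row, still labelled with the line it came from, which is what makes the later count() and anti_join() steps possible.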

d. Remove Stop Words

i. Load Stop Words

data("stop_words")

The stop_words data set combines stop words from three lexicons:

“onix”
“SMART”
“snowball”

ii. Remove stop_words

jane_austen_tidy_book <- jane_austen_tidy_book %>% 
  anti_join(stop_words)

jane_austen_tidy_book %>% 
  count(word, sort=TRUE) %>% 
  filter(n>500)

The words “miss” and “sir” are titles for women and men in the text. They carry little meaning for this analysis, so they are added to the stop words, together with Roman numerals such as those used for chapter numbers, “nil” and “NA”.

custom_stop_words <- bind_rows(tibble(word=c("miss", "sir", "ii", "iii", "iv", 
                                             "v", "vi", "vii", "viii","ix", 
                                             "xi", "xii", "xiii", "xiv", 
                                             "xv", "xvi", "nil", "NA"), 
                                      lexicon = c("custom")), 
                                      stop_words)
custom_stop_words

Then remove the words in the customized stop word list from the book.

jane_austen_tidy_book <- jane_austen_tidy_book %>% 
  anti_join(custom_stop_words)

jane_austen_tidy_book %>% 
  count(word, sort=TRUE) %>% 
  filter(n>500)
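
The stop word removal above relies on anti_join(), which keeps only the rows of the first table that have no match in the second. A minimal sketch with made-up tables (toy_words and toy_stop are hypothetical names) shows the effect; by = "word" is spelled out here to name the matching column explicitly.

```r
library(dplyr)

# Made-up tidy words and a made-up stop word list
toy_words <- tibble(word = c("miss", "emma", "the", "time", "sir"))
toy_stop  <- tibble(word = c("the", "miss", "sir"), lexicon = "toy")

# Keep only the words that do NOT appear in the stop word list
kept <- toy_words %>%
  anti_join(toy_stop, by = "word")

print(kept)
```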

e. Count, Sort and Plot Most Common Words

jane_austen_tidy_book %>% 
  count(word, sort=TRUE) %>% 
  filter(n>500) %>%
  mutate(word=reorder(word,n)) %>%
  ggplot(aes(word,n)) + 
  geom_col() + 
  xlab(NULL) +
  ylab("Word Count") +
  coord_flip() + 
  ggtitle("Jane Austen Books Most Commonly Used Words") +
  theme_classic()
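
As a small sketch of the counting and ordering steps used before plotting, the toy data here (made up for illustration) shows that count(sort = TRUE) returns the most frequent word first, while reorder() sets the factor levels in increasing order of n so that the bars are drawn in order of frequency.

```r
library(dplyr)

# Made-up tidy words
toy_words <- tibble(word = c("time", "emma", "time", "dear", "emma", "time"))

toy_counts <- toy_words %>%
  count(word, sort = TRUE) %>%        # one row per word, largest n first
  mutate(word = reorder(word, n))     # factor levels ordered by n

print(toy_counts)
print(levels(toy_counts$word))
```

Without reorder(), ggplot2 would place the words in alphabetical order rather than by frequency.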

H.G. Wells Books

a. Download H.G. Wells Books

i. Check all the books by Wells, H. G. (Herbert George)

hgwells_booklist <- gutenberg_works(author == "Wells, H. G. (Herbert George)")
hgwells_list <- tibble(id= hgwells_booklist$gutenberg_id, book = hgwells_booklist$title)
hgwells_list

ii. Download Selected H.G. Wells books from Gutenberg

The following H.G. Wells books are included (Gutenberg ID in brackets):

The Time Machine [35]
The War of the Worlds [36]
The Invisible Man [5230]
The Island of Doctor Moreau [159]
In the Days of the Comet [3797]
The Food of the Gods and How It Came to Earth [11696]
The War in the Air [778]
When the Sleeper Wakes [775]
The First Men in the Moon [1013]
The World Set Free [1059]
The Country of the Blind, and Other Stories [11870]

hgwells_list <- c(35, 36, 5230, 159, 3797, 11696, 778, 775, 1013, 1059, 11870)
hgwells_book <- gutenberg_download(hgwells_list, 
                                   mirror = "http://mirrors.xmission.com/gutenberg/")

Note: The mirror option is specified because the default mirror was not working.

iii. Check structure of data loaded

str(hgwells_book)

## tibble [85,596 x 2] (S3: tbl_df/tbl/data.frame)
##  $ gutenberg_id: int [1:85596] 35 35 35 35 35 35 35 35 35 35 ...
##  $ text        : chr [1:85596] "The Time Machine" "" "An Invention" "" ...

iv. Confirm the books extracted

for (book in hgwells_list) {
  book <- gutenberg_works(gutenberg_id == book)
  print(book$title)
}

## [1] "The Time Machine"
## [1] "The War of the Worlds"
## [1] "The Invisible Man: A Grotesque Romance"
## [1] "The Island of Doctor Moreau"
## [1] "In the Days of the Comet"
## [1] "The Food of the Gods and How It Came to Earth"
## [1] "Five Children and It"
## [1] "When the Sleeper Wakes"
## [1] "The First Men in the Moon"
## [1] "The World Set Free"
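## [1] "The Country of the Blind, and Other Stories"

Note: This check shows that ID 778 resolves to “Five Children and It” (a novel by E. Nesbit), not “The War in the Air” as intended, so the downloaded set and the results that follow include that book instead. The correct ID should be looked up in the list produced in step i before rerunning the analysis.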

b. Convert to tidy format and remove stopwords

tidy_hgwells <- hgwells_book %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(custom_stop_words)
tidy_hgwells <- na.omit(tidy_hgwells)
tidy_hgwells
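
The str_extract() step is worth a short illustration. Project Gutenberg texts sometimes wrap a word in underscores to mark emphasis, and some tokens are pure numbers. The sketch here uses made-up tokens to show what the pattern "[a-z']+" keeps and why na.omit() is needed afterwards.

```r
library(stringr)

# Made-up tokens of the kind unnest_tokens() can return for Gutenberg text
toy_tokens <- c("_any_", "any", "1895", "don't")

# Keep the first run of lowercase letters or apostrophes in each token;
# tokens without any such run become NA
cleaned <- str_extract(toy_tokens, "[a-z']+")
print(cleaned)

# Drop the NA entries, as na.omit() does for the whole table in the exercise
print(cleaned[!is.na(cleaned)])
```

After this step "_any_" and "any" are counted as the same word.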

c. Count, Sort and Plot Most Common Words

tidy_hgwells %>% 
  count(word, sort=TRUE) %>% 
  filter(n>500) %>%
  mutate(word=reorder(word,n)) %>%
  ggplot(aes(word,n)) + 
  geom_col() + 
  xlab(NULL) +
  ylab("Word Count") +
  coord_flip() + 
  ggtitle("H.G.Wells Books Most Commonly Used Words") +
  theme_classic()

Bronte Sisters Books

a. Download Bronte Sisters Books

The following are the Bronte books included:

Jane Eyre
Wuthering Heights
The Tenant of Wildfell Hall
Villette
Agnes Grey

The following code downloads the Bronte books from Project Gutenberg.

bronte_book <- gutenberg_download(c(1260, 768, 969, 9182, 767), 
                                   mirror = "http://mirrors.xmission.com/gutenberg/")

Note: The mirror option is specified because the default mirror was not working.

i. Check structure of data loaded

str(bronte_book)

## tibble [80,089 x 2] (S3: tbl_df/tbl/data.frame)
##  $ gutenberg_id: int [1:80089] 767 767 767 767 767 767 767 767 767 767 ...
##  $ text        : chr [1:80089] "Agnes Grey" "A NOVEL," "" "by ACTON BELL." ...

ii. Check books loaded

book_list <- c(1260, 768, 969, 9182, 767)
for (book in book_list) {
  book <- gutenberg_works(gutenberg_id == book)
  print(book$title)
}

## [1] "Jane Eyre: An Autobiography"
## [1] "Wuthering Heights"
## [1] "The Tenant of Wildfell Hall"
## [1] "Villette"
## [1] "Agnes Grey"

b. Convert to tidy format and remove stopwords

tidy_bronte <- bronte_book %>%
  unnest_tokens(word, text) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(custom_stop_words)
tidy_bronte <- na.omit(tidy_bronte)
tidy_bronte

c. Count, Sort and Plot Most Common Words

tidy_bronte %>% 
  count(word, sort=TRUE) %>% 
  filter(n>500) %>%
  mutate(word=reorder(word,n)) %>%
  ggplot(aes(word,n)) + 
  geom_col() + 
  xlab(NULL) +
  ylab("Word Count") +
  coord_flip() + 
  ggtitle("Bronte Books Most Commonly Used Words") +
  theme_classic()

Comparison

Comparison by plots

Create a combined frequency table with a new column “author”

frequency <- bind_rows(mutate(tidy_bronte, author = "Bronte Sisters"),
                       mutate(tidy_hgwells, author = "H.G.Wells"), 
                       mutate(jane_austen_tidy_book, author = "Jane Austen")) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  # count word by author
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer("Bronte Sisters":"H.G.Wells",
               names_to = "author", values_to = "proportion")
  
frequency
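
The proportion and pivoting steps can be sketched on a tiny made-up data set. The authors "A" and "B" and the table toy are hypothetical. The sketch also adds an explicit ungroup() for clarity, which the exercise code does not use. It shows that a word used by only one author ends up with NA for the other.

```r
library(dplyr)
library(tidyr)

# Made-up tidy words for two hypothetical authors
toy <- tibble(author = c("A", "A", "A", "B", "B"),
              word   = c("time", "time", "love", "time", "moon"))

toy_freq <- toy %>%
  count(author, word) %>%                 # word counts per author
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%     # share of that author's words
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = author,        # one column per author
              values_from = proportion)

print(toy_freq)
```

Here "love" has no value for author B and "moon" has none for author A, which is the same kind of missing value that appears in the full frequency table.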

ggplot(frequency, aes(x = proportion, y = `Jane Austen`, 
                      color = abs(`Jane Austen` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), 
                       low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Jane Austen", x = NULL)

## Warning: Removed 51916 rows containing missing values (geom_point).

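## Warning: Removed 51916 rows containing missing values (geom_text).

These warnings are expected: a word that is missing from either Jane Austen’s books or the compared author’s books has no proportion (NA) on one axis, so it cannot be plotted.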

Comparison by Correlation Coefficient

a. Between Jane Austen and Bronte sisters

cor.test(data = frequency[frequency$author == "Bronte Sisters",],
         ~ proportion + `Jane Austen`)

## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Jane Austen
## t = 116.29, df = 10211, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7463606 0.7630526
## sample estimates:
##       cor 
## 0.7548288

b. Between Jane Austen and H.G. Wells

cor.test(data = frequency[frequency$author == "H.G.Wells",], 
         ~ proportion + `Jane Austen`)

## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Jane Austen
## t = 61.692, df = 9287, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5245661 0.5534195
## sample estimates:
##      cor 
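## 0.539151

The correlation with the Bronte sisters’ books (about 0.75) is higher than with the selected H.G. Wells books (about 0.54), so Jane Austen’s word frequencies are more similar to those of the Bronte sisters.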
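
As a rough sketch of how the formula interface of cor.test() behaves, the made-up proportions here include missing values in both columns. Only rows where both values are present are used, which is why the degrees of freedom reported above are far smaller than the number of rows in the frequency table. The data frame toy_cor and its values are assumptions for illustration only.

```r
# Made-up proportions; NA marks a word that one of the authors never uses
toy_cor <- data.frame(
  proportion    = c(0.010, 0.020, NA,    0.040, 0.050, 0.030),
  `Jane Austen` = c(0.012, 0.018, 0.030, NA,    0.055, 0.025),
  check.names = FALSE
)

# Same call shape as in the exercise; rows containing NA are dropped
res <- cor.test(~ proportion + `Jane Austen`, data = toy_cor)

print(res$parameter)   # degrees of freedom = complete rows - 2
print(res$estimate)    # Pearson correlation coefficient
```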


Word Frequency Comparison with JaneAusten/H.G.Wells/Bronte Sisters Books

Tan Bee Hoon

24 May 2021
