6 Text Analysis

In this section, you will learn:

  1. More dplyr
  2. More ggplot
  3. The basics of tidytext
  4. The very basics of topicmodels

6.1 The Adventures of Tom Sawyer
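
These notes don't echo the code, but output like this can be produced by reading the plain-text Project Gutenberg file into a one-row tibble and then counting its characters. A minimal sketch (the file name and object names are assumptions):

library(tidyverse)

# Read the whole file into a single string; one row for the one book.
book <- tibble(Book = 1L, value = read_file("tom_sawyer.txt"))
book

# Total number of characters in the book.
book %>% mutate(value = nchar(value))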

## # A tibble: 1 x 2
##    Book value                                                                                                                                                                      
##   <int> <chr>                                                                                                                                                                      
## 1     1 "The Project Gutenberg EBook of The Adventures of Tom Sawyer, Complete\r\n\r\nby Mark Twain (Samuel Clemens)\r\n\r\n\r\n\r\nThis eBook is for the use of anyone anywhere a~
##   Book  value 
##      1 423754

6.2 Find Chapter Splits

To do the analysis, we need to parse the text. The purpose of this section is not a lesson in text parsing, so we'll skip the details, but I will discuss it a little in class.
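
For reference, one possible approach, as a sketch only (the regular expression and clean-up steps are assumptions and depend on the exact file):

# Split the raw text at each chapter heading, then number the pieces.
chapters <- book %>%
  separate_rows(value, sep = "CHAPTER [IVXLC]+") %>%  # Roman-numeral headings
  mutate(Chapter = row_number() - 1) %>%              # piece 0 is front matter
  filter(Chapter >= 1) %>%
  transmute(Book, Chapter, Text = value)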

## # A tibble: 35 x 3
##     Book Chapter Text                                                                                                                                                              
##    <int>   <dbl> <chr>                                                                                                                                                             
##  1     1       1 " (182K)      “TOM!”  No answer   “TOM!”  No answer   “What’s gone with that boy, I wonder? You TOM!”  No answer   The old lady pulled her spectacles down and lo~
##  2     1       2 " (202K)      SATURDAY morning was come, and all the summer world was bright and fresh, and brimming with life  There was a song in every heart; and if the heart~
##  3     1       3 " (197K)      TOM presented himself before Aunt Polly, who was sitting by an open window in a pleasant rearward apartment, which was bedroom, breakfast-room, din~
##  4     1       4 " (218K)      THE sun rose upon a tranquil world, and beamed down upon the peaceful village like a benediction  Breakfast over, Aunt Polly had family worship: it~
##  5     1       5 " (205K)      ABOUT half-past ten the cracked bell of the small church began to ring, and presently the people began to gather for the morning sermon  The Sunday~
##  6     1       6 " (202K)      MONDAY morning found Tom Sawyer miserable  Monday morning always found him so—because it began another week’s slow suffering in school  He generall~
##  7     1       7 " (175K)      THE harder Tom tried to fasten his mind on his book, the more his ideas wandered  So at last, with a sigh and a yawn, he gave it up  It seemed to h~
##  8     1       8 " (195K)      TOM dodged hither and thither through lanes until he was well out of the track of returning scholars, and then fell into a moody jog  He crossed a ~
##  9     1       9 " (174K)      AT half-past nine, that night, Tom and Sid were sent to bed, as usual  They said their prayers, and Sid was soon asleep  Tom lay awake and waited, ~
## 10     1      10 " (171K)      THE two boys flew on and on, toward the village, speechless with horror  They glanced backward over their shoulders from time to time, apprehensive~
## # ... with 25 more rows

6.3 Tokenize the Book
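
A sketch of the tokenization with tidytext, assuming the chapter table above is named chapters:

library(tidytext)

# One row per word; unnest_tokens() lowercases and strips punctuation by default.
words <- chapters %>%
  unnest_tokens(word, Text)
words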

## # A tibble: 70,882 x 3
##     Book Chapter word  
##    <int>   <dbl> <chr> 
##  1     1       1 182k  
##  2     1       1 tom   
##  3     1       1 no    
##  4     1       1 answer
##  5     1       1 tom   
##  6     1       1 no    
##  7     1       1 answer
##  8     1       1 what’s
##  9     1       1 gone  
## 10     1       1 with  
## # ... with 70,872 more rows

6.4 Remove ‘Stop Words’
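
A sketch, assuming the tokenized table is named words; the Joining message below is printed by the join itself:

# stop_words ships with tidytext; anti_join() drops every token found in it.
words %>%
  anti_join(stop_words)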

## Joining, by = "word"
## # A tibble: 26,251 x 3
##     Book Chapter word  
##    <int>   <dbl> <chr> 
##  1     1       1 182k  
##  2     1       1 tom   
##  3     1       1 answer
##  4     1       1 tom   
##  5     1       1 answer
##  6     1       1 what’s
##  7     1       1 boy   
##  8     1       1 tom   
##  9     1       1 answer
## 10     1       1 lady  
## # ... with 26,241 more rows

6.5 Join Sentiments

tidytext offers several different sentiment lexicons. Let’s explore.
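
The four outputs below are the AFINN, Bing, Loughran, and NRC lexicons, in that order. A sketch (AFINN and NRC are downloaded by the textdata package on first use):

get_sentiments("afinn")     # numeric scores from -5 to 5
get_sentiments("bing")      # positive / negative labels
get_sentiments("loughran")  # finance-oriented categories
get_sentiments("nrc")       # emotions such as trust, fear, and sadness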

## # A tibble: 2,477 x 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,467 more rows
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ... with 6,776 more rows
## # A tibble: 4,150 x 2
##    word         sentiment
##    <chr>        <chr>    
##  1 abandon      negative 
##  2 abandoned    negative 
##  3 abandoning   negative 
##  4 abandonment  negative 
##  5 abandonments negative 
##  6 abandons     negative 
##  7 abdicated    negative 
##  8 abdicates    negative 
##  9 abdicating   negative 
## 10 abdication   negative 
## # ... with 4,140 more rows
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

As you can see, each lexicon offers a slightly different way to explore your text.
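
The two outputs below contrast a left join, which keeps every token and fills unmatched words with NA, against an inner join, which keeps only words found in the lexicon. A sketch using the Bing lexicon:

# All 70,882 tokens; sentiment is NA for words not in the lexicon.
words %>% left_join(get_sentiments("bing"))

# Only the tokens that match the lexicon.
words %>% inner_join(get_sentiments("bing"))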

## Joining, by = "word"
## # A tibble: 70,882 x 4
##     Book Chapter word   sentiment
##    <int>   <dbl> <chr>  <chr>    
##  1     1       1 182k   <NA>     
##  2     1       1 tom    <NA>     
##  3     1       1 no     <NA>     
##  4     1       1 answer <NA>     
##  5     1       1 tom    <NA>     
##  6     1       1 no     <NA>     
##  7     1       1 answer <NA>     
##  8     1       1 what’s <NA>     
##  9     1       1 gone   <NA>     
## 10     1       1 with   <NA>     
## # ... with 70,872 more rows
## Joining, by = "word"
## # A tibble: 4,778 x 4
##     Book Chapter word      sentiment
##    <int>   <dbl> <chr>     <chr>    
##  1     1       1 wonder    positive 
##  2     1       1 pride     positive 
##  3     1       1 well      positive 
##  4     1       1 perplexed negative 
##  5     1       1 loud      negative 
##  6     1       1 enough    positive 
##  7     1       1 well      positive 
##  8     1       1 noise     negative 
##  9     1       1 slack     negative 
## 10     1       1 well      positive 
## # ... with 4,768 more rows

6.6 Descriptive Text Statistics
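
A sketch of the per-chapter sentiment counts, again assuming the Bing lexicon:

# Number of positive and negative words in each chapter.
words %>%
  inner_join(get_sentiments("bing")) %>%
  count(Chapter, sentiment)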

## Joining, by = "word"
## # A tibble: 64 x 3
##    Chapter sentiment     n
##      <dbl> <chr>     <int>
##  1       1 negative     95
##  2       1 positive     81
##  3       2 negative     40
##  4       2 positive     66
##  5       3 negative     96
##  6       3 positive     84
##  7       4 negative     96
##  8       4 positive    147
##  9       5 negative     63
## 10       5 positive     60
## # ... with 54 more rows

6.8 N-Gram Analysis

6.8.1 Uni-Grams
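
With no stop words removed, the most common uni-grams are function words. A sketch:

# Raw word counts across the whole book, most frequent first.
words %>%
  count(word, sort = TRUE)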

## # A tibble: 7,774 x 2
##    word      n
##    <chr> <int>
##  1 the    3708
##  2 and    3059
##  3 a      1807
##  4 to     1696
##  5 of     1474
##  6 he     1158
##  7 was    1126
##  8 it     1090
##  9 in      943
## 10 that    875
## # ... with 7,764 more rows

6.8.2 Remove Stop Words
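
A sketch that produces counts of this shape, assuming the Bing join from 6.5 (the lexicon itself already excludes most pure function words):

# Most frequent sentiment-bearing words.
words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE)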

## Joining, by = "word"
## # A tibble: 1,358 x 3
##    word   sentiment     n
##    <chr>  <chr>     <int>
##  1 well   positive    152
##  2 like   positive    113
##  3 good   positive    101
##  4 work   positive     88
##  5 right  positive     83
##  6 great  positive     68
##  7 dead   negative     59
##  8 enough positive     57
##  9 poor   negative     52
## 10 cave   negative     41
## # ... with 1,348 more rows

6.8.4 Bigrams
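
A sketch of the bigram tokenization; token = "ngrams" slides a two-word window across the text:

# One row per two-word phrase.
bigrams <- chapters %>%
  unnest_tokens(bigram, Text, token = "ngrams", n = 2)
bigrams

# The most common bigrams, again dominated by function words.
bigrams %>%
  count(bigram, sort = TRUE)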

## # A tibble: 70,848 x 3
##     Book Chapter bigram       
##    <int>   <dbl> <chr>        
##  1     1       1 182k tom     
##  2     1       1 tom no       
##  3     1       1 no answer    
##  4     1       1 answer tom   
##  5     1       1 tom no       
##  6     1       1 no answer    
##  7     1       1 answer what’s
##  8     1       1 what’s gone  
##  9     1       1 gone with    
## 10     1       1 with that    
## # ... with 70,838 more rows
## # A tibble: 41,080 x 2
##    bigram        n
##    <chr>     <int>
##  1 of the      364
##  2 in the      298
##  3 and the     184
##  4 it was      175
##  5 to the      175
##  6 he was      147
##  7 and then    126
##  8 was a       116
##  9 he had      110
## 10 there was   110
## # ... with 41,070 more rows

6.8.5 Remove Stop Words in Bigrams
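
A sketch: split each bigram into its two words, then keep only the pairs in which neither half is a stop word:

# Separate "word1 word2" into two columns.
bigrams_separated <- bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")
bigrams_separated

# Count pairs where neither word is a stop word.
bigrams_separated %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)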

## # A tibble: 70,848 x 4
##     Book Chapter word1  word2 
##    <int>   <dbl> <chr>  <chr> 
##  1     1       1 182k   tom   
##  2     1       1 tom    no    
##  3     1       1 no     answer
##  4     1       1 answer tom   
##  5     1       1 tom    no    
##  6     1       1 no     answer
##  7     1       1 answer what’s
##  8     1       1 what’s gone  
##  9     1       1 gone   with  
## 10     1       1 with   that  
## # ... with 70,838 more rows
## # A tibble: 6,910 x 3
##    word1     word2          n
##    <chr>     <chr>      <int>
##  1 project   gutenberg     84
##  2 gutenberg tm            56
##  3 injun     joe           45
##  4 aunt      polly         42
##  5 tom       sawyer        23
##  6 injun     joe’s         18
##  7 tm        electronic    18
##  8 muff      potter        15
##  9 archive   foundation    13
## 10 gutenberg literary      13
## # ... with 6,900 more rows

6.9 Term Frequency

Term Frequency: how often a term occurs in a document (here, a chapter), expressed as a proportion of that document’s words.

Inverse Document Frequency: \(\ln\left(\frac{\text{Total Number of Documents}}{\text{Number of Documents Containing the Word}}\right)\): a measure of how much information the word provides.

Term Frequency - Inverse Document Frequency: Term Frequency * Inverse Document Frequency
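
For instance, in a hypothetical 30-chapter book, a word appearing in 10 chapters has \(idf = \ln(30/10) \approx 1.10\), while a word appearing in all 30 has \(idf = \ln(30/30) = 0\): a word found everywhere carries no distinguishing information.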

6.9.1 Build TF-IDF Data

Words By Chapter
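
A sketch of the wide table below: count each word by chapter, then spread the chapters into columns. Reading the output, n appears to be the number of chapters containing the word (an assumption):

words %>%
  count(word, Chapter, name = "count") %>%
  group_by(word) %>%
  mutate(n = n()) %>%   # number of chapters the word appears in
  ungroup() %>%
  pivot_wider(names_from = Chapter, values_from = count)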

## # A tibble: 7,774 x 36
##    word       n   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`  `11`  `12`  `13`  `14`  `15`  `16`  `17`  `18`  `19`  `20`  `21`  `22`  `23`  `24`  `25`  `27`  `28`
##    <chr>  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1 a         32    66    75    62   132    58    94    34    53    54    52    29    42    81    78    36   129    20    56    26    NA    56    42    41     7    NA    32    25
##  2 all       32     4     8     9    20     7    12    12     6    11     2     3     9     9     6     4    28     5    11     5    NA     9     3    12     1    NA     8     7
##  3 and       32   102    77   110   166   101   139    96    88    99    87    66    77   113   107    86   269    67   138    28    NA   100    49    62    13    NA    27    41
##  4 as        32    16     4    12    18    22    14    10     7    12    10    11     9     9    20     5    26    11    27     2    NA     7     6    14     5    NA     6     8
##  5 be        32     4     5     4    15     7     3     6    11     9     4     5     8    16     7     8    21     3    16     5    NA    10     3    10     3    NA     7     4
##  6 before    32     3     3     4     2     2     3     4     3     2     3     6     2     2     4     5     8     4     2     1    NA     1     2     4     3    NA     2     2
##  7 but       32    19    16    18    22    14    33    22    11    11    18    10    13    28    11    13    46     9    25     8    NA    14     9    21     6    NA     4     7
##  8 for       32    26    11    18    46    27    15     9     6    14    10    12    16    15    12    17    45     3    20     3    NA    11    14    17     3    NA     7     6
##  9 got       32     6     6     6     7     1    10     4     2     5     7     2     1     4     3     2    13     3     7     4    NA     3     4     4     2    NA     1     6
## 10 had       32    13    13    22    27    15    13     8    17     9     8    17    12    21    10    11    45     9    19     6    NA    15    10     5    12    NA    12     5
## # ... with 7,764 more rows, and 7 more variables: `29` <int>, `30` <int>, `31` <int>, `32` <int>, `33` <int>, `34` <int>, `35` <int>

Word Frequency Per Chapter and Book
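
A sketch, computing each word’s count within its chapter and across the whole book (object names are assumptions):

chapter_counts <- words %>%
  count(Chapter, word, name = "Chapter_Total")

book_counts <- words %>%
  count(word, name = "Book_Total")

# The left join prints the Joining message shown below.
word_freqs <- chapter_counts %>%
  left_join(book_counts) %>%
  arrange(desc(Chapter_Total))
word_freqs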

## Joining, by = "word"
## # A tibble: 24,205 x 4
##    Chapter word  Chapter_Total Book_Total
##      <dbl> <chr>         <int>      <int>
##  1      16 the             275       3708
##  2      16 and             269       3059
##  3      35 the             245       3708
##  4      33 the             205       3708
##  5      30 the             187       3708
##  6       4 the             185       3708
##  7       5 the             167       3708
##  8       4 and             166       3059
##  9      29 the             166       3708
## 10      21 the             162       3708
## # ... with 24,195 more rows

Create TF-IDF
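
A sketch using tidytext’s bind_tf_idf(), which treats each chapter as a document; judging by the output, stop words were removed first:

word_freqs %>%
  anti_join(stop_words) %>%                      # prints the Joining message
  bind_tf_idf(word, Chapter, Chapter_Total) %>%  # adds tf, idf, and tf_idf
  arrange(desc(tf_idf))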

## Joining, by = "word"
## # A tibble: 19,903 x 7
##    Chapter word   Chapter_Total Book_Total      tf   idf tf_idf
##      <dbl> <chr>          <int>      <int>   <dbl> <dbl>  <dbl>
##  1      25 193k               1          2 0.5      2.83 1.42  
##  2      20 178k               1          3 0.5      2.43 1.21  
##  3      20 t                  1          4 0.5      2.43 1.21  
##  4      25 t                  1          4 0.5      2.43 1.21  
##  5      35 works             32         33 0.00636  2.83 0.0180
##  6      19 auntie             6         17 0.00754  2.14 0.0161
##  7      34 jones              5          9 0.00563  2.83 0.0160
##  8      32 cave               9         41 0.00863  1.58 0.0136
##  9      35 e                 23         24 0.00457  2.83 0.0129
## 10      11 potter            10         39 0.00667  1.92 0.0128
## # ... with 19,893 more rows

6.10 Topic Modeling

Create Document Term Matrix
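
A sketch; cast_dtm() converts a tidy count table into the DocumentTermMatrix that topicmodels expects, with each chapter treated as a document:

library(topicmodels)

book_dtm <- words %>%
  anti_join(stop_words) %>%   # prints the Joining message
  count(Chapter, word) %>%
  cast_dtm(Chapter, word, n)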

## Joining, by = "word"

Create a reproducible example with two topics
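
A sketch; fixing the seed makes the fit reproducible (the seed value is an assumption):

book_lda <- LDA(book_dtm, k = 2, control = list(seed = 1234))
book_lda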

## A LDA_VEM topic model with 2 topics.

Extract the topics and the ‘beta’ of each topic.

Beta represents topic-word density: within each topic, how prevalent is this word? Higher beta means the word is more characteristic of the topic; lower beta means less so.
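
A sketch of the extraction; tidy() returns one row per topic-term pair:

book_topics <- tidy(book_lda, matrix = "beta")
book_topics

# The same table with the most probable words first within each topic.
book_topics %>% arrange(topic, desc(beta))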

## # A tibble: 3,526 x 3
##    topic term            beta
##    <int> <chr>          <dbl>
##  1     1 advantage   3.10e- 4
##  2     2 advantage   9.02e-79
##  3     1 adventurous 1.03e- 4
##  4     2 adventurous 7.65e-79
##  5     1 afraid      1.95e- 3
##  6     2 afraid      2.91e- 3
##  7     1 arrest      1.03e- 4
##  8     2 arrest      3.63e-79
##  9     1 astronomer  4.14e- 4
## 10     2 astronomer  4.49e-78
## # ... with 3,516 more rows
## # A tibble: 3,526 x 3
##    topic term      beta
##    <int> <chr>    <dbl>
##  1     1 good   0.0333 
##  2     1 boy    0.0176 
##  3     1 money  0.0152 
##  4     1 found  0.0119 
##  5     1 aunt   0.0116 
##  6     1 white  0.0105 
##  7     1 time   0.0104 
##  8     1 awful  0.00809
##  9     1 tree   0.00797
## 10     1 mother 0.00794
## # ... with 3,516 more rows

Top Terms
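
A sketch of the ten highest-beta terms per topic:

book_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, desc(beta))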

## # A tibble: 20 x 3
##    topic term      beta
##    <int> <chr>    <dbl>
##  1     1 good   0.0333 
##  2     1 boy    0.0176 
##  3     1 money  0.0152 
##  4     1 found  0.0119 
##  5     1 aunt   0.0116 
##  6     1 white  0.0105 
##  7     1 time   0.0104 
##  8     1 awful  0.00809
##  9     1 tree   0.00797
## 10     1 mother 0.00794
## 11     2 good   0.0279 
## 12     2 found  0.0209 
## 13     2 hope   0.0181 
## 14     2 awful  0.0162 
## 15     2 boy    0.0106 
## 16     2 time   0.0105 
## 17     2 muff   0.00911
## 18     2 devil  0.00911
## 19     2 young  0.00895
## 20     2 murder 0.00820

6.10.1 Comparison of Use Between Topics
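
The log_ratio column below appears to be the base-2 log of the ratio of the two topics’ betas: large positive values mark words nearly exclusive to topic 2, large negative values words nearly exclusive to topic 1. A sketch:

beta_spread <- book_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  pivot_wider(names_from = topic, values_from = beta) %>%
  mutate(log_ratio = log2(topic2 / topic1))

beta_spread %>% slice_max(log_ratio, n = 10)  # words favoring topic 2
beta_spread %>% slice_min(log_ratio, n = 10)  # words favoring topic 1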

## # A tibble: 10 x 4
##    term          topic1  topic2 log_ratio
##    <chr>          <dbl>   <dbl>     <dbl>
##  1 outburst    4.53e-85 0.00106      270.
##  2 including   9.30e-85 0.00137      270.
##  3 freely      1.25e-84 0.00182      270.
##  4 providing   2.22e-84 0.00243      269.
##  5 fee         2.40e-84 0.00243      269.
##  6 worry       1.29e-84 0.00122      269.
##  7 damages     1.54e-84 0.00122      269.
##  8 agreement   8.21e-84 0.00547      268.
##  9 provide     3.84e-84 0.00213      268.
## 10 information 2.79e-84 0.00122      268.
## # A tibble: 10 x 4
##    term          topic1   topic2 log_ratio
##    <chr>          <dbl>    <dbl>     <dbl>
##  1 labor        0.00259 9.14e-79     -251.
##  2 worship      0.00103 3.53e-78     -247.
##  3 owing        0.00290 1.88e-77     -246.
##  4 cutting      0.00103 7.69e-78     -246.
##  5 highest      0.00124 1.96e-77     -245.
##  6 comrade      0.00166 3.32e-77     -245.
##  7 music        0.00155 6.73e-77     -244.
##  8 grim         0.00124 6.40e-77     -243.
##  9 difficulty   0.00166 1.14e-76     -243.
## 10 indifference 0.00103 1.69e-76     -242.

“Gamma”: From the documentation:

Each of these values is an estimated proportion of words from that document that are generated from that topic.

For example, the model estimates that about 41.7% of the words in document 6 were generated from topic 1, and about 58.3% from topic 2.
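
A sketch of the extraction, followed by a look at document 6:

# One row per document-topic pair; gamma is the estimated proportion of
# the document's words generated by that topic.
book_gamma <- tidy(book_lda, matrix = "gamma")
book_gamma

# Document 6 is the most evenly split between the two topics.
book_gamma %>% filter(document == "6")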

## # A tibble: 64 x 3
##    document topic     gamma
##    <chr>    <int>     <dbl>
##  1 1            1 1.00     
##  2 1            2 0.0000677
##  3 2            1 1.00     
##  4 2            2 0.000114 
##  5 3            1 1.00     
##  6 3            2 0.0000601
##  7 4            1 1.00     
##  8 4            2 0.0000469
##  9 5            1 1.00     
## 10 5            2 0.0000702
## # ... with 54 more rows
## # A tibble: 2 x 3
##   document topic gamma
##   <chr>    <int> <dbl>
## 1 6            1 0.417
## 2 6            2 0.583