Text Analysis in R

class: center, middle, title-slide

# Text Analysis in R
## Special Christmas Edition
### Emil Hvitfeldt
### 2018-10-29

---

# What is Natural Language Processing (NLP)

Using computers to extract insights and make decision based on human languages.

- Information extraction
- Machine translation
- Speech processing
- Image understanding

???

https://www.ling.upenn.edu/~beatrice/humor/headlines.html

Not Computational linguistics which is using computers to reason about human langauges

---

# Plan of Action

## Text mining

We will be doing text mining as exploratory data analysis

## Modeling

Apply some simple models to make decisions only based on text

---

background-image: url("http://3.bp.blogspot.com/-FDjkfNyHOCA/UNNVcB9eBsI/AAAAAAAABp4/fivfG6senDU/s1600/fir-tree1.jpg")
background-position: 90% 35%
background-size: 20% 30%

# The Data

## Text mining

The Fir Tree  
H.C. Andersen  
21 December 1844  
EK (**e**ventyr**k**ode / fairly tale code) = 26

Fairly early, one of first works displaying pessimism.

???

Types of data

- Strings
- Document term matrix
- Corpus

---

# The Data

## Modeling

Movie reviews from [IMDb.com](IMDb.com)

2 Movies

---

.pull-left[
![](https://m.media-amazon.com/images/M/MV5BMzUxNzkzMzQtYjIxZC00NzU0LThkYTQtZjNhNTljMTA1MDA1L2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_.jpg)
]

.pull-right[
![](https://m.media-amazon.com/images/M/MV5BNWNiNTczNzEtMjQyZC00MjFmLTkzMDMtODk4ZGMyZmE0N2E4XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_.jpg)
]

---

# Finding gold

.medium[

```r
#devtools::install_github("emilhvitfeldt/hcandersenr")
library(hcandersenr)
```
]

Includes most of H.C. Andersens 157 fairly tales in 5 languages (Danish, German, English, Spanish, French).

.medium[

```r
library(dplyr)
```

```
## 
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
## 
##     filter, lag
```

```
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```

```r
library(tidyr)
tree <- hcandersen_en %>%
  filter(book == "The fir tree") %>%
  select(text)
```
]

---

# The Text

.smallish[

```r
tree
```

```
## # A tibble: 253 x 1
##    text                                                                         
##    <chr>                                                                        
##  1 "Far down in the forest, where the warm sun and the fresh air made a sweet"  
##  2 "resting-place, grew a pretty little fir-tree; and yet it was not happy, it" 
##  3 "wished so much to be tall like its companions– the pines and firs which gre…
##  4 "around it. The sun shone, and the soft air fluttered its leaves, and the"   
##  5 "little peasant children passed by, prattling merrily, but the fir-tree heed…
##  6 "them not. Sometimes the children would bring a large basket of raspberries …
##  7 "strawberries, wreathed on a straw, and seat themselves near the fir-tree, a…
##  8 "say, \"Is it not a pretty little tree?\" which made it feel more unhappy th…
##  9 "before."                                                                    
## 10 "And yet all this while the tree grew a notch or joint taller every year; fo…
## # … with 243 more rows
```
]

---

.pull-left[

```r
library(tidytext)

*unnest_tokens(tree, word, text)
```

```
## # A tibble: 3,288 x 1
##    word  
##    <chr> 
##  1 far   
##  2 down  
##  3 in    
##  4 the   
##  5 forest
##  6 where 
##  7 the   
##  8 warm  
##  9 sun   
## 10 and   
## # … with 3,278 more rows
```
]

.pull-right[
Observational unit:

- Document
- Sentence
- Word
- Letter
]

---

.pull-left[

```r
library(tidytext)

*unnest_tokens(tree, word, text)
```

.pull-right[
Observational unit:

- ~~Document~~
- ~~Sentence~~
- **Word**
- ~~Letter~~

Word tokens are default in `unnest_tokens()`
]

---

```r
library(tidytext)

unnest_tokens(tree, word, text) %>%
* count(word, sort = TRUE)
```

```
## # A tibble: 745 x 2
##    word      n
##    <chr> <int>
##  1 the     278
##  2 and     161
##  3 tree     76
##  4 it       66
##  5 a        56
##  6 in       55
##  7 of       54
##  8 to       54
##  9 i        53
## 10 was      39
## # … with 735 more rows
```

---

```r
library(tidytext)

unnest_tokens(tree, word, text) %>%
* count(word, sort = TRUE)
```

```
## # A tibble: 745 x 2
##    word      n
##    <chr> <int>
*##  1 the     278
*##  2 and     161
##  3 tree     76
*##  4 it       66
*##  5 a        56
*##  6 in       55
*##  7 of       54
*##  8 to       54
*##  9 i        53
*## 10 was      39
## # … with 735 more rows
```

These words don't give us much context

---

# Stop words

Stop words or "non-context" words are words that doens't add much to the text (filler that make sentences work).

```r
stop_words$word
```

```
##    [1] "a"             "a's"           "able"          "about"        
##    [5] "above"         "according"     "accordingly"   "across"       
##    [9] "actually"      "after"         "afterwards"    "again"        
##   [13] "against"       "ain't"         "all"           "allow"        
##   [17] "allows"        "almost"        "alone"         "along"        
##   [21] "already"       "also"          "although"      "always"       
##   [25] "am"            "among"         "amongst"       "an"           
##   [29] "and"           "another"       "any"           "anybody"      
##   [33] "anyhow"        "anyone"        "anything"      "anyway"       
...
```

---

# Stop words

Stop words or "non-context" words are words that doens't add much to the text (filler that make sentences work).

Don't remove stop words willy nilly!

Pre-constructed word list might not work in your domain

Creating your own word list is hard...

???

Lean
computer
old

---

```r
unnest_tokens(tree, word, text) %>%
* inner_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  top_n(50, n) %>%
  pull(word)
```

```
##  [1] "the"   "and"   "it"    "a"     "in"    "of"    "to"    "i"     "was"  
## [10] "they"  "were"  "all"   "with"  "but"   "on"    "then"  "had"   "is"   
## [19] "at"    "not"   "what"  "as"    "that"  "he"    "would" "you"   "its"  
## [28] "out"   "so"    "be"    "them"  "this"  "down"  "said"  "for"   "have" 
## [37] "how"   "we"    "been"  "very"  "when"  "where" "who"   "or"    "there"
## [46] "while" "came"  "now"   "one"   "after" "by"    "my"    "only"  "their"
## [55] "which"
```

Look at the words you remove (easy)

or know you stop word by heart (hard!!!)

---

```r
unnest_tokens(tree, word, text) %>%
* anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```

```
## # A tibble: 500 x 2
##    word         n
##    <chr>    <int>
##  1 tree        76
##  2 fir         33
##  3 branches    14
##  4 story       14
##  5 forest      13
##  6 mice        12
##  7 trees       12
##  8 children    10
##  9 dumpty      10
## 10 humpty      10
## # … with 490 more rows
```

---

```r
library(ggplot2)
unnest_tokens(tree, word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  top_n(10, n) %>%
  ggplot(aes(reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  theme_light() +
  labs(x = "Times",
       y = "Word",
       title = "Word frequency in 'The Fur Tree'")
```

---

---

```r
unnest_tokens(tree, word, text) %>%
  mutate(pos = row_number(),
         place = word == "story") %>%
  filter(place) %>%
  ggplot(aes(pos, place)) +
  geom_point() +
  theme_light() +
  labs(x = "Position",
       y = "",
       title = "Occurence plot of word 'story'")
```

---

---

```r
unnest_tokens(tree, word, text) %>%
  mutate(pos = row_number(),
         place = case_when(word == "story" ~ "story",
                           word %in% c("tree", "trees") ~ "tree",
                           word == "mice" ~ "mice",
                           word == "children" ~ "children",
                           word == "forest" ~ "forest",
                           TRUE ~ NA_character_)) %>%
  drop_na() %>%
  ggplot(aes(pos, place, color = place)) +
  geom_point() +
  theme_light() +
  guides(color = "none") +
  labs(x = "Position",
       y = "Word",
       title = "Occurence plot of 'The Fur Tree'")
```

---

---

## Going to the movies

Scraped review (scraping_reviews.Rmd)

```r
library(readr)
library(tidyr)
reviews <- read_csv("review_data.csv") %>%
  select(movie, rating, review) %>%
  drop_na()
```

.smallish[

```
## # A tibble: 703 x 3
##    movie rating review                                                          
##    <chr>  <dbl> <chr>                                                           
##  1 elf        9 "This was my 4th viewing and now ready to review.The story is i…
##  2 elf       10 "This was one of my favorite movies in 2003. Will Ferrell is br…
##  3 elf        7 "Charming Christmas confection--about a boy raised by elves in …
##  4 elf       10 "Will Ferrell does a great job here, and it's the perfect Chris…
##  5 elf       10 "It was so nice to be able to sit down and watch a movie featur…
##  6 elf        6 "This is a good natured holiday comedy that definitely is in th…
##  7 elf        8 "BEWARE OF BOGUS REVIEWS. SOME REVIEWERS HAVE ONLY ONE REVIEW T…
##  8 elf        6 "A nice Christmas-themed movie. Nice is very apt for this title…
##  9 elf        9 "The most helpful thing I can tell you about this movie is that…
## 10 elf        8 "I do love Christmas movies, always a sucker for them even if t…
## # … with 693 more rows
```
]

---

```r
ggplot(reviews, aes(as.factor(rating), 1, fill = movie)) +
  geom_col() +
  facet_grid(~ movie) +
  theme_minimal() +
  labs(x = "Rating",
       y = "Count") +
  guides(fill = "none")
```

---

## tidymodels

```r
library(tidymodels)
```

```
## ── Attaching packages ──────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ──
```

```
## ✓ broom     0.5.5     ✓ rsample   0.0.5
## ✓ dials     0.0.4     ✓ tibble    2.1.3
## ✓ infer     0.5.1     ✓ tune      0.0.1
## ✓ parsnip   0.0.5     ✓ workflows 0.1.0
## ✓ purrr     0.3.3     ✓ yardstick 0.0.5
## ✓ recipes   0.1.9
```

```
## ── Conflicts ─────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard()  masks scales::discard()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
## x dials::margin()   masks ggplot2::margin()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
```

```r
#devtools::install_github("tidymodels/textrecipes")
library(textrecipes)
```

`tidymodels` is a "meta-package" for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse.

`textrecipes` is an addition to the recipes package providing text processing capabilities (coming to CRAN any day)

---

.medium[

```r
set.seed(2018)

split <- reviews %>%
  mutate(good = factor(rating > 5, labels = c("bad", "good"))) %>%
  select(good, text = review) %>%
  initial_split(props = 7 / 10)

review_train <- training(split)
review_test  <- testing(split)
```
]

Splitting data into training and testing set

Next we specify a preprocesing step using recipes

---

## What do we measure?

![](https://user-images.githubusercontent.com/6179259/47669547-78c9f180-dbee-11e8-85e8-e01cb4cbe46d.png)

---

.medium[

```r
review_rec <- recipe(good ~ ., data = review_train) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 500) %>%
  step_tfidf(text) %>%
  prep(training = review_train)

review_rec
```
]

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          1
## 
## Training data contained 528 data points and no missing data.
## 
## Operations:
## 
## Tokenization for text [trained]
## Text filtering for text [trained]
## Term frequency with text [trained]
```

---

```r
# Processed data
train_data <- juice(review_rec)
test_data  <- bake(review_rec, review_test)
```

```r
train_data
```

```
## # A tibble: 528 x 501
##    good  tf_text_1 tf_text_10 tf_text_2 tf_text_5 tf_text_6 tf_text_8 tf_text_a
##    <fct>     <dbl>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1 good          0          0         0         0         0         0         4
##  2 good          0          0         0         0         0         0         2
##  3 good          0          0         0         0         0         0         7
##  4 good          0          0         0         0         0         0         7
##  5 good          0          0         0         0         0         0         8
##  6 good          0          0         0         1         0         0        21
##  7 good          0          0         0         0         0         0         5
##  8 good          0          0         0         0         0         0         6
##  9 good          0          0         0         1         0         0         8
## 10 good          0          1         0         0         0         0        22
## # … with 518 more rows, and 493 more variables: tf_text_about <dbl>,
## #   tf_text_absolutely <dbl>, tf_text_act <dbl>, tf_text_acting <dbl>,
## #   tf_text_action <dbl>, tf_text_actor <dbl>, tf_text_actors <dbl>,
## #   tf_text_actually <dbl>, tf_text_adaptation <dbl>, tf_text_added <dbl>,
## #   tf_text_adult <dbl>, tf_text_adults <dbl>, tf_text_after <dbl>,
## #   tf_text_again <dbl>, tf_text_ages <dbl>, tf_text_all <dbl>,
## #   tf_text_almost <dbl>, tf_text_along <dbl>, tf_text_already <dbl>,
## #   tf_text_also <dbl>, tf_text_although <dbl>, tf_text_always <dbl>,
## #   tf_text_am <dbl>, tf_text_amazing <dbl>, tf_text_an <dbl>,
## #   tf_text_and <dbl>, tf_text_animated <dbl>, tf_text_another <dbl>,
## #   tf_text_anthony <dbl>, tf_text_any <dbl>, tf_text_anyone <dbl>,
## #   tf_text_anything <dbl>, tf_text_are <dbl>, tf_text_around <dbl>,
## #   tf_text_as <dbl>, tf_text_asner <dbl>, tf_text_at <dbl>,
## #   tf_text_audience <dbl>, tf_text_away <dbl>, tf_text_baby <dbl>,
## #   tf_text_back <dbl>, tf_text_bad <dbl>, tf_text_based <dbl>,
## #   tf_text_be <dbl>, tf_text_beautiful <dbl>, tf_text_because <dbl>,
## #   tf_text_become <dbl>, tf_text_been <dbl>, tf_text_before <dbl>,
## #   tf_text_being <dbl>, tf_text_believe <dbl>, tf_text_best <dbl>,
## #   tf_text_better <dbl>, tf_text_between <dbl>, tf_text_big <dbl>,
## #   tf_text_bit <dbl>, tf_text_bob <dbl>, tf_text_book <dbl>,
## #   tf_text_both <dbl>, tf_text_brings <dbl>, tf_text_brought <dbl>,
## #   tf_text_buddy <dbl>, `tf_text_buddy's` <dbl>, tf_text_but <dbl>,
## #   tf_text_by <dbl>, tf_text_caan <dbl>, tf_text_came <dbl>,
## #   tf_text_can <dbl>, `tf_text_can't` <dbl>, tf_text_carey <dbl>,
## #   tf_text_carrey <dbl>, `tf_text_carrey's` <dbl>, tf_text_cartoon <dbl>,
## #   tf_text_cast <dbl>, tf_text_character <dbl>, tf_text_characters <dbl>,
## #   tf_text_charming <dbl>, tf_text_child <dbl>, tf_text_childhood <dbl>,
## #   tf_text_children <dbl>, `tf_text_children's` <dbl>,
## #   tf_text_christmas <dbl>, tf_text_cindy <dbl>, tf_text_city <dbl>,
## #   tf_text_classic <dbl>, tf_text_claus <dbl>, tf_text_come <dbl>,
## #   tf_text_comedy <dbl>, tf_text_comes <dbl>, tf_text_completely <dbl>,
## #   tf_text_costumes <dbl>, tf_text_could <dbl>, `tf_text_couldn't` <dbl>,
## #   tf_text_course <dbl>, tf_text_cute <dbl>, tf_text_dad <dbl>,
## #   tf_text_day <dbl>, tf_text_definitely <dbl>, tf_text_department <dbl>,
## #   tf_text_deschanel <dbl>, …
```

---

```r
review_model <- logistic_reg() %>%
  set_engine("glm")
```

```r
review_fit <- review_model %>%
  fit(good ~ ., data = train_data)
```

```r
test_results <- review_test %>%
  bind_cols(
   predict(review_fit, test_data)
  )

test_results %>% accuracy(truth = good, estimate = .pred_class)
```

```
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.583
```

(This is not an example of finished classification. The Accuracy is not that good, but the general steps you would follow.)