class: center, middle, title-slide # Text Analysis in R ## Special Christmas Edition ### Emil Hvitfeldt ### 2018-10-29 --- # What is Natural Language Processing (NLP) Using computers to extract insights and make decision based on human languages. - Information extraction - Machine translation - Speech processing - Image understanding ??? https://www.ling.upenn.edu/~beatrice/humor/headlines.html Not Computational linguistics which is using computers to reason about human langauges --- # Plan of Action ## Text mining We will be doing text mining as exploratory data analysis ## Modeling Apply some simple models to make decisions only based on text --- background-image: url("http://3.bp.blogspot.com/-FDjkfNyHOCA/UNNVcB9eBsI/AAAAAAAABp4/fivfG6senDU/s1600/fir-tree1.jpg") background-position: 90% 35% background-size: 20% 30% # The Data ## Text mining The Fir Tree H.C. Andersen 21 December 1844 EK (**e**ventyr**k**ode / fairly tale code) = 26 Fairly early, one of first works displaying pessimism. ??? Types of data - Strings - Document term matrix - Corpus --- # The Data ## Modeling Movie reviews from [IMDb.com](IMDb.com) 2 Movies --- .pull-left[ ![](https://m.media-amazon.com/images/M/MV5BMzUxNzkzMzQtYjIxZC00NzU0LThkYTQtZjNhNTljMTA1MDA1L2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_.jpg) ] .pull-right[ ![](https://m.media-amazon.com/images/M/MV5BNWNiNTczNzEtMjQyZC00MjFmLTkzMDMtODk4ZGMyZmE0N2E4XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_.jpg) ] --- # Finding gold .medium[ ```r #devtools::install_github("emilhvitfeldt/hcandersenr") library(hcandersenr) ``` ] Includes most of H.C. Andersens 157 fairly tales in 5 languages (Danish, German, English, Spanish, French). -- .medium[ ```r library(dplyr) ``` ``` ## ## Attaching package: 'dplyr' ``` ``` ## The following objects are masked from 'package:stats': ## ## filter, lag ``` ``` ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union ``` ```r library(tidyr) tree <- hcandersen_en %>% filter(book == "The fir tree") %>% select(text) ``` ] --- # The Text .smallish[ ```r tree ``` ``` ## # A tibble: 253 x 1 ## text ## <chr> ## 1 "Far down in the forest, where the warm sun and the fresh air made a sweet" ## 2 "resting-place, grew a pretty little fir-tree; and yet it was not happy, it" ## 3 "wished so much to be tall like its companions– the pines and firs which gre… ## 4 "around it. The sun shone, and the soft air fluttered its leaves, and the" ## 5 "little peasant children passed by, prattling merrily, but the fir-tree heed… ## 6 "them not. Sometimes the children would bring a large basket of raspberries … ## 7 "strawberries, wreathed on a straw, and seat themselves near the fir-tree, a… ## 8 "say, \"Is it not a pretty little tree?\" which made it feel more unhappy th… ## 9 "before." ## 10 "And yet all this while the tree grew a notch or joint taller every year; fo… ## # … with 243 more rows ``` ] --- .pull-left[ ```r library(tidytext) *unnest_tokens(tree, word, text) ``` ``` ## # A tibble: 3,288 x 1 ## word ## <chr> ## 1 far ## 2 down ## 3 in ## 4 the ## 5 forest ## 6 where ## 7 the ## 8 warm ## 9 sun ## 10 and ## # … with 3,278 more rows ``` ] .pull-right[ Observational unit: - Document - Sentence - Word - Letter ] --- .pull-left[ ```r library(tidytext) *unnest_tokens(tree, word, text) ``` ``` ## # A tibble: 3,288 x 1 ## word ## <chr> ## 1 far ## 2 down ## 3 in ## 4 the ## 5 forest ## 6 where ## 7 the ## 8 warm ## 9 sun ## 10 and ## # … with 3,278 more rows ``` ] .pull-right[ Observational unit: - ~~Document~~ - ~~Sentence~~ - **Word** - ~~Letter~~ Word tokens are default in `unnest_tokens()` ] --- ```r library(tidytext) unnest_tokens(tree, word, text) %>% * count(word, sort = TRUE) ``` ``` ## # A tibble: 745 x 2 ## word n ## <chr> <int> ## 1 the 278 ## 2 and 161 ## 3 tree 76 ## 4 it 66 ## 5 a 56 ## 6 in 55 ## 7 of 54 ## 8 to 54 ## 9 i 53 ## 10 was 39 ## # … with 735 more rows ``` --- ```r library(tidytext) unnest_tokens(tree, word, text) %>% * count(word, sort = TRUE) ``` ``` ## # A tibble: 745 x 2 ## word n ## <chr> <int> *## 1 the 278 *## 2 and 161 ## 3 tree 76 *## 4 it 66 *## 5 a 56 *## 6 in 55 *## 7 of 54 *## 8 to 54 *## 9 i 53 *## 10 was 39 ## # … with 735 more rows ``` -- These words don't give us much context --- # Stop words Stop words or "non-context" words are words that doens't add much to the text (filler that make sentences work). ```r stop_words$word ``` ``` ## [1] "a" "a's" "able" "about" ## [5] "above" "according" "accordingly" "across" ## [9] "actually" "after" "afterwards" "again" ## [13] "against" "ain't" "all" "allow" ## [17] "allows" "almost" "alone" "along" ## [21] "already" "also" "although" "always" ## [25] "am" "among" "amongst" "an" ## [29] "and" "another" "any" "anybody" ## [33] "anyhow" "anyone" "anything" "anyway" ... ``` --- # Stop words Stop words or "non-context" words are words that doens't add much to the text (filler that make sentences work). Don't remove stop words willy nilly! Pre-constructed word list might not work in your domain Creating your own word list is hard... ??? Lean computer old --- ```r unnest_tokens(tree, word, text) %>% * inner_join(stop_words, by = "word") %>% count(word, sort = TRUE) %>% top_n(50, n) %>% pull(word) ``` ``` ## [1] "the" "and" "it" "a" "in" "of" "to" "i" "was" ## [10] "they" "were" "all" "with" "but" "on" "then" "had" "is" ## [19] "at" "not" "what" "as" "that" "he" "would" "you" "its" ## [28] "out" "so" "be" "them" "this" "down" "said" "for" "have" ## [37] "how" "we" "been" "very" "when" "where" "who" "or" "there" ## [46] "while" "came" "now" "one" "after" "by" "my" "only" "their" ## [55] "which" ``` Look at the words you remove (easy) or know you stop word by heart (hard!!!) --- ```r unnest_tokens(tree, word, text) %>% * anti_join(stop_words, by = "word") %>% count(word, sort = TRUE) ``` ``` ## # A tibble: 500 x 2 ## word n ## <chr> <int> ## 1 tree 76 ## 2 fir 33 ## 3 branches 14 ## 4 story 14 ## 5 forest 13 ## 6 mice 12 ## 7 trees 12 ## 8 children 10 ## 9 dumpty 10 ## 10 humpty 10 ## # … with 490 more rows ``` --- ```r library(ggplot2) unnest_tokens(tree, word, text) %>% anti_join(stop_words, by = "word") %>% count(word, sort = TRUE) %>% top_n(10, n) %>% ggplot(aes(reorder(word, n), n)) + geom_col() + coord_flip() + theme_light() + labs(x = "Times", y = "Word", title = "Word frequency in 'The Fur Tree'") ``` --- <img src="index_files/figure-html/unnamed-chunk-13-1.png" width="100%" /> --- ```r unnest_tokens(tree, word, text) %>% mutate(pos = row_number(), place = word == "story") %>% filter(place) %>% ggplot(aes(pos, place)) + geom_point() + theme_light() + labs(x = "Position", y = "", title = "Occurence plot of word 'story'") ``` --- <img src="index_files/figure-html/unnamed-chunk-15-1.png" width="90%" /> --- ```r unnest_tokens(tree, word, text) %>% mutate(pos = row_number(), place = case_when(word == "story" ~ "story", word %in% c("tree", "trees") ~ "tree", word == "mice" ~ "mice", word == "children" ~ "children", word == "forest" ~ "forest", TRUE ~ NA_character_)) %>% drop_na() %>% ggplot(aes(pos, place, color = place)) + geom_point() + theme_light() + guides(color = "none") + labs(x = "Position", y = "Word", title = "Occurence plot of 'The Fur Tree'") ``` --- <img src="index_files/figure-html/unnamed-chunk-17-1.png" width="90%" /> --- ## Going to the movies Scraped review (scraping_reviews.Rmd) ```r library(readr) library(tidyr) reviews <- read_csv("review_data.csv") %>% select(movie, rating, review) %>% drop_na() ``` .smallish[ ``` ## # A tibble: 703 x 3 ## movie rating review ## <chr> <dbl> <chr> ## 1 elf 9 "This was my 4th viewing and now ready to review.The story is i… ## 2 elf 10 "This was one of my favorite movies in 2003. Will Ferrell is br… ## 3 elf 7 "Charming Christmas confection--about a boy raised by elves in … ## 4 elf 10 "Will Ferrell does a great job here, and it's the perfect Chris… ## 5 elf 10 "It was so nice to be able to sit down and watch a movie featur… ## 6 elf 6 "This is a good natured holiday comedy that definitely is in th… ## 7 elf 8 "BEWARE OF BOGUS REVIEWS. SOME REVIEWERS HAVE ONLY ONE REVIEW T… ## 8 elf 6 "A nice Christmas-themed movie. Nice is very apt for this title… ## 9 elf 9 "The most helpful thing I can tell you about this movie is that… ## 10 elf 8 "I do love Christmas movies, always a sucker for them even if t… ## # … with 693 more rows ``` ] --- ```r ggplot(reviews, aes(as.factor(rating), 1, fill = movie)) + geom_col() + facet_grid(~ movie) + theme_minimal() + labs(x = "Rating", y = "Count") + guides(fill = "none") ``` <img src="index_files/figure-html/unnamed-chunk-20-1.png" width="90%" /> --- ## tidymodels ```r library(tidymodels) ``` ``` ## ── Attaching packages ──────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ── ``` ``` ## ✓ broom 0.5.5 ✓ rsample 0.0.5 ## ✓ dials 0.0.4 ✓ tibble 2.1.3 ## ✓ infer 0.5.1 ✓ tune 0.0.1 ## ✓ parsnip 0.0.5 ✓ workflows 0.1.0 ## ✓ purrr 0.3.3 ✓ yardstick 0.0.5 ## ✓ recipes 0.1.9 ``` ``` ## ── Conflicts ─────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ── ## x purrr::discard() masks scales::discard() ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() ## x dials::margin() masks ggplot2::margin() ## x yardstick::spec() masks readr::spec() ## x recipes::step() masks stats::step() ``` ```r #devtools::install_github("tidymodels/textrecipes") library(textrecipes) ``` `tidymodels` is a "meta-package" for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse. `textrecipes` is an addition to the recipes package providing text processing capabilities (coming to CRAN any day) --- .medium[ ```r set.seed(2018) split <- reviews %>% mutate(good = factor(rating > 5, labels = c("bad", "good"))) %>% select(good, text = review) %>% initial_split(props = 7 / 10) review_train <- training(split) review_test <- testing(split) ``` ] Splitting data into training and testing set Next we specify a preprocesing step using recipes --- ## What do we measure? ![](https://user-images.githubusercontent.com/6179259/47669547-78c9f180-dbee-11e8-85e8-e01cb4cbe46d.png) --- .medium[ ```r review_rec <- recipe(good ~ ., data = review_train) %>% step_tokenize(text) %>% step_tokenfilter(text, max_tokens = 500) %>% step_tfidf(text) %>% prep(training = review_train) review_rec ``` ] ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 1 ## ## Training data contained 528 data points and no missing data. ## ## Operations: ## ## Tokenization for text [trained] ## Text filtering for text [trained] ## Term frequency with text [trained] ``` --- ```r # Processed data train_data <- juice(review_rec) test_data <- bake(review_rec, review_test) ``` ```r train_data ``` ``` ## # A tibble: 528 x 501 ## good tf_text_1 tf_text_10 tf_text_2 tf_text_5 tf_text_6 tf_text_8 tf_text_a ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 good 0 0 0 0 0 0 4 ## 2 good 0 0 0 0 0 0 2 ## 3 good 0 0 0 0 0 0 7 ## 4 good 0 0 0 0 0 0 7 ## 5 good 0 0 0 0 0 0 8 ## 6 good 0 0 0 1 0 0 21 ## 7 good 0 0 0 0 0 0 5 ## 8 good 0 0 0 0 0 0 6 ## 9 good 0 0 0 1 0 0 8 ## 10 good 0 1 0 0 0 0 22 ## # … with 518 more rows, and 493 more variables: tf_text_about <dbl>, ## # tf_text_absolutely <dbl>, tf_text_act <dbl>, tf_text_acting <dbl>, ## # tf_text_action <dbl>, tf_text_actor <dbl>, tf_text_actors <dbl>, ## # tf_text_actually <dbl>, tf_text_adaptation <dbl>, tf_text_added <dbl>, ## # tf_text_adult <dbl>, tf_text_adults <dbl>, tf_text_after <dbl>, ## # tf_text_again <dbl>, tf_text_ages <dbl>, tf_text_all <dbl>, ## # tf_text_almost <dbl>, tf_text_along <dbl>, tf_text_already <dbl>, ## # tf_text_also <dbl>, tf_text_although <dbl>, tf_text_always <dbl>, ## # tf_text_am <dbl>, tf_text_amazing <dbl>, tf_text_an <dbl>, ## # tf_text_and <dbl>, tf_text_animated <dbl>, tf_text_another <dbl>, ## # tf_text_anthony <dbl>, tf_text_any <dbl>, tf_text_anyone <dbl>, ## # tf_text_anything <dbl>, tf_text_are <dbl>, tf_text_around <dbl>, ## # tf_text_as <dbl>, tf_text_asner <dbl>, tf_text_at <dbl>, ## # tf_text_audience <dbl>, tf_text_away <dbl>, tf_text_baby <dbl>, ## # tf_text_back <dbl>, tf_text_bad <dbl>, tf_text_based <dbl>, ## # tf_text_be <dbl>, tf_text_beautiful <dbl>, tf_text_because <dbl>, ## # tf_text_become <dbl>, tf_text_been <dbl>, tf_text_before <dbl>, ## # tf_text_being <dbl>, tf_text_believe <dbl>, tf_text_best <dbl>, ## # tf_text_better <dbl>, tf_text_between <dbl>, tf_text_big <dbl>, ## # tf_text_bit <dbl>, tf_text_bob <dbl>, tf_text_book <dbl>, ## # tf_text_both <dbl>, tf_text_brings <dbl>, tf_text_brought <dbl>, ## # tf_text_buddy <dbl>, `tf_text_buddy's` <dbl>, tf_text_but <dbl>, ## # tf_text_by <dbl>, tf_text_caan <dbl>, tf_text_came <dbl>, ## # tf_text_can <dbl>, `tf_text_can't` <dbl>, tf_text_carey <dbl>, ## # tf_text_carrey <dbl>, `tf_text_carrey's` <dbl>, tf_text_cartoon <dbl>, ## # tf_text_cast <dbl>, tf_text_character <dbl>, tf_text_characters <dbl>, ## # tf_text_charming <dbl>, tf_text_child <dbl>, tf_text_childhood <dbl>, ## # tf_text_children <dbl>, `tf_text_children's` <dbl>, ## # tf_text_christmas <dbl>, tf_text_cindy <dbl>, tf_text_city <dbl>, ## # tf_text_classic <dbl>, tf_text_claus <dbl>, tf_text_come <dbl>, ## # tf_text_comedy <dbl>, tf_text_comes <dbl>, tf_text_completely <dbl>, ## # tf_text_costumes <dbl>, tf_text_could <dbl>, `tf_text_couldn't` <dbl>, ## # tf_text_course <dbl>, tf_text_cute <dbl>, tf_text_dad <dbl>, ## # tf_text_day <dbl>, tf_text_definitely <dbl>, tf_text_department <dbl>, ## # tf_text_deschanel <dbl>, … ``` --- ```r review_model <- logistic_reg() %>% set_engine("glm") ``` ```r review_fit <- review_model %>% fit(good ~ ., data = train_data) ``` ```r test_results <- review_test %>% bind_cols( predict(review_fit, test_data) ) test_results %>% accuracy(truth = good, estimate = .pred_class) ``` ``` ## # A tibble: 1 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.583 ``` (This is not an example of finished classification. The Accuracy is not that good, but the general steps you would follow.)