Following my post on repeated phrases in HBO’s Westworld, I was curious to see what else I could do with the transcipts dataset (made available as an .RData file here). In that vain, I attempted to make a word cloud for each episode. However, even removing common english words, word clouds based solely on word frequency were completely boring. There was too much common spoken terms to get any information about what made that episode unique, especially in the context of the show.

The final metric that I settled on for the below word cloud is a measure of how frequent a word appears in a given episode, relative to how common it is in the show. The exact formula is given below:

\[ \begin{aligned} \mbox{Metric} &= (\mbox{Freq in Episode}) * (\mbox{Relative Freq in Episode}) \\ &= (\mbox{Freq in Episode}) * {(\mbox{Freq in Episode}) \over (\mbox{Freq in All Episodes})} \end{aligned} \] where \((\mbox{Freq in All Episodes})\) includes the frequency for all episodes up to and including the episode of interest. This metric lead to a rather interesting word cloud, which seems to highlight words important to the episode, which give a good sense for what the episode was about. For those curious about the color scheme, I use a color palette based on the episode cover image (see above), and color all words with a Metric of 1 or more.

It is clear that S2E09 of Westworld was very much about Emily and mom, with a heavy dose of darkness, pretending, and Plutarch. If you enjoyed this brief post, make sure to follow me on Twitter @Sean__Kent for more data analysis, visualizations, and Westworld. I’d love to hear comments about what to do next with this data, or other visualizations that you’d like to see!

Appendix - R Code

## Setup
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message=FALSE)
knitr::opts_chunk$set(fig.width=10, fig.height=5) 
options(knitr.kable.NA = '')
library(knitr)
## load in packages (and one github repo)
library(tidyverse)
library(rvest)
library(data.table)
library(tm)
library(wordcloud2)
# library(devtools)
# install_github("andreacirilloac/paletter")
library(paletter)

load(url("http://pages.stat.wisc.edu/~kent/blog/2018.06.25_Westworld_wc_s2e09/westworld.RData"))
## make data into a tibble (dataframe)
data.dt <- data %>% 
  lapply(., FUN = data.frame, stringsAsFactors = FALSE) %>%
  setNames(c(paste0("s01e",1:10), paste0("s02e",1:9)) ) %>%
  bind_rows(.id = "episode") %>%
  as.tibble %>%
  setNames(c("episode", "line"))

## add row with no speaker tag, punctuation, or capitalization with parentheticals removed
data.dt <- data.dt %>%
  mutate(line_clean = gsub(pattern = "\\([^\\)]+\\)", x = as.character(line), replacement = NA)) %>%
  mutate(line_clean = gsub(pattern = ".*: ", x = line_clean, replacement = "")) %>%
  mutate(line_clean = gsub("[\\.\\',!\\?-]", x=line_clean, replacement = "")) %>%
  mutate(line_clean = trimws(line_clean)) %>%
  mutate(line_clean = tolower(line_clean)) %>% 
  mutate(n_words = str_count(line_clean, "\\S+"))

## Manually code line corresponding to the end of recap
recap_info <- tibble(season.episode = unique(data.dt$episode),
                     end_of_recap = c(0,  0,  17, 0,  0,  0, 0, 0,  0, 0,
                                  79, 18, 22, 23, 38, 9, 9, 10, 8)  )

## add line number to each episode to the tibble
data.dt.norecap.pre <- data.dt %>% 
  group_by(episode) %>% 
  mutate(ep_line_num = row_number()) %>%
  ungroup()

## Remove beginning of episode recaps via filtering
data.dt.norecap <- integer(0)
for(row in 1:nrow(recap_info)) {
  piece <- data.dt.norecap.pre %>%
    filter(episode == as.character(recap_info[row,"season.episode"])) %>%
    filter(ep_line_num > as.integer(recap_info[row,"end_of_recap"]))
  data.dt.norecap <- bind_rows(data.dt.norecap, piece)
}


get.word.frequency <- function(line.vector) { 
  output <- line.vector %>%
  VectorSource %>% Corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(removePunctuation, preserve_intra_word_contractions = TRUE) %>%
  TermDocumentMatrix %>%
  as.matrix %>%
  sort(x=rowSums(.), decreasing = TRUE) %>%
  data.frame(word = names(.), freq=., row.names = NULL)
  
  return(output)
}


word.freq.all <- data.dt.norecap %>%
  select(line_clean) %>%
  get.word.frequency

episode.of.interest <- "s02e9"

word.freq.episode <- data.dt.norecap %>%
  filter(episode == episode.of.interest) %>%
  select(line_clean) %>%
  get.word.frequency

word.freq.combined <- word.freq.episode %>%
  left_join(word.freq.all, by=c("word"="word"), suffix=c(".episode",".all")) %>%
  mutate(rel.freq = freq.episode/freq.all,
         metric = freq.episode^2 / freq.all) %>%
  filter(!is.na(rel.freq)) %>%
  arrange(desc(metric))
## make color palette for the word cloud
n_words <- sum(word.freq.combined$metric >1)
ww_color_scheme <- create_palette(image_path = "s02e09.jpg",
               number_of_colors = n_words,
               type_of_variable = "categorical")
set.seed(8)
word.cloud <- word.freq.combined %>% select(word, metric) %>%
  wordcloud2(color=ww_color_scheme, shuffle=FALSE)

#install webshot
library(webshot)
webshot::install_phantomjs()

# save it in html
library(htmlwidgets)
saveWidget(word.cloud, "tmp.html", selfcontained = F)
 
# and in pdf
webshot("tmp.html",paste0("word_cloud_",episode.of.interest,".png"), delay =30, vwidth = 800, vheight=494)
Find me
Website
Contact Info

Copyright 2017 Sean Kent All Rights Reserved | Design By W3layouts