Simpsons Character Text Analysis

How do the characters of The Simpsons interact? What type of language do they use? Has this changed much since the first episode in 1989?

The techniques used in this analysis are intended to demonstrate text analysis following the tidy text philosophy of Silge and Robinson, as set out in Text Mining with R.

Background

The Simpsons is an American animated sitcom. The series is a satirical depiction of working-class life, epitomized by the Simpson family, which consists of Homer, Marge, Bart, Lisa, and Maggie.

Although if you didn’t already know that, this analysis probably isn’t for you.

Data Source

The dataset used contains the characters, locations, episode details, and script lines for approximately 600 Simpsons episodes, dating back to 1989.

The data was downloaded from Kaggle at https://www.kaggle.com/wcukierski/the-simpsons-by-the-data/data as a single .zip file and stored in the ./raw directory for reproducibility.

Data Preparation

First we load the required packages:

library(tidyverse)  # tidy data wrangling
library(tidytext)   # tidy text analysis
library(widyr)      # widen, process, then re-tidy data
library(igraph)     # network analysis and visualisation
library(ggraph)     # grammar of graphics for graphs and networks
library(ggthemes)   # extra ggplot themes

The zip archive is extracted and the four separate CSV files are read into memory.

unzip('./raw/the-simpsons-by-the-data.zip', exdir = './data')  

script <- read_csv('./data/simpsons_script_lines.csv')
characters <- read_csv('./data/simpsons_characters.csv')
episodes <- read_csv('./data/simpsons_episodes.csv')
locations <- read_csv('./data/simpsons_locations.csv')

Next we clean and merge the data into a single data frame called moments: essentially every script line with its associated normalised location, character, and episode data.

moments <- script %>% 
  left_join(locations, by = c('location_id' = 'id')) %>% 
  left_join(characters, by = c('character_id' = 'id')) %>% 
  left_join(episodes, by = c('episode_id' = 'id')) %>% 
  select(id, 
         season_number = season,
         episode_in_season = number_in_season,
         episode_in_series = number_in_series,
         episode_title = title,
         line_number = number, 
         timestamp_in_ms,
         words = normalized_text,
         word_count,
         location = normalized_name.x, 
         character = normalized_name.y, 
         gender, 
         original_air_date) %>% 
  mutate(word_count = as.integer(word_count)) %>% 
  arrange(id)

A preview of the data:

glimpse(moments)
## Observations: 157,462
## Variables: 13
## $ id                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ season_number     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ episode_in_season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ episode_in_series <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ episode_title     <chr> "Simpsons Roasting on an Open Fire", "Simpso...
## $ line_number       <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ timestamp_in_ms   <int> 8000, 8000, 8000, 10000, 10000, 24000, 24000...
## $ words             <chr> NA, NA, "ooo careful homer", "theres no time...
## $ word_count        <int> NA, NA, 3, 6, 2, NA, NA, 5, 21, 3, 17, 3, 27...
## $ location          <chr> "street", "car", "car", "car", "car", "sprin...
## $ character         <chr> NA, NA, "marge simpson", "homer simpson", "h...
## $ gender            <chr> NA, NA, "f", "m", "m", NA, NA, "f", "m", "m"...
## $ original_air_date <date> 1989-12-17, 1989-12-17, 1989-12-17, 1989-12...

What are characters saying? - Term Frequency Analysis

Here we aim to find the most characteristic words for the top characters. This is done using a technique called Term Frequency - Inverse Document Frequency (tf-idf) analysis.

This technique finds commonly used terms for each character, but weights each term by its overall ubiquity: a word that is common across all characters is weighted down, while a term that one character uses often but that is rare overall is weighted up. The result is a ranking of the terms most distinctive to each character.
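
Concretely, tidytext’s bind_tf_idf() multiplies a term’s frequency within a document (here, a character’s lines) by the log of the inverse document frequency: ln(total documents / documents containing the term). A minimal sketch with made-up counts (the words and numbers below are hypothetical, not from the data):

# toy sketch of the tf-idf weighting (hypothetical counts)
toy <- tribble(
  ~character, ~word,        ~n,
  "homer",    "donuts",     10,
  "homer",    "the",        50,
  "lisa",     "saxophone",   5,
  "lisa",     "the",        40
)

toy %>% bind_tf_idf(word, character, n)
# 'the' appears for both characters, so its idf is ln(2/2) = 0 and its
# tf-idf is zero; 'donuts' and 'saxophone' each appear for only one of the
# two characters, so their idf is ln(2/1) and they are weighted up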

The first step is to tokenize each line of script into a tidy format: in this case, one row per word.

tidy_lines <- moments %>% 
  select(season_number, 
         episode_in_series, 
         line_number,
         character, 
         line = words) %>% 
  filter(complete.cases(.)) %>% 
  unnest_tokens(word, line) 

tidy_lines
## # A tibble: 1,290,020 x 5
##    season_number episode_in_series line_number character     word   
##            <int>             <int>       <int> <chr>         <chr>  
##  1             1                 1           2 marge simpson ooo    
##  2             1                 1           2 marge simpson careful
##  3             1                 1           2 marge simpson homer  
##  4             1                 1           3 homer simpson theres 
##  5             1                 1           3 homer simpson no     
##  6             1                 1           3 homer simpson time   
##  7             1                 1           3 homer simpson to     
##  8             1                 1           3 homer simpson be     
##  9             1                 1           3 homer simpson careful
## 10             1                 1           4 homer simpson were   
## # ... with 1,290,010 more rows

We really only care about the top 10 characters, so let’s find out who they are based on total word count.

top10 <- tidy_lines %>% 
  select(character, word) %>% 
  count(character, sort = TRUE) %>% 
  top_n(10, n)

top10
## # A tibble: 10 x 2
##    character               n
##    <chr>               <int>
##  1 homer simpson      273638
##  2 marge simpson      125327
##  3 bart simpson       110760
##  4 lisa simpson       100186
##  5 c montgomery burns  36809
##  6 moe szyslak         33026
##  7 seymour skinner     28324
##  8 ned flanders        23289
##  9 krusty the clown    20978
## 10 grampa simpson      20250

We can see the Simpson family are the most frequent speakers, of course, with various others from the community filling out the top 10.

Next we use a semi-join to keep only the words used by these top 10 characters. We also discard ‘stop words’: commonly used words like ‘a’, ‘the’, and ‘if’.

tidy_lines_top10 <- tidy_lines %>% 
  count(character, word, sort = TRUE) %>% 
  ungroup() %>% 
  semi_join(top10, by = "character") %>% 
  anti_join(stop_words)

Now we attach the tf-idf scores and plot the top terms by character.

# calculating tf-idf score
tfidf <- tidy_lines_top10 %>% 
   bind_tf_idf(word, character, n)

# displaying top results by tf_idf
tfidf %>% 
  group_by(character) %>% 
  top_n(9, tf_idf) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = character)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~character, ncol = 2, scales = "free") +
  coord_flip() +
  scale_fill_manual(values = c(
    "bart simpson" = "#f04830", "c montgomery burns" = "#007878", "grampa simpson" = "#d89090", 
    "homer simpson" = "#78d8ff", "krusty the clown" = "#009078", "lisa simpson" = "#ffd818",
    "marge simpson" = "#d8f0a8", "moe szyslak" = "#787878", "ned flanders" = "#906030","seymour skinner" = "#606078")) +
  theme_minimal() +
  labs(x = NULL, 
       y = "tf-idf", 
       title = "What are the key terms used by Simpsons characters?",
       subtitle = "Top 10 characters from 600 episodes 1989 - 2016") 

A few nice observations are:

  • Characters’ loved ones really come out on top, along with particular nicknames.
  • Ned Flanders’ results are accurate and hilarious.
  • You get a real personal sense of these characters and what’s important to them.

Given the method penalises overly common words, you get to dip deep into each character’s well of words, which is quite interesting.

How are they saying it? - Sentiment Analysis

Next we look at the sentiment behind the characters’ lines over time.

Sentiment is determined by joining a sentiment lexicon to the tidy text dataset. These lexicons are manually curated dictionaries that assign each word a score, in this case ranging from -5 (most negative) to +5 (most positive).
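
For reference, the AFINN lexicon ships with tidytext and can be inspected directly. Note that in the tidytext version used here the score column is named score; newer releases rename it to value.

# a quick peek at the AFINN lexicon
get_sentiments("afinn") %>% 
  filter(word %in% c("love", "hate"))
# 'love' scores +3 and 'hate' scores -3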

We group by episode here to evaluate changes over time.

# creating a new tidy dataset with episode included
tidy_eps <- tidy_lines %>% 
  count(character, episode_in_series, word, sort = TRUE) %>% 
  ungroup() %>% 
  semi_join(top10, by = "character") %>% 
  anti_join(stop_words) 

# joining the AFINN sentiment lexicon
simpson_sentiment <- tidy_eps %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(character, episode_in_series) %>% 
  summarise(sentiment = sum(score))

# plotting
ggplot(simpson_sentiment, aes(episode_in_series, sentiment, colour = character)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ character) +
  scale_colour_manual(values = c(
    "bart simpson" = "#f04830", "c montgomery burns" = "#007878", "grampa simpson" = "#d89090", 
    "homer simpson" = "#78d8ff", "krusty the clown" = "#009078", "lisa simpson" = "#ffd818",
    "marge simpson" = "#d8f0a8", "moe szyslak" = "#787878", "ned flanders" = "#906030", "seymour skinner" = "#606078")) +
  theme_minimal() +
  labs(title = 'How does sentiment change in Simpsons characters over time?',
       subtitle = "Top 10 characters from 600 episodes 1989 - 2016",
       x = 'Episode',
       y = 'Net Sentiment')

I can’t detect any massive long-term changes, but some characters certainly have the odd episode where sentiment swings sharply.

For example:

  • Moe has a large positive spike around episode 450. On inspection this is episode 436: Eeny Teeny Maya Moe, where Moe falls in love with a woman over the Internet.

  • The normally mild-mannered Ned Flanders has a negative spike around episode 50. This turns out to be episode 51: Bart the Lover, where Ned and Homer come into conflict over Homer’s bad language.
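
The ‘on inspection’ step above amounts to a quick join back to the episode titles. A rough sketch, assuming the objects built earlier (episodes still holds the raw episode table with number_in_series and title columns):

# largest negative character-episode sentiment scores, with titles attached
simpson_sentiment %>% 
  ungroup() %>% 
  arrange(sentiment) %>%   # most negative first; use desc() for most positive
  left_join(select(episodes, number_in_series, title), 
            by = c("episode_in_series" = "number_in_series")) %>% 
  head()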

How are characters connected? - Network Graph

Understanding which characters are ‘connected’ can be done by analysing a network graph.

We can take the pairwise correlation of characters using a measure called the ‘phi coefficient’, which is essentially the Pearson correlation applied to binary data. It measures how often two characters appear in the same place at the same time, or are absent together, versus one being present without the other.

The result can be represented as a network graph, which makes interpretation much easier.
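
As a rough sketch of what widyr’s pairwise_cor() computes here: for each pair of characters it builds a 2x2 presence/absence table across scenes and evaluates phi = (n11 * n00 - n10 * n01) / sqrt(n1. * n0. * n.1 * n.0), where n11 counts scenes featuring both characters, n00 scenes featuring neither, and so on. A toy example with hypothetical characters and scenes:

# toy illustration of pairwise_cor() (hypothetical scenes, not from the data)
toy_scenes <- tibble(
  char  = c("moe", "barney", "moe", "barney", "lisa", "lisa"),
  scene = c(1, 1, 2, 2, 3, 4)
)

toy_scenes %>% pairwise_cor(char, scene, sort = TRUE)
# moe and barney always share a scene, so their phi is 1; lisa never shares
# a scene with either, so her correlations with them are negative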

# selecting distinct combinations of characters in the same location during an episode
interactions <- moments %>% 
  group_by(character) %>% 
  mutate(word_count_total = sum(as.numeric(word_count), na.rm = TRUE)) %>% 
  select(episode_in_series, location, word_count_total, char = character) %>% 
  ungroup() %>% 
  filter(complete.cases(.)) %>% 
  unique() %>% 
  bind_cols(scene = group_indices(., episode_in_series, location)) 
 
# finding pairwise correlations
correlations <- interactions %>% 
  group_by(char) %>% 
  filter(n() >= 10) %>% 
  pairwise_cor(char, scene, sort = TRUE) %>% 
  filter(correlation > 0.25)

# custom vertices for graph
vertices <- correlations %>% 
  gather(col, character, item1, item2) %>% 
  inner_join(interactions, by = c('character' = 'char')) %>% 
  select(character, word_count_total) %>% 
  unique()

# make graph
correlations %>% 
  graph_from_data_frame(vertices = vertices) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), 
                 edge_colour = "#78d8ff", 
                 show.legend = FALSE) +
  geom_node_point(aes(size = word_count_total), 
                  color = "#ffd818", 
                  show.legend = FALSE) +
  geom_node_text(aes(label = name), 
                 repel = TRUE, 
                 size = 3, 
                 segment.alpha = 0.5, 
                 colour = "white") +
  labs(title = "The Simpsons Character Network",
       subtitle = "Highly connected characters from 1989 - 2016",
       caption = "@deanmarchiori") +
  theme_void() +
  theme(plot.background = element_rect(fill = 'black'),
        plot.title = element_text(colour = "white"),
        plot.subtitle  = element_text(colour = "white"),
        plot.caption  = element_text(colour = "white"))

Interestingly, the Simpson family have few external connections. The exception is Bart, who is often seen with his nerdy friend Milhouse. This links him to a tight cluster of schoolyard friends and teachers.

Other interesting features are Moe and the ‘barflies’, with a stronger connection between Lenny and Carl due to their close friendship, and the Flanders family’s dependence on Rev. Lovejoy.

Conclusions

Overall, to an outsider this analysis may seem baffling without context, but I feel it arms you with just enough knowledge to explore what is truly interesting about the main characters.

References

http://toddwschneider.com/posts/the-simpsons-by-the-data/

https://github.com/toddwschneider/flim-springfield

https://www.tidytextmining.com/

Reproducibility

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2    ggthemes_3.2.0  ggraph_1.0.1    igraph_1.0.1   
##  [5] widyr_0.1.0     tidytext_0.1.7  forcats_0.2.0   stringr_1.2.0  
##  [9] dplyr_0.7.4     purrr_0.2.4     readr_1.1.1     tidyr_0.8.0    
## [13] tibble_1.4.2    ggplot2_2.2.1   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] ggrepel_0.7.0     Rcpp_0.12.15      lubridate_1.7.1  
##  [4] lattice_0.20-34   utf8_1.1.3        assertthat_0.2.0 
##  [7] digest_0.6.15     psych_1.6.6       ggforce_0.1.1    
## [10] R6_2.2.2          cellranger_1.1.0  plyr_1.8.4       
## [13] evaluate_0.9      httr_1.3.1        pillar_1.1.0     
## [16] rlang_0.1.6       lazyeval_0.2.1    readxl_1.0.0     
## [19] rstudioapi_0.7    Matrix_1.2-7.1    rmarkdown_1.1    
## [22] labeling_0.3      udunits2_0.13     munsell_0.4.3    
## [25] broom_0.4.3       compiler_3.4.3    janeaustenr_0.1.5
## [28] modelr_0.1.1      pkgconfig_2.0.1   mnormt_1.5-4     
## [31] htmltools_0.3.6   gridExtra_2.0.0   crayon_1.3.4     
## [34] MASS_7.3-44       SnowballC_0.5.1   grid_3.4.3       
## [37] nlme_3.1-128      jsonlite_1.5      gtable_0.2.0     
## [40] magrittr_1.5      formatR_1.2.1     units_0.5-1      
## [43] scales_0.5.0      tokenizers_0.1.4  cli_1.0.0        
## [46] stringi_1.1.6     reshape2_1.4.3    viridis_0.3.4    
## [49] xml2_1.2.0        tools_3.4.3       glue_1.2.0       
## [52] tweenr_0.1.5      hms_0.4.1         parallel_3.4.3   
## [55] yaml_2.1.13       colorspace_1.3-2  rvest_0.3.2      
## [58] knitr_1.14        bindr_0.1         haven_1.1.1