Topic Modeling the Sarah Palin Emails

LDA-based Email Browser

Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I’ve been working on some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups.

I threw up a simple demo app to view the organized documents here.

What is Latent Dirichlet Allocation?

Briefly, given a set of documents, LDA tries to learn the latent topics underlying the set. It represents each document as a mixture of topics (generated from a Dirichlet distribution), each of which emits words with a certain probability.

For example, given the sentence “I listened to Justin Bieber and Lady Gaga on the radio while driving around in my car”, an LDA model might represent this sentence as 75% about music (a topic which, say, emits the words Bieber with 10% probability, Gaga with 5% probability, radio with 1% probability, and so on) and 25% about cars (which might emit driving with 15% probability and cars with 10% probability).

If you’re familiar with latent semantic analysis, you can think of LDA as a generative version. (For a more in-depth explanation, I wrote an introduction to LDA here.)

Sarah Palin Email Topics

Here’s a sample of the topics learnt by the model, as well as the top words for each topic. (Names, of course, are based on my own interpretation.)

  • Wildlife/BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, denby, fishing, …
  • Energy/Fuel/Oil/Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, …
  • Trig/Family/Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, …
  • Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, …
  • Education/Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, …
  • Presidential Campaign/Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, …

Here’s a sample email from the wildlife topic:

Wildlife Email

I also thought the classification for this email was really neat: the LDA model labeled it as 10% in the Presidential Campaign/Elections topic and 90% in the Wildlife topic, and it’s precisely a wildlife-based protest against Palin as a choice for VP:

Wildlife-VP Protest

Future Analysis

In a future post, I’ll perhaps see if we can glean any interesting patterns from the email topics. For example, for a quick graph now, if we look at the percentage of emails in the Trig/Family/Inspiration topic across time, we see that there’s a spike in April 2008 – exactly (and unsurprisingly) the month in which Trig was born.

Trig

Edwin Chen
Hanging. MIT, Microsoft Research, Clarium, Twitter, Google.

Email
Twitter
Github
Google+
LinkedIn
Quora


Recent Posts

Propensity Modeling, Causal Inference, and Discovering Drivers of Growth

Improving Twitter Search with Real-Time Human Computation

Edge Prediction in a Social Graph: My Solution to Facebook's User Recommendation Contest on Kaggle

Soda vs. Pop with Twitter

Making the Most of Mechanical Turk: Tips and Best Practices

Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process

Instant Interactive Visualization with d3 + ggplot2

Movie Recommendations and More via MapReduce and Scalding

Quick Introduction to ggplot2

Introduction to Conditional Random Fields