Filtering for English Tweets: Unsupervised Language Detection on Twitter

(See a demo here.)

While working on a Twitter sentiment analysis project, I ran into the problem of needing to filter out all non-English tweets. (Asking the Twitter API for English-only tweets doesn’t seem to work, as it nonetheless returns tweets in Spanish, Portuguese, Dutch, Russian, and a couple other languages.)

Since I didn’t have any labeled data, I thought it would be fun to build an unsupervised language classifier. In particular, using an EM algorithm to build a naive Bayes model of English vs. non-English n-gram probabilities turned out to work quite well, so here’s a description.

EM Algorithm

Let’s recall the naive Bayes algorithm: given a tweet (a set of character n-grams), we estimate its language to be the language $L$ that maximizes

$$P(language = L | ngrams) \propto P(ngrams | language = L) P(language = L)$$

Thus, we need to estimate $P(ngram | language = L)$ and $P(language = L)$.
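
Given those two estimates, classification itself is simple. Here's a minimal Ruby sketch of the scoring step (the helper and parameter names are my own illustration, not the post's actual code); it sums log probabilities rather than multiplying raw ones, to avoid numerical underflow:

```ruby
# Classify a tweet as the language maximizing the (log) naive Bayes posterior.
# language_probs: e.g. { "english" => 0.7, "other" => 0.3 }
# ngram_probs:    e.g. { "english" => { "ing" => 0.01, ... }, "other" => { ... } }
def ngrams(text, n = 3)
  text.downcase.chars.each_cons(n).map(&:join)
end

def classify(tweet, language_probs, ngram_probs, unseen = 1e-6)
  language_probs.max_by do |language, prior|
    # Unseen n-grams get a tiny smoothed probability rather than zero,
    # so one new trigram can't rule out a language entirely.
    Math.log(prior) + ngrams(tweet).sum do |gram|
      Math.log(ngram_probs[language].fetch(gram, unseen))
    end
  end.first
end
```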

This would be easy if we knew the language of each tweet, since we could estimate

  • $P(xyz | language = English)$ as (number of times “xyz” appears as a trigram in the English tweets) / (total number of trigrams in the English tweets)
  • $P(language = English)$ as the proportion of English tweets.

Or, it would also be easy if we knew the n-gram probabilities for each language, since we could use Bayes’ theorem to compute the language probabilities of each tweet, and then form weighted versions of the counts above, weighting each tweet by its language probabilities.
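
As a sketch of that weighted estimation step (again with illustrative names: `assignments` maps each tweet to its current per-language probabilities, and `ngrams` is the helper from the sketch above):

```ruby
# Weighted estimation step: given each tweet's current language
# probabilities, re-estimate P(ngram | language) and P(language).
def estimate_params(tweets, assignments, n = 3)
  ngram_counts = Hash.new { |h, k| h[k] = Hash.new(0.0) }
  language_counts = Hash.new(0.0)

  tweets.each do |tweet|
    assignments[tweet].each do |language, weight|
      language_counts[language] += weight
      ngrams(tweet, n).each { |gram| ngram_counts[language][gram] += weight }
    end
  end

  # Normalize the weighted counts into probabilities.
  ngram_probs = ngram_counts.transform_values do |counts|
    total = counts.values.sum
    counts.transform_values { |count| count / total }
  end
  language_probs = language_counts.transform_values { |count| count / tweets.size }

  [language_probs, ngram_probs]
end
```

With hard 0/1 weights this reduces to the simple counts in the bullets above; with soft probabilities it’s exactly the weighted variant.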

The problem is that we know neither of these. So what the EM algorithm says is that we can simply guess:

  • Pretend we know the language of each tweet (by randomly assigning them at the beginning).
  • Using this guess, we can compute the n-gram probabilities for each language.
  • Using the n-gram probabilities for each language, we can recompute the language probabilities of each tweet.
  • Using these recomputed language probabilities, we can recompute the n-gram probabilities.
  • And so on, recomputing the language probabilities and n-gram probabilities over and over. While our guesses will be off in the beginning, the probabilities will eventually converge to a (local) maximum of the likelihood. (In my tests, my language detector would sometimes correctly converge to an English detector, and sometimes it would converge to an English-and-Dutch detector.)
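
Putting the two steps together, the whole algorithm is a short loop. Here's a rough Ruby sketch built on the two helpers above (a real implementation would also smooth the counts and stop once the likelihood stops improving, rather than running a fixed number of iterations):

```ruby
# EM sketch: start from random soft assignments, then alternate the two steps.
def em(tweets, languages = ["english", "other"], iterations = 20)
  # Guess: randomly assign language probabilities to each tweet.
  assignments = tweets.to_h do |tweet|
    weights = languages.map { rand }
    total = weights.sum
    [tweet, languages.zip(weights.map { |w| w / total }).to_h]
  end

  iterations.times do
    # M-step: recompute n-gram and language probabilities from the assignments.
    language_probs, ngram_probs = estimate_params(tweets, assignments)

    # E-step: recompute each tweet's language probabilities via Bayes' theorem,
    # normalizing in log space to avoid underflow.
    assignments = tweets.to_h do |tweet|
      log_scores = language_probs.to_h do |language, prior|
        score = Math.log(prior) + ngrams(tweet).sum do |gram|
          Math.log(ngram_probs[language].fetch(gram, 1e-6))
        end
        [language, score]
      end
      max = log_scores.values.max
      posteriors = log_scores.transform_values { |s| Math.exp(s - max) }
      total = posteriors.values.sum
      [tweet, posteriors.transform_values { |p| p / total }]
    end
  end

  assignments
end
```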

EM Analogy for the Layman

Why does this work? Suppose you suddenly move to New York, and you want a way to differentiate between tourists and New Yorkers based on their activities. Initially, you don’t know who’s a tourist and who’s a New Yorker, and you don’t know which are touristy activities and which are not. So you randomly place people into two groups A and B. (You randomly assign each tweet to a language.)

Now, given all the people in group A, you notice that a large number of them visit the Statue of Liberty; similarly, you notice that a large number of people in group B walk really quickly. (You notice that one set of words often has the n-gram “ing”, and that another set of words often has the n-gram “ias”; that is, you fix the language probabilities for each tweet, and recompute the n-gram probabilities for each language.)

So you start to put people visiting the Statue of Liberty in group A, and you start to put fast walkers in group B. (You fix the n-gram probabilities for each language, and recompute the language probabilities for each tweet.)

With your new A and B groups, you notice more differentiating factors: group A people tend to carry along cameras, and group B people tend to be more finance-savvy.

So you start to put camera-carrying folks in group A, and finance-savvy folks in group B.

And so on. Eventually, you settle on two groups of people and differentiating activities: people who walk slowly and visit the Statue of Liberty, and busy-looking people who walk fast and don’t visit. Assuming there are more native New Yorkers than tourists, you can then guess that the natives are the larger group.

Results

I wrote some Ruby code to implement the above algorithm, and trained it on half a million tweets, using English and “not English” as my two languages. The results looked surprisingly good from just eyeballing:

[Figure: example results]

But in order to get some hard metrics and to tune parameters (e.g., n-gram size), I needed a labeled dataset. So I pulled a set of English-language and Spanish-language documents from Project Gutenberg and split them into training and test sets: the training set consisted of 2000 lines of English and 1000 lines of Spanish, and the test set of 1000 lines of each.

Trained on bigrams, the detector produced:

  • 991 true positives (English lines correctly classified as English)
  • 9 false negatives (English lines incorrectly classified as Spanish)
  • 11 false positives (Spanish lines incorrectly classified as English)
  • 989 true negatives (Spanish lines correctly classified as Spanish)

for a precision of 0.989 and a recall of 0.991.
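(Treating English as the positive class: precision $= 991 / (991 + 11) \approx 0.989$ and recall $= 991 / (991 + 9) = 0.991$.)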

Trained on trigrams, the detector produced:

  • 992 true positives
  • 8 false negatives
  • 10 false positives
  • 990 true negatives

for a precision of 0.990 and a recall of 0.992.

Also, when I looked at the sentences the detector was making errors on, I saw that they almost always consisted of only one or two words (e.g., the incorrectly classified sentences were lines like “inmortal”, “autumn”, and “salir”). So the detector pretty much never made a mistake on a normal sentence!

Code/Demo

I put the code on my Github account, and a quick demo app, trained on trigrams from tweets with lang="en" according to the Twitter API, is here.
