A Visual, Layman's Introduction to Language Models in NLP

(This is a crosspost from the official Surge AI blog, where we're building the greatest source of NLP content. If you need help with data labeling and NLP, say hello!)


Language models are a core component of NLP systems, from machine translation to speech recognition. Intuitively, you can think of language models as answering: “How English is this phrase?”

Take, for example, these two sentences: “the cat runs up the tree” and “cat up the runs tree the”. A good language model should assign a higher probability to the former since it’s the more “English”-sounding one. This scoring can then be used in dozens of downstream applications.

For example, speech recognition systems need to disambiguate between phonetically similar phrases like “recognize speech” and “wreck a nice beach”, and a language model can help pick the one that sounds the most natural in a given context. For instance, a speech recognition system transcribing a lecture on audio systems should likely prefer “recognize speech”, whereas a news flash about an extraterrestrial invasion of Miami should likely prefer “wreck a nice beach”.

Wrecking a nice beach

Recognizing speech

Similarly, a French-English machine translation system could leverage a language model to rank translation candidates according to their fluency. If two translations for “je t’aime” are “I you love” and “I love you”, an English language model would hopefully pick the latter as the better translation.

So how do language models work?

Intuition: Thinking Like an Alien

Imagine you’re an alien whose spaceship crashes into Earth. You’re far from home, and you need to blend in until the rescue team arrives. You want to pick up some food, maybe watch Squid Game to learn about human culture, and so you need to learn how to speak like an earthling first.

How do you do this? You turn to your two robot assistants. They’re advanced probability machines, and you hope that they can also figure out human language:

— — — — — — — — — —

You: “Hey, robots.”

Robots: “Beep”

You: “Okay, listen up. I want to buy a burger. Now, when I get to the counter, I want each of you to come up with a couple of sentences you think I could say, along with your best guess of the probability that a human would say it.”

Robots: “Beep?”

You: “Yes, you have to.”

— — — — — — — — — —

You’re hoping your robots will:

  • Assign a high probability to responses that lead to delicious food. (For example: “Two cheeseburgers and an order of fries”)
  • Assign a low probability to unintelligible responses that lead to fear, confusion, and a call to the Men in Black. (For example: “Fries Santa cheese dirt hello”)

In other words, you want your robot assistants to act as language models: given any piece of language as input, they’ll score how “human” it sounds.

A language model scoring two responses

Of course, since your robots haven’t learned English yet, they initially won’t be very good. “Two cheeseburgers and an order of fries” sounds just as human to them as “fries Santa cheese dirt hello,” or any other mishmash of noise.

So, to improve your two robots, the three of you spend an hour observing customers at Shake Shack. Over that hour, five customers arrive. Their responses to the cashier’s prompt are:

  • Customer 1: “Two cheeseburgers”
  • Customer 2: “Two cheeseburgers”
  • Customer 3: “Two cheeseburgers”
  • Customer 4: “Fries”
  • Customer 5: “The daily special”

What do your robots learn from these 5 pieces of data?

Robot A comes from your home planet, a world of elite scientists who’ve mastered the mathematics of spaceflight, and it takes a purely probabilistic approach. If you ask it to score every potential response to “What would you like to order today?”, it would say:

  • P(“two cheeseburgers”) = 3 / 5 (since “two cheeseburgers” was uttered in 3 out of the 5 interactions)
  • P(“fries”) = 1 / 5 (since “fries” was uttered in 1 out of the 5 interactions)
  • P(“the daily special”) = 1 / 5 (since “the daily special” was uttered in 1 out of the 5 interactions)
  • P(anything else) = 0
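Robot A's approach can be sketched in a few lines of Python. This is only an illustration of the counting idea described above, not a production language model; the function name `robot_a_probability` is a made-up label for this sketch.

```python
from collections import Counter
from fractions import Fraction

# The five customer responses observed at Shake Shack (from the example above).
observations = [
    "two cheeseburgers",
    "two cheeseburgers",
    "two cheeseburgers",
    "fries",
    "the daily special",
]

counts = Counter(observations)
total = len(observations)


def robot_a_probability(utterance: str) -> Fraction:
    """Relative frequency of the exact utterance; 0 if it was never observed."""
    return Fraction(counts[utterance.lower()], total)


for utterance in ["two cheeseburgers", "fries", "the daily special", "cheeseburgers two"]:
    print(f"P({utterance!r}) = {robot_a_probability(utterance)}")
```

Because the model only memorizes whole utterances, anything it has not heard verbatim, such as “cheeseburgers two”, gets a probability of exactly 0.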

Robot B, in contrast, comes from a universe where the dimensions of space are warped, so it ignores the order of words. For example, it thinks “two cheeseburgers” and “cheeseburgers two” mean the same thing. (This is analogous to “bag-of-words” models like Naive Bayes.) So it’s as if it heard the following:

  • Customer 1: “two cheeseburgers”, “cheeseburgers two”
  • Customer 2: “two cheeseburgers”, “cheeseburgers two”
  • Customer 3: “two cheeseburgers”, “cheeseburgers two”
  • Customer 4: “fries”
  • Customer 5: “the daily special”, “the special daily”, “daily the special”, “daily special the”, “special the daily”, “special daily the”

So if you asked it to score every potential response to “What would you like to order today?” it would assign these probabilities:

  • P(“two cheeseburgers”) = P(“cheeseburgers two”) = 3 / 13
  • P(“fries”) = 1 / 13
  • P(“the daily special”) = P(“the special daily”) = P(“special daily the”) = P(“special the daily”) = P(“daily special the”) = P(“daily the special”) = 1 / 13
  • P(anything else) = 0
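Robot B's order-blind counting can be sketched the same way. The sketch below assumes, as in the list above, that each observed utterance is expanded into every distinct ordering of its words before counting; the helper names `orderings` and `robot_b_probability` are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

# The same five customer responses Robot A observed.
observations = [
    "two cheeseburgers",
    "two cheeseburgers",
    "two cheeseburgers",
    "fries",
    "the daily special",
]


def orderings(utterance: str) -> set:
    """All distinct word orderings of an utterance, joined back into strings."""
    words = utterance.lower().split()
    return {" ".join(p) for p in permutations(words)}


# Robot B "hears" every reordering of each observed utterance.
heard = Counter()
for utterance in observations:
    for variant in orderings(utterance):
        heard[variant] += 1

total = sum(heard.values())  # 3*2 + 1 + 6 = 13 utterances "heard"


def robot_b_probability(utterance: str) -> Fraction:
    """Relative frequency among all reordered utterances; 0 if never heard."""
    return Fraction(heard[utterance.lower()], total)


for utterance in ["two cheeseburgers", "cheeseburgers two", "fries", "daily the special"]:
    print(f"P({utterance!r}) = {robot_b_probability(utterance)}")
```

The probabilities over everything Robot B heard still sum to 1, but the mass is now shared with orderings no human would say.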

Evaluating Language Models

One question, then, is: which of your robots performs better? Remember that “two cheeseburgers” and “cheeseburgers two” sound equally valid to an uninformed alien!

So how should you evaluate your two robots to see which is the better language model to use?

Human Evaluation

One approach is human evaluation. Because your robots are trying to imitate human language, why not ask humans how good their imitations are? So you stand outside Shake Shack, and every time a customer approaches, you ask each robot to generate an output and the customer to evaluate it. If the customer thinks it’s a good, human-like response, they’ll assign it a score of 1; otherwise, they’ll score it 0.

For example:

  • At the approach of the first customer: Robot A and Robot B both say “two cheeseburgers”. The customer thinks this is a very human response, so scores both of these as 1 (human-like). A: 1.0, B: 1.0.
  • With the second customer: Robot A says “fries” (score: 1), while Robot B says “cheeseburgers two” (score: 0). A: 1.0, B: 0.0.
  • With the third customer: Robot A says “fries” (score: 1), while Robot B says “daily the special” (score: 0). A: 1.0, B: 0.0.

One way to measure the quality of a language model is by asking human judges.

Thus, Robot A would have an average score of (1.0 + 1.0 + 1.0) / 3 = 1.0, while Robot B would have an average score of (1.0 + 0.0 + 0.0) / 3 = 0.333. When evaluated by humans, Robot A is superior!
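The averaging above is simple enough to do by hand, but a small sketch shows how it could be tallied as more customers are surveyed. The scores are the ones from the three customers above; the dictionary layout is just one convenient choice.

```python
from statistics import mean

# Scores from the three customers above: 1.0 = human-like, 0.0 = not.
human_scores = {
    "Robot A": [1.0, 1.0, 1.0],
    "Robot B": [1.0, 0.0, 0.0],
}

# Average score per robot across all judged responses.
average_scores = {robot: mean(scores) for robot, scores in human_scores.items()}

for robot, average in average_scores.items():
    print(f"{robot}: {average:.3f}")
```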

Task-Specific Evaluation

Another approach would be to evaluate the outputs against a downstream, real-world task. In our alien situation, your goal is to get food from Shake Shack, so you could measure whether or not your robots help you achieve that goal.

For example:

  • Robot A goes up to the counter. When the cashier asks “What would you like to order today?”, it outputs “two cheeseburgers”. The cashier understands and gives it two cheeseburgers. Success! A: 1.0.
  • Robot A goes up to the counter again. This time, it says “fries”. The cashier understands and gives it a fresh bag of fries. Success again! A: 1.0.
  • Next, Robot B goes up to the counter and says “cheeseburgers two”. The cashier doesn’t understand, so gives it nothing. Failure! B: 0.0.
  • Robot B tries again with “the daily special”. The cashier understands this time, so gives it the Tuesday Taco. Success! B: 1.0.

In this task-based evaluation, the better language model leads to actual food.

So under this evaluation method, Robot A scores (1.0 + 1.0) / 2 = 1.0, while Robot B scores (0.0 + 1.0) / 2 = 0.5. Again, we find that Robot A’s language model is superior.

Intrinsic Evaluation and Perplexity

Human evaluations and task-based evaluations are often the most robust way to measure your robots’ performance. But sometimes you want a faster and dirtier way of comparing language models; maybe you don’t have the means to get humans to score your robots’ output, and you can’t risk blowing your cover as an alien with a bad response at Shake Shack.

This is where intrinsic evaluations come into play. One type of intrinsic evaluation works by measuring how perplexed your robots are when they encounter responses by Shake Shack customers.

For example, imagine that as part of their training, Robot A and Robot B hear only a single sentence: “two cheeseburgers and a coke”. The next day, they hear someone say “a coke and two cheeseburgers”. Robot B isn’t surprised at all to hear this response — after all, it ignores the order of words, and so it thinks that “two cheeseburgers and a coke” and “a coke and two cheeseburgers” are equivalent. However, Robot A is very perplexed since to it, “two cheeseburgers and a coke” and “a coke and two cheeseburgers” are as different as “two cheeseburgers” and “cheeseburgers two”.

The better language model is less perplexed.

Lower perplexity means a better language model, so in this example, Robot B finally beats Robot A!
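To make “perplexed” concrete, here is a minimal sketch of this example. It assumes the common definition of perplexity, which this post does not spell out: the inverse of the probability a model assigns to a sentence, normalized by the number of words, P(sentence)^(-1/N). It also assumes Robot B spreads its probability evenly over every reordering of the one training sentence, as it did with the Shake Shack orders. All function names are hypothetical.

```python
import math
from itertools import permutations

training_sentence = "two cheeseburgers and a coke"
test_sentence = "a coke and two cheeseburgers"


def robot_a_probability(sentence: str) -> float:
    """Exact-match model: all probability mass sits on the training sentence."""
    return 1.0 if sentence == training_sentence else 0.0


# Robot B treats every reordering of the training words as equally likely.
robot_b_orderings = {" ".join(p) for p in permutations(training_sentence.split())}


def robot_b_probability(sentence: str) -> float:
    """Uniform probability over all orderings of the training words."""
    if sentence in robot_b_orderings:
        return 1.0 / len(robot_b_orderings)
    return 0.0


def perplexity(probability: float, num_words: int) -> float:
    """Inverse probability normalized by length; infinite if probability is 0."""
    if probability == 0.0:
        return math.inf
    return probability ** (-1.0 / num_words)


n = len(test_sentence.split())
ppl_a = perplexity(robot_a_probability(test_sentence), n)
ppl_b = perplexity(robot_b_probability(test_sentence), n)

print("Robot A perplexity:", ppl_a)
print("Robot B perplexity:", ppl_b)
```

Under these assumptions Robot A assigns the new sentence zero probability, so its perplexity is infinite, while Robot B's stays finite. Real language models typically use smoothing so that unseen sentences never get a probability of exactly zero.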

This scenario also shows why language model evaluation is difficult: which language model appears better depends highly on your evaluation methods and your goals.


In summary, this post provided an overview of a few key concepts surrounding language models:

  • First, we defined a language model as an algorithm that scores how “human” a sentence is. (More formally, a language model maps pieces of text to probabilities.)
  • We described a way to train language models: by observing language and turning these observations into probabilities.
  • We discussed three approaches to evaluating the quality of language models: human evaluation (did the robot responses sound natural to a human?), downstream tasks (did the robot responses lead to actual food?), and intrinsic evaluations (how perplexed were the robots by the human utterances?).

In future posts, we’ll dive into the mathematics and science behind these concepts. If you enjoyed this post, follow us on Twitter at @HelloSurgeAI! And if you need a high-quality data labeling platform and workforce, used by top AI companies and research labs around the world, reach out to us at Surge AI.

Edwin Chen

Founder at Surge AI. We're a team of engineers and researchers from Google, Facebook, Harvard, and MIT building a modern data labeling platform and workforce for NLP.

Need obsessively high-quality human-powered data fast? Reach out! We help top AI companies like OpenAI, Amazon, and Airbnb create powerful human-labeled datasets to train and measure their AI.

Former AI & engineering lead at Google, Facebook, Twitter, Dropbox, and MSR. Pure math, theoretical CS, and linguistics at MIT.
