Edwin Chen's Blog

A Visual Tool for Exploring Word Embeddings

2022-02-11T00:00:00+01:00

I built a visualization to explore embeddings a few years ago, but never posted it more broadly. So here it is! http://blog.echen.me/embedding-explorer/

These are GloVe embeddings projected into 2D, colorized via k-means in the original space.

You can see, for example, that the cluster in pink at the bottom right is a cluster of names.

And the cluster in red on the right is a cluster of geographical terms.

We can click on one of these points ("iceland") to see its nearest neighbors in the high-dimensional space (mostly other countries!) as well as other points that belong to the same cluster (Cluster 18 is this red cluster).

We can also inspect each individual embedding dimension, to understand what it's picking up. Embedding Dimension 1, for example, seems to capture sportiness.

We can also slide through the points based on their embedding dimension to get a better sense:

Anyways, play around with the explorer here, and feedback is always welcome!

Surge AI: A New Data Labeling Platform and Workforce for NLP

2020-11-30T00:00:00+01:00

tl;dr I started Surge AI to fix the problems I've always encountered with getting high-quality, human-labeled data at scale. Think MTurk 2.0—but with an obsessive focus on quality and speed, and an elite workforce you can trust. If you've ever had problems getting human-annotated data, or wish you had a labeling platform you could use with your own workforce, reach out at edwin@surgehq.ai or get started here! We work with amazing AI companies like OpenAI, Airbnb, Amazon, and more on cutting-edge data and ML problems, and we'd love to help you as well.

Getting trustworthy human-labeled data has constantly been one of my biggest blockers, through a decade of working on real-world AI. Whether at Google, Facebook, or Twitter:

Getting ground truth for training our ML models, and measuring their relevance and precision, invariably took months from backlogged teams of in-house labelers.
Speed and scale weren't the only issues—in-house labelers often weren't high quality either! When your labeling software is just an Excel spreadsheet, it's difficult to monitor and motivate performance.
The quality and speed of external labeling companies were even worse. We'd wait 3 months to get 10K rows of text labeled for a new NLP model, only to find that 50% of the labels were complete spam.

This was a huge bottleneck for our AI and data science teams. If it takes months to get the data your ML needs, how can you iterate and move fast?

So I built our human evaluation platforms at YouTube and Twitter to fix these problems. And we used them to do amazing things: from real-time human/AI search and advertising systems, to shifting our recommender objectives from clickbait to human preferences (especially important in this age of political polarization!), and more.

A cornerstone of these data labeling platforms was the premise that in order to build increasingly sophisticated real-world AI—for complex problems like hate speech and misinformation—we need skilled, motivated human workforces to measure and train them. So when COVID hit, and a huge educated population became out of a job or stuck at home, we realized there was an opportunity to turn the platforms we'd built into a startup, and created Surge AI.

We've grown 10x in the past 6 months, and have worked with an amazing set of early customers on amazing things:

We've helped companies boost their ML model performance by 50%, simply by relabeling their existing datasets.
Our customers have dropped their time waiting for new labels—their biggest bottleneck in training new models—from 3-6 months to just a few days.
We label millions of images and pieces of text every week, in over a dozen languages.
We're not limited to content labeling: we help our customers with everything from content moderation and AI fairness, to making phone calls to gather business and medical information, to core image recognition and NLP.
Teams even use our API to gather human judgments in real time!

So if you need a high-quality labeling workforce you can trust, or want to use our labeling software for your own agents, please reach out or get started! I'd love to help you get the datasets and machine learning models you need, or even just hear more about your own problems and experiences.

How Could Facebook Align its ML Systems to Human Values? A Data-Driven Approach

2020-02-12T00:00:00+01:00

This is a crosspost from the official Surge AI blog.

For background on this blog post...

I used to work at Facebook, YouTube, and Twitter. One of the problems I big worked on: what was the right objective function to align our AI systems towards?

Optimizing for watch time at YouTube, for example, led to longer videos for the sake of longer videos, and clicks on videos with racy thumbnails.

Optimizing engagement at Facebook led to low-quality clickbait, and Hooter's appearing as the top search result when you searched for restaurants in Houston.

Optimizing for favorites and replies at Twitter led to toxic content at the top of the feed.

So while watch time, engagement, and replies would always go up – were these really the products we wanted to build? What happened to Facebook's original mission of connecting users with their friends and family? What did "favorites" have to do with being the platform for public conversation at Twitter? A ton of engineering and data science time is spent measuring active users, but why were there no dashboards measuring progress to broader product goals?

So could we figure out a metric that was better tuned to human values and preferences – but also fast, rigorous, and easily measurable? After all, we still need our A/B tests, ML objectives, and OKRs to align the company around.

This question is particularly important today, with all the troubles that social media platforms face, so I wrote up an approach I've often worked on, based on Facebook data. Read it on the Surge AI blog!

Exploring LSTMs

2017-05-30T00:00:00+02:00

LSTMs are behind a lot of the amazing achievements deep learning has made in the past few years, and they're a fairly simple extension to neural networks under the right view. So I'll try to present them as intuitively as possible – in such a way that you could have discovered them yourself.

But first, a picture:

Aren't LSTMs beautiful? Let's go.

(Note: if you're already familiar with neural networks and LSTMs, skip to the middle – the first half of this post is a tutorial.)

Neural Networks

Imagine we have a sequence of images from a movie, and we want to label each image with an activity (is this a fight?, are the characters talking?, are the characters eating?).

How do we do this?

One way is to ignore the sequential nature of the images, and build a per-image classifier that considers each image in isolation. For example, given enough images and labels:

Our algorithm might first learn to detect low-level patterns like shapes and edges.
With more data, it might learn to combine these patterns into more complex ones, like faces (two circular things atop a triangular thing atop an oval thing) or cats.
And with even more data, it might learn to map these higher-level patterns into activities themselves (scenes with mouths, steaks, and forks are probably about eating).

This, then, is a deep neural network: it takes an image input, returns an activity output, and – just as we might learn to detect patterns in puppy behavior without knowing anything about dogs (after seeing enough corgis, we discover common characteristics like fluffy butts and drumstick legs; next, we learn advanced features like splooting) – in between it learns to represent images through hidden layers of representations.

Mathematically

I assume people are familiar with basic neural networks already, but let's quickly review them.

A neural network with a single hidden layer takes as input a vector x, which we can think of as a set of neurons.
Each input neuron is connected to a hidden layer of neurons via a set of learned weights.
The jth hidden neuron outputs $h_j = \phi(\sum_i w_{ij} x_i)$, where $\phi$ is an activation function.
The hidden layer is fully connected to an output layer, and the jth output neuron outputs $y_j = \sum_i v_{ij} h_i$. If we need probabilities, we can transform the output layer via a softmax function.

In matrix notation:

$$h = \phi(Wx)$$

$$y = Vh$$

where

x is our input vector
W is a weight matrix connecting the input and hidden layers
V is a weight matrix connecting the hidden and output layers
Common activation functions for $\phi$ are the sigmoid function, $\sigma(x)$, which squashes numbers into the range (0, 1); the hyperbolic tangent, $tanh(x)$, which squashes numbers into the range (-1, 1), and the rectified linear unit, $ReLU(x) = max(0, x)$.

Here's a pictorial view:

(Note: to make the notation a little cleaner, I assume x and h each contain an extra bias neuron fixed at 1 for learning bias weights.)

Remembering Information with RNNs

Ignoring the sequential aspect of the movie images is pretty ML 101, though. If we see a scene of a beach, we should boost beach activities in future frames: an image of someone in the water should probably be labeled swimming, not bathing, and an image of someone lying with their eyes closed is probably suntanning. If we remember that Bob just arrived at a supermarket, then even without any distinctive supermarket features, an image of Bob holding a slab of bacon should probably be categorized as shopping instead of cooking.

So what we'd like is to let our model track the state of the world:

After seeing each image, the model outputs a label and also updates the knowledge it's been learning. For example, the model might learn to automatically discover and track information like location (are scenes currently in a house or beach?), time of day (if a scene contains an image of the moon, the model should remember that it's nighttime), and within-movie progress (is this image the first frame or the 100th?). Importantly, just as a neural network automatically discovers hidden patterns like edges, shapes, and faces without being fed them, our model should automatically discover useful information by itself.
When given a new image, the model should incorporate the knowledge it's gathered to do a better job.

This, then, is a recurrent neural network. Instead of simply taking an image and returning an activity, an RNN also maintains internal memories about the world (weights assigned to different pieces of information) to help perform its classifications.

Mathematically

So let's add the notion of internal knowledge to our equations, which we can think of as pieces of information that the network maintains over time.

But this is easy: we know that the hidden layers of neural networks already encode useful information about their inputs, so why not use these layers as the memory passed from one time step to the next? This gives us our RNN equations:

$$h_t = \phi(Wx_t + Uh_{t-1})$$

$$y_t = Vh_t$$

Note that the hidden state computed at time $t$ ($h_t$, our internal knowledge) is fed back at the next time step. (Also, I'll use concepts like hidden state, knowledge, memories, and beliefs to describe $h_t$ interchangeably.)

Longer Memories through LSTMs

Let's think about how our model updates its knowledge of the world. So far, we've placed no constraints on this update, so its knowledge can change pretty chaotically: at one frame it thinks the characters are in the US, at the next frame it sees the characters eating sushi and thinks they're in Japan, and at the next frame it sees polar bears and thinks they're on Hydra Island. Or perhaps it has a wealth of information to suggest that Alice is an investment analyst, but decides she's a professional assassin after seeing her cook.

This chaos means information quickly transforms and vanishes, and it's difficult for the model to keep a long-term memory. So what we'd like is for the network to learn how to update its beliefs (scenes without Bob shouldn't change Bob-related information, scenes with Alice should focus on gathering details about her), in a way that its knowledge of the world evolves more gently.

This is how we do it.

Adding a forgetting mechanism. If a scene ends, for example, the model should forget the current scene location, the time of day, and reset any scene-specific information; however, if a character dies in the scene, it should continue remembering that he's no longer alive. Thus, we want the model to learn a separate forgetting/remembering mechanism: when new inputs come in, it needs to know which beliefs to keep or throw away.
Adding a saving mechanism. When the model sees a new image, it needs to learn whether any information about the image is worth using and saving. Maybe your mom sent you an article about the Kardashians, but who cares?
So when new a input comes in, the model first forgets any long-term information it decides it no longer needs. Then it learns which parts of the new input are worth using, and saves them into its long-term memory.
Focusing long-term memory into working memory. Finally, the model needs to learn which parts of its long-term memory are immediately useful. For example, Bob's age may be a useful piece of information to keep in the long term (children are more likely to be crawling, adults are more likely to be working), but is probably irrelevant if he's not in the current scene. So instead of using the full long-term memory all the time, it learns which parts to focus on instead.

This, then, is an long short-term memory network. Whereas an RNN can overwrite its memory at each time step in a fairly uncontrolled fashion, an LSTM transforms its memory in a very precise way: by using specific learning mechanisms for which pieces of information to remember, which to update, and which to pay attention to. This helps it keep track of information over longer periods of time.

Mathematically

Let's describe the LSTM additions mathematically.

At time $t$, we receive a new input $x_t$. We also have our long-term and working memories passed on from the previous time step, $ltm_{t-1}$ and $wm_{t-1}$ (both n-length vectors), which we want to update.

We'll start with our long-term memory. First, we need to know which pieces of long-term memory to continue remembering and which to discard, so we want to use the new input and our working memory to learn a remember gate of n numbers between 0 and 1, each of which determines how much of a long-term memory element to keep. (A 1 means to keep it, a 0 means to forget it entirely.)

Naturally, we can use a small neural network to learn this remember gate:

$$remember_t = \sigma(W_r x_t + U_r wm_{t-1}) $$

(Notice the similarity to our previous network equations; this is just a shallow neural network. Also, we use a sigmoid activation because we need numbers between 0 and 1.)

Next, we need to compute the information we can learn from $x_t$, i.e., a candidate addition to our long-term memory:

$$ ltm'_t = \phi(W_l x_t + U_l wm_{t-1}) $$

$\phi$ is an activation function, commonly chosen to be $tanh$.

Before we add the candidate into our memory, though, we want to learn which parts of it are actually worth using and saving:

$$save_t = \sigma(W_s x_t + U_s wm_{t-1})$$

(Think of what happens when you read something on the web. While a news article might contain information about Hillary, you should ignore it if the source is Breitbart.)

Let's now combine all these steps. After forgetting memories we don't think we'll ever need again and saving useful pieces of incoming information, we have our updated long-term memory:

$$ltm_t = remember_t \circ ltm_{t-1} + save_t \circ ltm'_t$$

where $\circ$ denotes element-wise multiplication.

Next, let's update our working memory. We want to learn how to focus our long-term memory into information that will be immediately useful. (Put differently, we want to learn what to move from an external hard drive onto our working laptop.) So we learn a focus/attention vector:

$$focus_t = \sigma(W_f x_t + U_f wm_{t-1})$$

Our working memory is then

$$wm_t = focus_t \circ \phi(ltm_t)$$

In other words, we pay full attention to elements where the focus is 1, and ignore elements where the focus is 0.

And we're done! Hopefully this made it into your long-term memory as well.

To summarize, whereas a vanilla RNN uses one equation to update its hidden state/memory:

$$h_t = \phi(Wx_t + Uh_{t-1})$$

An LSTM uses several:

$$ltm_t = remember_t \circ ltm_{t-1} + save_t \circ ltm'_t$$

$$wm_t = focus_t \circ tanh(ltm_t)$$

where each memory/attention sub-mechanism is just a mini brain of its own:

$$remember_t = \sigma(W_r x_t+ U_r wm_{t-1}) $$

$$save_t = \sigma(W_s x_t + U_s wm_{t-1})$$

$$focus_t = \sigma(W_f x_t + U_f wm_{t-1})$$

$$ ltm'_t = tanh(W_l x_t + U_l wm_{t-1}) $$

(Note: the terminology and variable names I've been using are different from the usual literature. Here are the standard names, which I'll use interchangeably from now on:

The long-term memory, $ltm_t$, is usually called the cell state, denoted $c_t$.
The working memory, $wm_t$, is usually called the hidden state, denoted $h_t$. This is analogous to the hidden state in vanilla RNNs.
The remember vector, $remember_t$, is usually called the forget gate (despite the fact that a 1 in the forget gate still means to keep the memory and a 0 still means to forget it), denoted $f_t$.
The save vector, $save_t$, is usually called the input gate (as it determines how much of the input to let into the cell state), denoted $i_t$.
The focus vector, $focus_t$, is usually called the output gate, denoted $o_t$. )

Snorlax

I could have caught a hundred Pidgeys in the time it took me to write this post, so here's a cartoon.

Neural Networks

Recurrent Neural Networks

LSTMs

Learning to Code

Let's look at a few examples of what an LSTM can do. Following Andrej Karpathy's terrific post, I'll use character-level LSTM models that are fed sequences of characters and trained to predict the next character in the sequence.

While this may seem a bit toyish, character-level models can actually be very useful, even on top of word models. For example:

Imagine a code autocompleter smart enough to allow you to program on your phone. An LSTM could (in theory) track the return type of the method you're currently in, and better suggest which variable to return; it could also know without compiling whether you've made a bug by returning the wrong type.
NLP applications like machine translation often have trouble dealing with rare terms. How do you translate a word you've never seen before, or convert adjectives to adverbs? Even if you know what a tweet means, how do you generate a new hashtag to capture it? Character models can daydream new terms, so this is another area with interesting applications.

So to start, I spun up an EC2 p2.xlarge spot instance, and trained a 3-layer LSTM on the Apache Commons Lang codebase. Here's a program it generates after a few hours.

While the code certainly isn't perfect, it's better than a lot of data scientists I know. And we can see that the LSTM has learned a lot of interesting (and correct!) coding behavior:

It knows how to structure classes: a license up top, followed by packages and imports, followed by comments and a class definition, followed by variables and methods. Similarly, it knows how to create methods: comments follow the correct orders (description, then @param, then @return, etc.), decorators are properly placed, and non-void methods end with appropriate return statements. Crucially, this behavior spans long ranges of code – see how giant the blocks are!
It can also track subroutines and nesting levels: indentation is always correct, and if statements and for loops are always closed out.
It even knows how to create tests.

How does the model do this? Let's look at a few of the hidden states.

Here's a neuron that seems to track the code's outer level of indentation:

(As the LSTM moves through the sequence, its neurons fire at varying intensities. The picture represents one particular neuron, where each row is a sequence and characters are color-coded according to the neuron's intensity; dark blue shades indicate large, positive activations, and dark red shades indicate very negative activations.)

And here's a neuron that counts down the spaces between tabs:

For kicks, here's the output of a different 3-layer LSTM trained on TensorFlow's codebase:

There are plenty of other fun examples floating around the web, so check them out if you want to see more.

Investigating LSTM Internals

Let's dig a little deeper. We looked in the last section at examples of hidden states, but I wanted to play with LSTM cell states and their other memory mechanisms too. Do they fire when we expect, or are there surprising patterns?

Counting

To investigate, let's start by teaching an LSTM to count. (Remember how the Java and Python LSTMs were able to generate proper indentation!) So I generated sequences of the form

aaaaaXbbbbb

(N "a" characters, followed by a delimiter X, followed by N "b" characters, where 1 <= N <= 10), and trained a single-layer LSTM with 10 hidden neurons.

As expected, the LSTM learns perfectly within its training range – and can even generalize a few steps beyond it. (Although it starts to fail once we try to get it to count to 19.)

aaaaaaaaaaaaaaaXbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaXbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaXbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaXbbbbbbbbbbbbbbbbbb
aaaaaaaaaaaaaaaaaaaXbbbbbbbbbbbbbbbbbb # Here it begins to fail: the model is given 19 "a"s, but outputs only 18 "b"s.

We expect to find a hidden state neuron that counts the number of a's if we look at its internals. And we do:

I built a small web app to play around with LSTMs, and Neuron #2 seems to be counting both the number of a's it's seen, as well as the number of b's. (Remember that cells are shaded according to the neuron's activation, from dark red [-1] to dark blue [+1].)

What about the cell state? It behaves similarly:

One interesting thing is that the working memory looks like a "sharpened" version of the long-term memory. Does this hold true in general?

It does. (This is exactly as we would expect, since the long-term memory gets squashed by the tanh activation function and the output gate limits what gets passed on.) For example, here is an overview of all 10 cell state nodes at once. We see plenty of light-colored cells, representing values close to 0.

In contrast, the 10 working memory neurons look much more focused. Neurons 1, 3, 5, and 7 are even zeroed out entirely over the first half of the sequence.

Let's go back to Neuron #2. Here are the candidate memory and input gate. They're relatively constant over each half of the sequence – as if the neuron is calculating a += 1 or b += 1 at each step.

Finally, here's an overview of all of Neuron 2's internals:

If you want to investigate the different counting neurons yourself, you can play around with the visualizer here.

(Note: this is far from the only way an LSTM can learn to count, and I'm anthropomorphizing quite a bit here. But I think viewing the network's behavior is interesting and can help build better models – after all, many of the ideas in neural networks come from analogies to the human brain, and if we see unexpected behavior, we may be able to design more efficient learning mechanisms.)

Count von Count

Let's look at a slightly more complicated counter. This time, I generated sequences of the form

aaXaXaaYbbbbb

(N a's with X's randomly sprinkled in, followed by a delimiter Y, followed by N b's). The LSTM still has to count the number of a's, but this time needs to ignore the X's as well.

Here's the full LSTM. We expect to see a counting neuron, but one where the input gate is zero whenever it sees an X. And we do!

Above is the cell state of Neuron 20. It increases until it hits the delimiter Y, and then decreases to the end of the sequence – just like it's calculating a num_bs_left_to_print variable that increments on a's and decrements on b's.

If we look at its input gate, it is indeed ignoring the X's:

Interestingly, though, the candidate memory fully activates on the irrelevant X's – which shows why the input gate is needed. (Although, if the input gate weren't part of the architecture, presumably the network would have presumably learned to ignore the X's some other way, at least for this simple example.)

Let's also look at Neuron 10.

This neuron is interesting as it only activates when reading the delimiter "Y" – and yet it still manages to encode the number of a's seen so far in the sequence. (It may be hard to tell from the picture, but when reading Y's belonging to sequences with the same number of a's, all the cell states have values either identical or within 0.1% of each other. You can see that Y's with fewer a's are lighter than those with more.) Perhaps some other neuron sees Neuron 10 slacking and helps a buddy out.

Remembering State

Next, I wanted to look at how LSTMs remember state. I generated sequences of the form

AxxxxxxYa
BxxxxxxYb

(i.e., an "A" or B", followed by 1-10 x's, then a delimiter "Y", ending with a lowercase version of the initial character). This way the network needs to remember whether it's in an "A" or "B" state.

We expect to find a neuron that fires when remembering that the sequence started with an "A", and another neuron that fires when remembering that it started with a "B". We do.

For example, here is an "A" neuron that activates when it reads an "A", and remembers until it needs to generate the final character. Notice that the input gate ignores all the "x" characters in between.

Here is its "B" counterpart:

One interesting point is that even though knowledge of the A vs. B state isn't needed until the network reads the "Y" delimiter, the hidden state fires throughout all the intermediate inputs anyways. This seems a bit "inefficient", but perhaps it's because the neurons are doing a bit of double-duty in counting the number of x's as well.

Copy Task

Finally, let's look at how an LSTM learns to copy information. (Recall that our Java LSTM was able to memorize and copy an Apache license.)

(Note: if you think about how LSTMs work, remembering lots of individual, detailed pieces of information isn't something they're very good at. For example, you may have noticed that one major flaw of the LSTM-generated code was that it often made use of undefined variables – the LSTMs couldn't remember which variables were in scope. This isn't surprising, since it's hard to use single cells to efficiently encode multi-valued information like characters, and LSTMs don't have a natural mechanism to chain adjacent memories to form words. Memory networks and neural Turing machines are two extensions to neural networks that help fix this, by augmenting with external memory components. So while copying isn't something LSTMs do very efficiently, it's fun to see how they try anyways.)

For this copy task, I trained a tiny 2-layer LSTM on sequences of the form

baaXbaa
abcXabc

(i.e., a 3-character subsequence composed of a's, b's, and c's, followed by a delimiter "X", followed by the same subsequence).

I wasn't sure what "copy neurons" would look like, so in order to find neurons that were memorizing parts of the initial subsequence, I looked at their hidden states when reading the delimiter X. Since the network needs to encode the initial subsequence, its states should exhibit different patterns depending on what they're learning.

The graph below, for example, plots Neuron 5's hidden state when reading the "X" delimiter. The neuron is clearly able to distinguish sequences beginning with a "c" from those that don't.

For another example, here is Neuron 20's hidden state when reading the "X". It looks like it picks out sequences beginning with a "b".

Interestingly, if we look at Neuron 20's cell state, it almost seems to capture the entire 3-character subsequence by itself (no small feat given its one-dimensionality!):

Here are Neuron 20's cell and hidden states, across the entire sequence. Notice that its hidden state is turned off over the entire initial subsequence (perhaps expected, since its memory only needs to be passively kept at that point).

However, if we look more closely, the neuron actually seems to be firing whenever the next character is a "b". So rather than being a "the sequence started with a b" neuron, it appears to be a "the next character is a b" neuron.

As far as I can tell, this pattern holds across the network – all the neurons seem to be predicting the next character, rather than memorizing characters at specific positions. For example, Neuron 5 seems to be a "next character is a c" predictor.

I'm not sure if this is the default kind of behavior LSTMs learn when copying information, or what other copying mechanisms are available as well.

States and Gates

To really hone in and understand the purpose of the different states and gates in an LSTM, let's repeat the previous section with a small pivot.

Cell State and Hidden State (Memories)

We originally described the cell state as a long-term memory, and the hidden state as a way to pull out and focus these memories when needed.

So when a memory is currently irrelevant, we expect the hidden state to turn off – and that's exactly what happens for this sequence copying neuron.

Forget Gate

The forget gate discards information from the cell state (0 means to completely forget, 1 means to completely remember), so we expect it to fully activate when it needs to remember something exactly, and to turn off when information is never going to be needed again.

That's what we see with this "A" memorizing neuron: the forget gate fires hard to remember that it's in an "A" state while it passes through the x's, and turns off once it's ready to generate the final "a".

Input Gate (Save Gate)

We described the job of the input gate (what I originally called the save gate) as deciding whether or not to save information from a new input. Thus, it should turn off at useless information.

And that's what this selective counting neuron does: it counts the a's and b's, but ignores the irrelevant x's.

What's amazing is that nowhere in our LSTM equations did we specify that this is how the input (save), forget (remember), and output (focus) gates should work. The network just learned what's best.

Extensions

Now let's recap how you could have discovered LSTMs by yourself.

First, many of the problems we'd like to solve are sequential or temporal of some sort, so we should incorporate past learnings into our models. But we already know that the hidden layers of neural networks encode useful information, so why not use these hidden layers as the memories we pass from one time step to the next? And so we get RNNs.

But we know from our own behavior that we don't keep track of knowledge willy-nilly; when we read a new article about politics, we don't immediately believe whatever it tells us and incorporate it into our beliefs of the world. We selectively decide what information to save, what information to discard, and what pieces of information to use to make decisions the next time we read the news. Thus, we want to learn how to gather, update, and apply information – and why not learn these things through their own mini neural networks? And so we get LSTMs.

And now that we've gone through this process, we can come up with our own modifications.

For example, maybe you think it's silly for LSTMs to distinguish between long-term and working memories – why not have one? Or maybe you find separate remember gates and save gates kind of redundant – anything we forget should be replaced by new information, and vice-versa. And now you've come up with one popular LSTM variant, the GRU.
Or maybe you think that when deciding what information to remember, save, and focus on, we shouldn't rely on our working memory alone – why not use our long-term memory as well? And now you've discovered Peephole LSTMs.

Making Neural Nets Great Again

Let's look at one final example, using a 2-layer LSTM trained on Trump's tweets. Despite the ~~tiny~~ big dataset, it's enough to learn a lot of patterns.

For example, here's a neuron that tracks its position within hashtags, URLs, and @mentions:

Here's a proper noun detector (note that it's not simply firing at capitalized words):

Here's an auxiliary verb + "to be" detector ("will be", "I've always been", "has never been"):

Here's a quote attributor:

There's even a MAGA and capitalization neuron:

And here are some of the proclamations the LSTM generates (okay, one of these is a real tweet):

Unfortunately, the LSTM merely learned to ramble like a madman.

Recap

That's it. To summarize, here's what you've learned:

Here's what you should save:

And now it's time for that donut.

Thanks to Chen Liang for some of the TensorFlow code I used, Ben Hamner and Kaggle for the Trump dataset, and, of course, Schmidhuber and Hochreiter for their original paper. If you want to explore the LSTMs yourself, feel free to play around!

Moving Beyond CTR: Better Recommendations Through Human Evaluation

2014-10-07T00:00:00+02:00

Imagine you're building a recommendation algorithm for your new online site. How do you measure its quality, to make sure that it's sending users relevant and personalized content? Click-through rate may be your initial hope…but after a bit of thought, it's not clear that it's the best metric after all.

Take Google's search engine. In many cases, improving the quality of search results will decrease CTR! For example, the ideal scenario for queries like When was Barack Obama born? is that users never have to click, since the question should be answered on the page itself.

Or take Twitter, who one day might want to recommend you interesting tweets. Metrics like CTR, or even number of favorites and retweets, will probably optimize for showing quick one-liners and pictures of funny cats. But is a Reddit-like site what Twitter really wants to be? Twitter, for many people, started out as a news site, so users may prefer seeing links to deeper and more interesting content, even if they're less likely to click on each individual suggestion overall.

Or take eBay, who wants to help you find the products you want to buy. Is CTR a good measure? Perhaps not: more clicks may be an indication that you're having trouble finding what you're looking for. What about revenue? That might not be ideal either: from the user perspective, you want to make your purchases at the lowest possible price, and by optimizing for revenue, eBay may be turning you towards more expensive products that make you a less frequent customer in the long run.

And so on.

So on many online sites, it's unclear how to measure the quality of personalization and recommendations using metrics like CTR, or revenue, or dwell time, or whatever. What's an engineer to do?

Well, consider the fact that many of these are relevance algorithms. Google wants to show you relevant search results. Twitter wants to show you relevant tweets and ads. Netflix wants to recommend you relevant movies. LinkedIn wants to find you relevant people to follow. So why, so often, do we never try to measure the relevance of our models?

I'm a big fan of man-in-the-machine techniques, so to get around this problem, I'm going to talk about a human evaluation approach to measuring the performance of personalization and discovery products. In particular, I'll use the example of related book suggestions on Amazon as I walk through the rest of this post.

Amazon, and Moving Beyond Log-Based Metrics

(Let's continue motivating a bit why log-based metrics are often imperfect measures of relevance and quality, as this is an important but difficult-to-understand point.)

So take Amazon's Customers Who Bought This Item Also Bought feature, which tries to show you related books.

To measure its effectiveness, the standard approach is to run a live experiment and measure the change in metrics like revenue or CTR.

But imagine we suddenly replace all of Amazon's book suggestions with pornography. What's going to happen? CTR's likely to shoot through the roof!

Or suppose that we replace all of Amazon's related books with shinier, more expensive items. Again, CTR and revenue are likely to increase, as the flashier content draws eyeballs. But is this anything but a short-term boost? Perhaps the change decreases total sales in the long run, as customers start to find Amazon too expensive for their tastes and move to other marketplaces.

Scenarios like these are the machine learning analogue of turning ads into blinking marquees. While they might increase clicks and views initially, they're probably not optimizing user happiness or site quality for the future. So how can we avoid them, and ensure that the quality of our suggestions remains consistently high? This is a related books algorithm, after all – so why, by sticking to a live experiment and metrics like CTR, are we nowhere inspecting the relatedness of our recommendations?

Human Evaluation

Solution: let's inject humans into the process. Computers can't measure relatedness (if they could, we'd be done), but people of course can.

For example, in the screenshot below, I asked a worker (on a crowdsourcing platform I've built on my own) to rate the first three Customers Who Bought This Also Bought suggestions for a book on barnesandnoble.com.

(Copying the text from the screenshot:

Fool's Assassin, by Robin Hobb. (original book) "I only just read this recently, but it's one of my favorites of this year's releases. This book was so beautifully written. I've always been a fan of the Fantasy genre, but often times it's all stylistically similar. Hobb has a wonderful way with characters and narration. The story was absolutely heartbreaking and lead me wanting another book."
The Broken Eye (Lightbringer Series #3), by Brent Weeks. (first related book) "It's a good, but not great, recommendation. This book sounds interesting enough. It's third in a series though, so I'd have to see how I like the first couple of books first."
Dust and Light: A Sanctuary Novel, by Carol Berg. (second related book) "It's an okay recommendation. I'm not so familiar with this author, but I like some of the premise of the book. I know the saying 'Don't judge a book by its cover...' but the cover art doesn't really appeal to me and kind of turns me off of the book."
Tower Lord, by Anthony Ryan. (third related book) "It's a good, but not great, recommendation. Another completely unfamiliar author to me (kind of like this though, it shows me new places to look for books!) This is also a sequel, though, so I'd have to see how I liked the first book before purchasing this one.")

The recommendations are decent, but already we see a couple ways to improve them:

First, two of the recommendations (The Broken Eye and Tower Lord) are each part of a series, but not Book #1. So one improvement would be to ensure that only series introductions get displayed unless they're followups to the main book.
Book covers matter! Indeed, the second suggestion looks more like a fantasy romance novel than the kind of fantasy that Robin Hobb tends to write. (So perhaps B&N should invest in some deep learning...)

CTR and revenue certainly wouldn't give us this level of information, and it's not clear that they could even tell us our algorithms are producing irrelevant suggestions in the first place. Nowhere does the related scroll panel make it clear that two of the books are part of a series, so the CTR on those two books would be just as high as if they were indeed the series introductions. And if revenue is low, it's not clear whether it's because the suggestions are bad or because, separately, our pricing algorithms need improvement.

So in general, here's one way to understand the quality of a Do This Next algorithm:

Take a bunch of items (e.g., books if we're Amazon or Barnes & Noble), and generate their related brethren.
Send these pairs off to a bunch of judges (say, by using a crowdsourcing platform like Hybrid), and ask them to rate their relevance.
Analyze the data that comes back.

Algorithmic Relevance on Amazon, Barnes & Noble, and Google

Let's make this process concrete. Pretend I'm a newly minted VP of What Customers Also Buy at Amazon, and I want to understand the product's flaws and stars.

I started by asking a couple hundred of my own human workers to take a book they enjoyed in the past year, and to find it on Amazon. They'd then take the first three related book suggestions from a different author, rate them on the following scale, and explain their ratings.

Great suggestion. I'd definitely buy it. (Very positive)
Decent suggestion. I might buy it. (Mildly positive)
Not a great suggestion. I probably wouldn't buy it. (Mildly negative)
Terrible suggestion. I definitely wouldn't buy it. (Very negative)

(Note: I usually prefer a three-point or five-point Likert scale with a "neutral" option, but I was feeling a little wild.)

For example, here's how a rater reviewed the related books for Anne Rice's The Wolves of Midwinter.

So how good are Amazon's recommendations? Quite good, in fact: 47% of raters said they'd definitely buy the first related book, another 29% said it was good enough that they might buy it, and only 24% of raters disliked the suggestion.

The second and third book suggestions, while a bit worse, seem to perform pretty well too: around 65% of raters rated them positive.

What can we learn from the bad ratings? I ran a follow-up task that asked workers to categorize the bad related books, and here's the breakdown.

Related but different-subtopic. These were book suggestions that were generally related to the original book, but that were in a different sub-topic that the rater wasn't interested in. For example, the first related book for Sun Tzu's The Art of War (a book nominally about war, but which nowadays has become more of a life hack book) was On War (a war treatise), but the rater wasn't actually interested in the military aspects: "I would not buy this book, because it only focuses on military war. I am not interested in that. I am interested in mental tactics that will help me prosper in life."
Completely unrelated. These were book suggestions that were completely unrelated to the original book. For example, a Scrabble dictionary appearing on The Art of War's page.
Uninteresting. These were suggestions that were related, but whose storylines didn't appeal to the rater. "The storyline doesn't seem that exciting. I am not a dog fan and it's about a dog."
Wrong audience. These were book suggestions whose target audiences were quite different from the original book's audiences. In many cases, for example, a related book suggestion would be a children's book, but the original book would be geared towards adults. "This seems to be a children's book. If I had a child I would definitely buy this; alas I do not, so I have no need for it."
Wrong book type. Suggestions in this category were items like textbooks or appearing alongside novels.
Disliked author. These recommendations were by similar authors, but one that the rater disliked. "I do not like Amber Portwood. I would definitely not want to read a book by and about her."
Not first in series. Some book recommendations would be for an interesting series the rater wasn't familiar with, but they wouldn't be the first book in the series.
Bad rating. These were book suggestions that had a particularly low Amazon rating.

So to improve their recommendations, Amazon could try improving its topic models, add age-based features to its books, distinguish between textbooks and novels, and invest in series detectors. (Of course, for all I know, they do all this already.)

Competitive Analysis

We now have a general grasp of Amazon's related book suggestions and how they could be improved, and just like we could quote a metric like a CTR of 6.2% or whatnot, we can also now quote a relevance score of 0.62 (or whatever). So let's turn to the question of how Amazon compares to other online booksellers like Barnes & Noble and Google Play.

I took the same task I used above, but this time asked raters to review the related suggestions on those two sites as well.

In short,

Barnes & Nobles's algorithm are almost as good as Amazon's: the first three suggestions were rated positive 58% of the time, compared to 68% on Amazon.
But Play Store recommendations are atrocious : a whopping 51% of Google's related book recommendations were marked terrible.

Why are the Play Store's suggestions so bad? Let's look at a couple examples.

Here's the Play Store page for John Green's The Fault in Our Stars, a critics-loved-it book about cancer and romance (and now also a movie).

Two of the suggestions are completely random: a poorly-rated Excel manual and a poorly-reviewed textbook on sexual health. The others are completely unrelated cowboy books, by a different John Green.

Here's the page for The Strain. In this case, all the suggestions are in a different language! And there are only four of them.

Once again asking raters to categorize all of the Play Store's bad recommendations...

45% of the time, the related book suggestions were completely unrelated to the original book in question. For example: displaying a physics textbook on the page for a romance novel.
32% of the time, there simply wasn't a book suggestion at all. (I'm guessing the Play Bookstore's catalog is pretty limited.)
14% of the time, the related books were in a different language.

So despite Google's state-of-the-art machine learning elsewhere, its Play Store suggestions couldn't really get much worse.

Side-by-Sides

Let's step back a bit. So far I've been focusing on an absolute judgments paradigm, in which judges rate how relevant a book is to the original on an absolute scale. This model is great for understanding the overall quality of Amazon's related book algorithm.

In many cases, though, we want to use human evaluation to compare experiments. For example, it's common at many companies to:

Launch a "human evaluation A/B test" before a live experiment, both to avoid accidentally sending out incredibly buggy experiments to users, as well as to avoid the long wait required in live tests.
Use a human-generated relevance score as a supplement to live experiment metrics when making launch decisions.

For these kinds of tasks, what's preferable is often a side-by-side model, wherein judges are given two items and asked which one is better. After all, comparative judgments are often much easier to make than absolute ones, and we might want to detect differences at a finer level than what's available on an absolute scale.

The idea is that we can assign a score to each rating (negative, say, if the rater prefers the control item; positive if the rater prefers the experiment), and we aggregate these to form an overall score for the side-by-side. Then in much the same way that drops in CTR may block an experiment launch, a negative human evaluation score should also give much cause for concern.

Unfortunately, I don't have an easy way to generate data for a side-by-side (though I could perform a side-by-side on Amazon vs. Barnes & Noble), so I'll omit an example, but the idea should be pretty clear.

Personalization

Here's another subtlety. In my examples above, I asked raters to pick a starting book themselves (one that they read and loved in the past year), and then rate whether they personally would want to read the related suggestions.

Another approach is to pick the starting books for the raters, and then have the rate the related suggestions more objectively, by trying to put themselves in the shoes of someone who'd be interested in the starting item.

Which approach is better? As you can probably guess, there's no clear answer – it depends on the task and goals at hand.

Pros of the first approach:

It's much more nuanced. It can often be hard to put yourself in someone else's shoes: would someone reading Harry Potter be interested in Twilight? On the one hand, they're both fantasy books; on the other hand, Twilight seems a bit more geared towards female audiences.

Pros of the second approach:

Sometimes, objectivity is a good thing. (Should you really care if someone dislikes Twilight simply because Edward reminds her of her boyfriend?)
Allowing people to choose their starting items may bias certain metrics. For instance, people are much more likely to choose popular books to rate, whereas we might want to measure the quality of Amazon's suggestions across a broader, more representative slice of its catalog.

Recap

Let's review what I've discussed so far.

Online, log-based metrics like CTR and revenue aren't necessarily the best measures of a discovery algorithm's quality. Items with high CTR, for example, may simply be racy and flashy, not the most relevant.
So instead of relying on these proxies, let's directly measure the relevance of our recommendations by asking a pool of human raters.
There are a couple different approaches for sending which items to be judged. We can let the raters choose the items (an approach which is often necessary for personalized algorithms), or we can generate the items ourselves (often useful for more objective tasks like search; this also has the benefit of making it easier to derive popularity-weighted or uniformly-weighted metrics of relevance). We can also take an absolute judgment approach, or use side-by-sides.
By analyzing the data from our evaluations, we can make better launch decisions, discover examples where our algorithms do very well or very poorly, and find patterns for improvement.

What are some of the benefits and applications?

As mentioned, log-based metrics like CTR and revenue don't always capture the signals we want, so human-generated scores of relevance (or other dimensions) are useful complements.
Human evaluations can make iteration quicker and easier. A algorithm change might normally require weeks of live testing before we gather enough data to know how it affects users, but we can easily have a task judged by humans in a couple hours.
Imagine we're an advertising company, and we choose which ads to show based on a combination of CTR and revenue. Once we gather hundreds of thousands of relevance judgments from our evaluations, we can build a relevance-based machine learning model to add to the mix, thereby injecting a more direct measure of quality into the system.
How can we decide what improvements need to be made to our models? Evaluations give us very concrete feedback and specific examples about what's working and what's wrong.

In the spirit of item-item similarities, here are some other posts readers of this post might also want to read.

And finally, I'll end with a call for information. Do you run any evaluations, or use crowdsourcing platforms like Mechanical Turk or Crowdflower (whether for evaluation purposes or not)? Or do you want to? I'd love to talk to you to learn more about what you're doing, so please feel free to send me an email and hit me up!

Propensity Modeling, Causal Inference, and Discovering Drivers of Growth

2014-08-15T00:00:00+02:00

Imagine you just started a job at a new company. You watched World War Z recently, so you're in a skeptical mood, and given that your last two startups failed from what you believe to be a lack of data, you're giving everything an extra critical eye.

You start by thinking about the impact of the sales team. How much extra revenue are they generating for the company? The sales folks you've met say that over 90% of the leads they've talked to end up buying the company's product – but, you wonder, how many of those leads would have converted anyways?

You take a look at the logs, and notice something interesting: last week was hack week, and half the salesforce took time off from making calls to make Marauder's Maps instead, yet the rate of converted leads remained the same...*

Suddenly, one of your teammates drops by your desk. He's making a batch of Soylent, and he wants you to take a sip. It looks nasty, so you ask what the benefits are, and he responds that his friends who've been drinking it for the past few months just ran a marathon! Oh, did they just start running? Nope, they ran the marathon last year too!...

*Inspired by a true story.

Causal Inference

Causality is incredibly important, yet often extremely difficult to establish.

Do patients who self-select into taking a new drug get better because the drug works, or would they have gotten better anyways? Is your salesforce actually effective, or are they simply talking to the customers who already plan to convert? Is Soylent (or your company's million-dollar ad campaign) truly worth your time?

In an ideal world, we'd be able to run experiments – the gold standard for measuring causality – whenever we wish. In the real world, however, we can't. There are ethical qualms with giving certain patients placebos, or dangerous and untested drugs. Management may be unwilling to take a potential short-term revenue hit by assigning sales to random customers, and a team earning commission-based bonuses may rebel against the thought as well.

How can we understand causal lifts in the absence of an A/B test? This is where propensity modeling, or other techniques of causal inference, comes into play.

Propensity Modeling

So suppose we want to model the effect of drinking Soylent using a propensity model technique. To explain the idea, let's start with a thought experiment.

Imagine Brad Pitt has a twin brother, indistinguishable in every way: Brad 1 and Brad 2 wake up at the same time, they eat the same foods, they exercise the same amount, and so on. One day, Brad 1 happens to receive the last batch of Soylent from a marketer on the street, while Brad 2 does not, and so only Brad 1 begins to incorporate Soylent into his diet. In this scenario, any subsequent difference in behavior between the twins is precisely the drink's effect.

Taking this scenario into the real world, one way to estimate the Soylent's effect on health would be as follows:

For every Soylent drinker, find a Soylent abstainer who's as close a match as possible. For example, we might match a Soylent-drinking Jay-Z with a non-Soylent Kanye, a Soylent-drinking Natalie Portman with a non-Soylent Keira Knightley, and a Soylent-drinking JK Rowling with a non-Soylent Stephenie Meyer.
We measure Soylent's effect as the difference between each twin pair.

However, finding closely matching twins is extremely difficult in practice. Is Jay-Z really a close match with Kanye, if Jay-Z sleeps one hour more on average? What about the Jonas Brothers and One Direction?

Propensity modeling, then, is a simplification of this twin matching procedure. Instead of matching pairs of people based on all the variables we have, we simply match all users based on a single number, the likelihood ("propensity") that they'll start to drink Soylent.

In more detail, here's how to build a propensity model.

First, select which variables to use as features. (e.g., what foods people eat, when they sleep, where they live, etc.)
Next, build a probabilistic model (say, a logistic regression) based on these variables to predict whether a user will start drinking Soylent or not. For example, our training set might consist of a set of people, some of whom ordered Soylent in the first week of March 2014, and we would train the classifier to model which users become the Soylent users.
The model's probabilistic estimate that a user will start drinking Soylent is called a propensity score.
Form some number of buckets, say 10 buckets in total (one bucket covers users with a 0.0 - 0.1 propensity to take the drink, a second bucket covers users with a 0.1 - 0.2 propensity, and so on), and place people into each one.
Finally, compare the drinkers and non-drinkers within each bucket (say, by measuring their subsequent physical activity, weight, or whatever measure of health) to estimate Soylent's causal effect.

For example, here's a hypothetical distribution of Soylent and non-Soylent ages. We see that drinkers tend to be quite a bit older, and this confounding fact is one reason we can't simply run a correlational analysis.

After training a model to estimate Soylent propensity and group users into propensity buckets, this might be a graph of the effect that Soylent has on a person's weekly running mileage.

In the above (hypothetical) graph, each row represents a propensity bucket of people, and the exposure week denotes the first week of March, when the treatment group received their Soylent shipment. We see that prior to that week, both groups of people track quite well. After the treatment group (the Soylent drinkers) start their plan, however, their weekly running mileage ramps up, which forms our estimate of the drink's causal effect.

Other Methods of Causal Inference

Of course, there are many other methods of causal inference on observational data. I'll run through two of my favorites. (I originally wrote this post in response to a question on Quora, which is why I take my examples from there.)

Regression Discontinuity

Quora recently started displaying badges of status on the profiles of its Top Writers, so suppose we want to understand the effect of this feature. (Assume that it's impossible to run an A/B test now that the feature has been launched.) Specifically, does the badge itself cause users to gain more followers?

For simplicity, let's assume that the badge was given to all users who received at least 5000 upvotes in 2013. The idea behind a regression discontinuity design is that the difference between those users who just barely receive a Top Writer badge (i.e., receive 5000 upvotes) and those who just barely don't (i.e., receive 4999 upvotes) is more or less random chance, so we can use this threshold to estimate a causal effect.

For example, in the imaginary graph below, the discontinuity at 5000 upvotes suggests that a Top Writer badge leads to around 100 more followers on average.

Natural Experiments

Understanding the effect of a Top Writer badge is a fairly uninteresting question, though. (It just makes an easy example.) A deeper, more fundamental question could be to ask: what happens when a user discovers a new writer that they love? Does the writer inspire them to write some of their own content, to explore more of the same topics, and through curation lead them to engage with the site even more? How important, in other words, is the connection to a great user as opposed to the reading of random, great posts?

I studied an analogous question when I was at Google, so instead of making up an imaginary Quora case study, I'll describe some of that work here.

So let's suppose we want to understand what would happen if we were able to match users to the perfect YouTube channel. How much is the ultimate recommendation worth?

Does falling in love with a new channel lead to engagement above and beyond activity on the channel itself, perhaps because users return to YouTube specifically for the new channel and stay to watch more? (a multiplicative effect) In the TV world, for example, perhaps many people stay at home on Sunday nights specifically to catch the latest episode of Real Housewives, and channel surf for even more entertainment once it's over.
Does falling in love with a new channel simply increase activity on the channel alone? (an additive effect)
Does a new channel replace existing engagement on YouTube? After all, maybe users only have a limited amount of time they can spend on the site. (a neutral effect)
Does the perfect channel actually cause users to spend less time overall on the site, since maybe they spend less time idly browsing and channel surfing once they have concrete channels they know how to turn to? (a negative effect)

As always, an A/B test would be ideal, but it's impossible in this case to run: we can't force users to fall in love with a channel (we can recommend them channels, but there's no guarantee they'll actually like them), and we can't forcibly block them from certain channels either.

One approach is to use a natural experiment (a scenario in which the universe itself somehow generates a random-like assignment) to study this effect. Here's the idea.

Consider a user who uploads a new video every Wednesday. One month, he lets his subscribers know that he won’t be uploading any new videos for a few weeks, while he goes on vacation.

How do his subscribers respond? Do they stop watching YouTube on Wednesdays, since his channel was the sole reason for their visits? Or is their activity relatively unaffected, since they only watch his content when it appears on the front page?

Imagine, instead, that the channel starts uploading a new video every Friday. Do his subscribers start to visit then as well? And now that they're on YouTube, do they merely stay for the new video, or does their visit lead to a sinkhole of searches and related content too?

As it turns out, these scenarios do happen. For example, here's a calendar of when one popular channel uploads videos. You can see that in 2011, it tended to upload videos on Tuesdays and Fridays, but it shifted to uploads on Wednesday and Saturday at the end of the year.

By using this shift as natural experiment that "quasi-randomly" removes a well-loved channel on certain days and introduces it on others, we can try to understand the effect of successfully making the perfect recommendation.

(This is probably a somewhat convoluted example of a natural experiment. For an example that perhaps illustrates the idea more clearly, suppose we want to understand the effect of income on mental health. We can't force some people to become poor or rich, and a correlational study is clearly flawed. This NY Times article describes a natural experiment when a group of Cherokee Indians distributed casino profits to its members, thereby "randomly" lifting some of them out of poverty.

Another example, assuming there's nothing special about the period in which hack week occurs, is the use of hack week as an instrument that quasi-randomly "prevents" the sales team from doing their job, as in the scenario I described above.)

Discovering Drivers of Growth

Let's go back to propensity modeling.

Imagine that we're on our company's Growth team, and we're tasked with figuring out how to turn casual users of the site into users that return every day. What do we do?

The propensity modeling approach might be the following. We could take a list of features (installing the mobile app, logging in, signing up for a newsletter, following certain users, etc.), and build a propensity model for each one. We could then rank each feature by its estimated causal effect on engagement, and use the ordered list of features to prioritize our next sprint. (Or we could use these numbers in order to convince the exec team that we need more resources.) This is a slightly more sophisticated version of the idea of building an engagement regression model (or a churn regression model), and examining the weights on each feature.

Despite writing this post, though, I admit I'm generally not a fan of propensity modeling for many applications in the tech world. (I haven't worked in the medical field, so I don't have a strong opinion on its usefulness there, though I think it's a little more necessary there.) I'll save more of my reasons for another time, but after all, causal inference is extremely difficult, and we're never going to be able to control for all the hidden influencers that can bias a treatment. Also, the mere fact that we have to choose which features to include in our model (and remember: building features is very time-consuming and difficult) means that we already have a strong prior belief on the usefulness of each feature, whereas what we'd really like to do is to discover hidden motivations of engagement that we've never thought of.

So what can we do instead?

If we're trying to understand what drives people to become heavy users of the site, why don't we simply ask them?

In more detail, let's do the following:

First, we'll run a survey on a couple hundred of users.
In the survey, we'll ask them whether their engagement on the site has increased, decreased, or remained about the same over the past year. We'll also ask them to explain possible reasons for their change in activity, and to describe how they use the site currently. We can also ask for supplemental details, like their demographic information.
Finally, we can filter all responses for those users who heavily increased their engagement over the past year (or who heavily decreased it, if we're trying to understand churn), and analyse their responses for the reasons.

For example, here's one interesting response I got when I ran this study at YouTube.

"I have always been a big music fan, but recently I took up playing the guitar. Because of my new found passion (playing the guitar) my desire to watch concerts has increased. I started watching a whole lot of music festivals and concerts that are posted on Youtube and other music videos. I have spent a lot of time also watching guitar lessons on Youtube (from www.justinguitar.com)."

This response was representative of a general theme the survey uncovered: one big driver of engagement seemed to come from people discovering a new offline hobby, and using YouTube to increase their appreciation of it. People who wanted to start cooking at home would turn to YouTube for recipe videos, people who started playing tennis or some other sport would go to YouTube for lessons or other great shots, college students would look for channels like Khan Academy to supplement their lectures, and so on. In other words, offline activities were driving online growth, and instead of trying to figure out what kinds of online content people were interested in (which articles did they like on Facebook, who did they follow on Twitter, what did they read on Reddit), perhaps we should have been focusing on bringing their physical hobbies into the digital world.

This "offline hobby" idea certainly wouldn't have been a feature I would have thrown into any engagement model, even if only because it's a very difficult feature to create. (How do we know which videos are related to real-world behavior?) But now that we suspect it's a potentially big driver of growth ("potentially" because, of course, surveys aren't necessarily representative), it's something we can spend a lot more time studying in the logs.

End

To summarize: propensity modeling is a powerful technique for measuring causal effects in the absence of a randomized experiment.

Purely correlational analyses on top of observational studies can be very dangerous, after all. To take my favorite example: if we find that cities with more policemen tend to have more crime, does this mean that we should try to reduce the size of our police forces in order to reduce the nation's amount of crime?

For another example, here's a post by Gelman on contradictory conclusions about hormone replacement therapy in the Harvard Nurses Study.

That said, remember that (as always) a model is only as good as the data that you feed it. It's super difficult to account for all the hidden variables that might matter, and what you think might be a well-designed causal model might well in fact be missing many hidden factors. (I actually remember hearing that a propensity model on the nurses study generated a flawed conclusion, though I can't find any references to this at the moment.) So consider whether there are other approaches you can take, whether it's an easier-to-understand causal technique or even just asking your users, and even if a randomized experiment seems too difficult to run now, the effort may be well worth the trouble in the end.

Product Insights for Airbnb

2014-08-14T00:00:00+02:00

I love studying users and products, and think data science can be extremely useful in guiding product/strategy as a whole. So I thought it would be fun to depart from the usual machine learning and engineering things I write about, and do a quick study of Airbnb.

Think of this like business analysis, or strategy – from a data science point of view.

(It's in slide deck form, of course, because that's how these things roll.)

Improving Twitter Search with Real-Time Human Computation

2013-01-08T04:15:00+01:00

(This is a post from the Twitter Engineering Blog that I wrote with Alpa Jain.)

One of the magical things about Twitter is that it opens a window to the world in real-time. An event happens, and just seconds later, it’s shared for people across the planet to see.

Consider, for example, what happened when Flight 1549 crashed in the Hudson.

http://twitpic.com/135xa - There’s a plane in the Hudson. I’m on the ferry going to pick up the people. Crazy.
— Janis Krums (@jkrums) January 15, 2009

When Osama bin Laden was killed.

Helicopter hovering above Abbottabad at 1AM (is a rare event).
— Sohaib Athar (@ReallyVirtual) May 1, 2011

Or when Mitt Romney mentioned binders during the presidential debates.

Boy, I’m full of women! #debates
— Romney’s Binder (@RomneysBinder) October 17, 2012

When each of these events happened, people instantly came to Twitter – and, in particular, Twitter search – to discover what was happening.

From a search and advertising perspective, however, these sudden events pose several challenges:

The queries people perform have never before been seen, so it’s impossible to know beforehand what they mean. How would you know that #bindersfullofwomen refers to politics, and not office accessories, or that people searching for “Horses and Bayonets” are interested in the debates?
Since these spikes in search queries are so short-lived, there’s only a short window of opportunity to learn what they mean.

So an event happens, people instantly come to Twitter to search for the event, and we need to teach our systems what these queries mean as quickly as we can, because in just a few hours those searches will be gone.

How do we do this? We’ll describe a novel real-time human computation engine we built that allows us to find search queries as soon as they’re trending, send these queries to real humans to be judged, and finally incorporate these human annotations into our backend models.

Overview

Before we dive into the details, here’s an overview of how the system works.

(1) First, we monitor for which search queries are currently popular.

Behind the scenes: we run a Storm topology that tracks statistics on search queries.

For example: the query “Big Bird” may be averaging zero searches a day, but at 6pm on October 3, we suddenly see a spike in searches from the US.

(2) Next, as soon as we discover a new popular search query, we send it to our human evaluation systems, where judges are asked a variety of questions about the query.

Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to my team's human computation platform, and then polls for a response.

For example: as soon as we notice “Big Bird” spiking, we may ask human judges to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant tweets and ads.

Finally, after a response from a judge is received, we push the information to our backend systems, so that the next time a user searches for a query, our machine learning models will make use of the additional information. For example, suppose our human judges tell us that “Big Bird” is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.

Let’s now explore the first two sections above in more detail.

Monitoring for popular queries

Storm is a distributed system for real-time computation. In contrast to batch systems like Hadoop, which often introduce delays of hours or more, Storm allows us to run online data processing algorithms to discover search spikes as soon as they happen.

In brief, running a job on Storm involves creating a Storm topology that describes the processing steps that must occur, and deploying this topology to a Storm cluster. A topology itself consists of three things:

Tuple streams of data. In our case, these may be tuples of (search query, timestamp).
Spouts that produce these tuple streams. In our case, we attach spouts to our search logs, which get written to every time a search occurs.
Bolts that process tuple streams. In our case, we use bolts for operations like updating total query counts, filtering out non-English queries, and checking whether an ad is currently being served up for the query.

Here’s a step-by-step walkthrough of how our popular query topology works:

Whenever you perform a search on Twitter, the search request gets logged to a Kafka queue.
The Storm topology attaches a spout to this Kafka queue, and the spout emits a tuple containing the query and other metadata (e.g., the time the query was issued and its location) to a bolt for processing.
This bolt updates the count of the number of times we’ve seen this query, checks whether the query is “currently popular” (using various statistics like time-decayed counts, the geographic distribution of the query, and the last time this query was sent for annotations), and dispatches it to our human computation pipeline if so.

One interesting feature of our popularity algorithm is that we often rejudge queries that have been annotated before, since the intent of a search can change. For example, perhaps people normally search for “Clint Eastwood” because they’re interested in his movies, but during the Republican National Convention users may have wanted to see tweets that were more political in nature.

Human evaluation of popular search queries

At Twitter, we use human computation for a variety of tasks. (See also Clockwork Raven, an open-source crowdsourcing platform we built that makes launching tasks easier.) For example, we often run experiments to measure ad relevance and search quality, we use it to gather data to train and evaluate our machine learning models, and in this section we’ll describe how we use it to boost our understanding of popular search queries.

So suppose that our Storm topology has detected that the query “Big Bird” is suddenly spiking. Since the query may remain popular for only a few hours, we send it off to live humans, who can help us quickly understand what it means; this dispatch is performed via a Thrift service that allows us to design our tasks in a web frontend, and later programmatically submit them to our human computation platform using any of the different languages we use across Twitter.

On our crowdsourcing platforms, judges are asked several questions about the query that help us serve better ads. Without going into the exact questions, here are flavors of a few possibilities:

What category does the query belong to? For example, “Stanford” may typically be an education-related query, but perhaps there’s a football game between Stanford and Berkeley at the moment, in which case the current search intent would be sports.
Does the query refer to a person? If so, who, and what is their Twitter handle if they have one? For example, the query “Happy Birthday Harry” may be trending, but it’s hard to know beforehand which of the numerous celebrities named Harry it’s referring to. Is it One Direction’s Harry Styles, in which case the searcher is likely to be interested in teen pop? Harry Potter, in which case the searcher is likely to be interested in fantasy novels? Or someone else entirely?

Labelers in the machine

Since humans are core to this system, let’s describe how our workforce was designed to give us fast, reliable results.

For completing all our tasks, we use a custom pool of judges to ensure high quality. Other typical possibilities in the crowdsourcing world are to use a static set of in-house judges, to use the standard worker filters that Amazon provides, or to go through an outside company like Crowdflower. We’ve experimented with these other solutions, and while they have their own benefits, we found that a custom pool fit our needs best for a few reasons:

In-house judges can provide high-quality work as well, but they usually work standard hours (for example, 9 to 5 if they work onsite, or a relatively fixed and limited set of hours if they work from home), it can be difficult to communicate with them and schedule them for work, and it’s hard to scale the hiring of more judges.
Using Crowdflower or Amazon’s standard filters makes it easy to scale the workforce, but their trust algorithms aren’t perfect, so an endless problem is that spammy workers get through and many of the judgments will be very poor quality. Two methods of combatting low quality are to seed gold standard examples for which you know the true response throughout your task, or to use statistical analysis to determine which workers are the good ones, but these can be time-consuming and expensive to create, and we often run tasks of a free-response researchy nature for which these solutions don’t work. Another problem is that using these filters gives you a fluid, constantly changing set of workers, which makes them hard to train.

In contrast:

Our custom pool of judges work virtually all day. For many of them, this is a full-time job, and they’re geographically distributed, so our tasks complete quickly at all hours; we can easily ask for thousands of judgments before lunch, and have them finished by the time we get back, which makes iterating on our experiments much easier.
We have several forums, mailing lists, and even live chatrooms set up, all of which makes it easy for judges to ask us questions and to respond to feedback. Our judges will even give us suggestions on how to improve our tasks; for example, when we run categorization tasks, they’ll often report helpful categories that we should add.
Since we only launch tasks on demand, our judges are never idly twiddling their thumbs waiting for tasks or completing busywork, and our jobs are rarely backlogged.
Because our judges are culled from the best of the crowdsourcing world, they’re experts at the kinds of tasks we send, and can often provide higher quality at a faster rate than what even in-house judges provide. For example, they’ll often use the forums and chatrooms to collaborate amongst themselves to give us the best judgments, and they’re already familiar with the Firefox and Chrome scripts that help them be the most efficient at their work.

All the benefits described above are especially valuable in this real-time search annotation case:

Having highly trusted workers means we don’t need to wait for multiple annotations on a single search query to confirm validity, so we can send responses to our backend as soon as a single judge responds. This entire pipeline is design for real-time, after all, so the lower the latency on the human evaluation part, the better.
The static nature of our custom pool means that the judges are already familiar with our questions, and don’t need to be trained again.
Because our workers aren’t limited to a fixed schedule or location, they can work anywhere, anytime – which is a requirement for this system, since global event spikes on Twitter are not beholden to a 9-to-5.
And with the multiple easy avenues of communication we have set up, it’s easy for us to answer questions that might arise when we add new questions or modify existing ones.

Edge Prediction in a Social Graph: My Solution to Facebook's User Recommendation Contest on Kaggle

2012-07-31T04:15:00+02:00

A couple weeks ago, Facebook launched a link prediction contest on Kaggle, with the goal of recommending missing edges in a social graph. I love investigating social networks, so I dug around a little, and since I did well enough to score one of the coveted prizes, I’ll share my approach here.

(For some background, the contest provided a training dataset of edges, a test set of nodes, and contestants were asked to predict missing outbound edges on the test set, using mean average precision as the evaluation metric.)

Exploration

What does the network look like? I wanted to play around with the data a bit first just to get a rough feel, so I made an app to interact with the network around each node.

Here’s a sample:

(Go ahead, click on the picture to play with the app yourself. It’s pretty fun.)

The node in black is a selected node from the training set, and we perform a breadth-first walk of the graph out to a maximum distance of 3 to uncover the local network. Nodes are sized according to their distance from the center, and colored according to a chosen metric (a personalized PageRank in this case; more on this later).

We can see that the central node is friends with three other users (in red), two of whom have fairly large, disjoint networks.

There are quite a few dangling nodes (nodes at distance 3 with only one connection to the rest of the local network), though, so let’s remove these to reveal the core structure:

And here’s an embedded version you can manipulate inline:

Since the default view doesn’t encode the distinction between following and follower relationships, we can mouse over each node to see who it follows and who it’s followed by. Here, for example, is the following/follower network of one of the central node’s friends:

The moused over node is highlighted in black, its friends (users who both follow the node and are followed back in turn) are colored in purple, its followees are teal, and its followers in orange. We can also see that the node shares a friend with the central user (triadic closure, holla!).

Here’s another network, this time of the friend at the bottom:

Interestingly, while the first friend had several only-followers (in orange), the second friend has none. (which suggests, perhaps, a node-level feature that measures how follow-hungry a user is…)

And here’s one more node, a little further out (maybe a celebrity, given it has nothing but followers?):

The Quiet One

Let’s take a look at another graph, one whose local network is a little smaller:

A Social Butterfly

And one more, whose local network is a little larger:

Again, I encourage everyone to play around with the app here, and I’ll come back to the question of coloring each node later.

Distributions

Next, let’s take a more quantitative look at the graph.

Here’s the distribution of the number of followers of each node in the training set (cut off at 50 followers for a better fit – the maximum number of followers is 552), as well as the number of users each node is following (again, cut off at 50 – the maximum here is 1566)

Nothing terribly surprising, but that alone is good to verify. (For people tempted to mutter about power laws, I’ll hold you off with the bitter coldness of baby Gauss’s tears.)

Similarly, here are the same two graphs, but limited to the nodes in the test set alone:

Notice that there are relatively more test set users with 0 followees than in the full training set, and relatively fewer test set users with 0 followers. This information could be used to better simulate a validation set for model selection, though I didn’t end up doing this myself.

Preliminary Probes

Finally, let’s move on to the models themselves.

In order to quickly get up and running on a couple prediction algorithms, I started with some unsupervised approaches. For example, after building a new validation set* to test performance offline, I tried:

Recommending users who follow you (but you don’t follow in return)
Recommending users similar to you (when representing users as sets of their followers, and using cosine similarity and Jaccard similarity as the similarity metric)
Recommending users based on a personalized PageRank score
Recommending users that the people you follow also follow

And so on, combining the votes of these algorithms in a fairly ad-hoc way (e.g., by taking the majority vote or by ordering by the number of followers).

This worked quite well actually, but I’d been planning to move on to a more machine learned model-based approach from the beginning, so I did that next.

*My validation set was formed by deleting random edges from the full training set. A slightly better approach, as mentioned above, might have been to more accurately simulate the distribution of the official test set, but I didn’t end up trying this out myself.

Candidate Selection

In order to run a machine learning algorithm to recommend edges (which would take two nodes, a source and a candidate destination, and generate a score measuring the likelihood that the source would follow the destination), it’s necessary to prune the set of candidates to run the algorithm on.

I used two approaches for this filtering step, both based on random walks on the graph.

Personalized PageRank

The first approach was to calculate a personalized PageRank around each source node.

Briefly, a personalized PageRank is like standard PageRank, except that when randomly teleporting to a new node, the surfer always teleports back to the given source node being personalized (rather than to a node chosen uniformly at random, as in the classic PageRank algorithm).

That is, the random surfer in the personalized PageRank model works as follows:

He starts at the source node $X$ that we want to calculate a personalized PageRank around.
At step $i$: with probability $p$, the surfer moves to a neighboring node chosen uniformly at random; with probability $1-p$, the surfer instead teleports back to the original source node $X$.
The limiting probability that the surfer is at node $N$ is then the personalized PageRank score of node $N$ around $X$.

Here’s some Scala code that computes approximate personalized PageRank scores and takes the highest-scoring nodes as the candidates to feed into the machine learning model:

Personalized PageRank

/**
   * Calculate a personalized PageRank around the given user, and return 
   * a list of the nodes with the highest personalized PageRank scores.
   *
   * @return A list of (node, probability of landing at this node after
   *         running a personalized PageRank for K iterations) pairs.
   */
  def pageRank(user: Int): List[(Int, Double)] = {
    // This map holds the probability of landing at each node, up to the 
    // current iteration.
    val probs = Map[Int, Double]()
    probs(user) = 1 // We start at this user.
  
    val pageRankProbs = pageRankHelper(start, probs, NumPagerankIterations)
    pageRankProbs.toList
                 .sortBy { -_._2 }
                 .filter { case (node, score) =>
                    !getFollowings(user).contains(node) && node != user
                  }
                  .take(MaxNodesToKeep)
  }
  
  /**
   * Simulates running a personalized PageRank for one iteration.
   *
   * Parameters:
   * start - the start node to calculate the personalized PageRank around
   * probs - a map from nodes to the probability of being at that node at 
   *         the start of the current iteration
   * numIterations - the number of iterations remaining
   * alpha - with probability alpha, we follow a neighbor; with probability
   *         1 - alpha, we teleport back to the start node
   *
   * @return A map of node -> probability of landing at that node after the
   *         specified number of iterations.
   */
  def pageRankHelper(start: Int, probs: Map[Int, Double], numIterations: Int,
                     alpha: Double = 0.5): Map[Int, Double] = {
    if (numIterations <= 0) {
      probs
    } else {
      // Holds the updated set of probabilities, after this iteration.
      val probsPropagated = Map[Int, Double]()
  
      // With probability 1 - alpha, we teleport back to the start node.
      probsPropagated(start) = 1 - alpha
  
      // Propagate the previous probabilities...
      probs.foreach { case (node, prob) =>
        val forwards = getFollowings(node)
        val backwards = getFollowers(node)
  
        // With probability alpha, we move to a follower...
        // And each node distributes its current probability equally to 
        // its neighbors.
        val probToPropagate = alpha * prob / (forwards.size + backwards.size)
        (forwards.toList ++ backwards.toList).foreach { neighbor =>
          if (!probsPropagated.contains(neighbor)) {
            probsPropagated(neighbor) = 0
          }
          probsPropagated(neighbor) += probToPropagate
        }
      }
  
      pageRankHelper(start, probsPropagated, numIterations - 1, alpha)
    }
  }
  

Propagation Score

Another approach I used, based on a proposal by another contestant on the Kaggle forums, works as follows:

Start at a specified user node and give it some score.
In the first iteration, this user propagates its score equally to its neighbors.
In the second iteration, each user duplicates and keeps half of its score S. It then propagates S equally to its neighbors.
In subsequent iterations, the process is repeated, except that neighbors reached via a backwards link don’t duplicate and keep half of their score. (The idea is that we want the score to reach followees and not followers.)

Here’s some Scala code to calculate these propagation scores:

Propagation Score

/**
   * Calculate propagation scores around the current user.
   *
   * In the first propagation round, we
   *
   * - Give the starting node N an initial score S.
   * - Propagate the score equally to each of N's neighbors (followers 
   *   and followings).
   * - Each first-level neighbor then duplicates and keeps half of its score
   *   and then propagates the original again to its neighbors.
   *
   * In further rounds, neighbors then repeat the process, except that neighbors 
   * traveled to via a backwards/follower link don't keep half of their score.
   *
   * @return a sorted list of (node, propagation score) pairs.
   */
  def propagate(user: Int): List[(Int, Double)] = {
    val scores = Map[Int, Double]()
  
    // We propagate the score equally to all neighbors.
    val scoreToPropagate = 1.0 / (getFollowings(user).size + getFollowers(user).size)
  
    (getFollowings(user).toList ++ getFollowers(user).toList).foreach { x =>
      // Propagate the score...
      continuePropagation(scores, x, scoreToPropagate, 1)
      // ...and make sure it keeps half of it for itself.
      scores(x) = scores.getOrElse(x, 0: Double) + scoreToPropagate / 2
    }
  
    scores.toList.sortBy { -_._2 }
                 .filter { nodeAndScore =>
                   val node = nodeAndScore._1
                   !getFollowings(user).contains(node) && node != user
                  }
                  .take(MaxNodesToKeep)
  }
  
  /**
   * In further rounds, neighbors repeat the process above, except that neighbors
   * traveled to via a backwards/follower link don't keep half of their score.
   */
  def continuePropagation(scores: Map[Int, Double], user: Int, score: Double,
                          currIteration: Int): Unit = {
    if (currIteration < NumIterations && score > 0) {
      val scoreToPropagate = score / (getFollowings(user).size + getFollowers(user).size)
  
      getFollowings(user).foreach { x =>
        // Propagate the score...        
        continuePropagation(scores, x, scoreToPropagate, currIteration + 1)
        // ...and make sure it keeps half of it for itself.        
        scores(x) = scores.getOrElse(x, 0: Double) + scoreToPropagate / 2
      }
  
      getFollowers(user).foreach { x =>
        // Propagate the score...
        continuePropagation(scores, x, scoreToPropagate, currIteration + 1)
        // ...but backward links (except for the starting node's immediate
        // neighbors) don't keep any score for themselves.
      }
    }
  }
  

I played around with tweaking some parameters in both approaches (e.g., weighting followers and followees differently), but the natural defaults (as used in the code above) ended up performing the best.

Features

After pruning the set of candidate destination nodes to a more feasible level, I fed pairs of (source, destination) nodes into a machine learning model. From each pair, I extracted around 30 features in total.

As mentioned above, one feature that worked quite well on its own was whether the destination node already follows the source.

I also used a wide set of similarity-based features, for example, the Jaccard similarity between the source and destination when both are represented as sets of their followers, when both are represented as sets of their followees, or when one is represented as a set of followers while the other is represented as a set of followees.

Similarity Metrics

abstract class SimilarityMetric[T] {
    def apply(set1: Set[T], set2: Set[T]): Double;
  }
  
  object JaccardSimilarity extends SimilarityMetric[Int] {
    /**
     * Returns the Jaccard similarity between two sets, 0 if both are empty.
     */
    def apply(set1: Set[Int], set2: Set[Int]): Double = {
      val union = (set1.union(set2)).size
  
      if (union == 0) {
        0
      } else {
        (set1 & set2).size.toFloat / union
      }
    }
  
  }
  
  object CosineSimilarity extends SimilarityMetric[Int] {
    /**
     * Returns the cosine similarity between two sets, 0 if both are empty.
     */
    def apply(set1: Set[Int], set2: Set[Int]): Double = {
      if (set1.size == 0 && set2.size == 0) {
        0
      } else {
        (set1 & set2).size.toFloat / (math.sqrt(set1.size * set2.size))
      }
    }
  
  }
  
  // ************
  // * FEATURES *
  // ************
  
  /**
   * Returns the similarity between user1 and user2 when both are represented as
   * sets of followers.
   */
  def similarityByFollowers(user1: Int, user2: Int)
                           (implicit similarity: SimilarityMetric[Int]): Double = {
    similarity.apply(getFollowersWithout(user1, user2),
                     getFollowersWithout(user2, user1))
  }
  
  // etc.
  

Along the same lines, I also computed a similarity score between the destination node and the source node’s followees, and several variations thereof.

Extended Similarity Scores

/**
   * Iterate over each of user1's followings, compute their similarity with
   * user2 when both are represented as sets of followers, and return the 
   * sum of these similarities.
   */
  def followerBasedSimilarityToFollowing(user1: Int, user2: Int)
    (implicit similarity: SimilarityMetric[Int]): Double = {
      getFollowingsWithout(user1, user2)
                          .map { similarityByFollowers(_, user2)(similarity) }
                          .sum
  }
  

Other features included the number of followers and followees of each node, the ratio of these, the personalized PageRank and propagation scores themselves, the number of followers in common, and triangle/closure-type features (e.g., whether the source node is friends with a node X who in turn is a friend of the destination node).

If I had had more time, I would probably have tried weighted and more regularized versions of some of these features as well (e.g., downweighting nodes with large numbers of followers when computing cosine similarity scores based on followees, or shrinking the scores of nodes we have little information about).

Feature Understanding

But what are these features actually doing? Let’s use the same app I built before to take a look.

Here’s the local network of node 317 (different from the node above), where each node is colored by its personalized PageRank (higher scores are in darker red):

If we look at the following vs. follower relationships of the central node (recall that purple is friends, teal is followings, orange is followers):

…we can see that, as expected (because edges that represented both following and follower were double-weighted in my PageRank calculation), the darkest red nodes are those that are friends with the central node, while those in a following-only or follower-only relationship have a lower score.

How does the propagation score compare to personalized PageRank? Here, I colored each node according to the log ratio of its propagation score and personalized PageRank:

Comparing this coloring with the local follow/follower network:

…we can see that followed nodes (in teal) receive a higher propagation weight than friend nodes (in purple), while follower nodes (in orange) receive almost no propagation score at all.

Going back to node 1, let’s look at a different metric. Here, each node is colored according to its Jaccard similarity with the source, when nodes are represented by the set of their followers:

We can see that, while the PageRank and propagation metrics tended to favor nodes close to the central node, the Jaccard similarity feature helps us explore nodes that are further out.

However, if we look the high-scoring nodes more closely, we see that they often have only a single connection to the rest of the network:

In other words, their high Jaccard similarity is due to the fact that they don’t have many connections to begin with. This suggests that some regularization or shrinking is in order.

So here’s a regularized version of Jaccard similarity, where we downweight nodes with few connections:

We can see that the outlier nodes are much more muted this time around.

For a starker difference, compare the following two graphs of the Jaccard similarity metric around node 317 (the first graph is an unregularized version, the second is regularized):

Notice, in particular, how the popular node in the top left and the popular nodes at the bottom have a much higher score when we regularize.

And again, there are other networks and features I haven’t mentioned here, so play around and discover them on the app itself.

Models

For the machine learning algorithms on top of my features, I experimented with two types of models: logistic regression (using both L1 and L2 regularization) and random forests. (If I had more time, I would probably have done some more parameter tuning and maybe tried gradient boosted trees as well.)

So what is a random forest? I wrote an old (layman’s) post on it here, but since nobody ever clicks on these links, let’s copy it over:

Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you’ll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like “Is X a romantic movie?”, “Does Johnny Depp star in X?”, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.
Thus, Willow is a decision tree for your movie preferences.
But Willow is only human, so she doesn’t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you’d like to ask a bunch of your friends, and watch movie X if most of them say they think you’ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you’ll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).
Now you don’t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you’re not absolutely sure of your preferences yourself – you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn’t use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you *really really* loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don’t change your love/hate decisions, you just say you love/hate some movies a little more or less (you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you tell Woody that you liked Black Swan so much you watched it twice, you disliked Avatar, and don’t mention Harry Potter at all.
By using this ensemble, you hope that while each of your friends gives somewhat idiosyncratic recommendations (Willow thinks you like vampire movies more than you do, Woody thinks you like Pixar movies, and Cartman thinks you just hate everything), the errors get canceled out in the majority. Thus, your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.
There’s still one problem with your data, however. While you loved both Titanic and Inception, it wasn’t because you like movies that star Leonardio DiCaprio. Maybe you liked both movies for other reasons. Thus, you don’t want your friends to all base their recommendations on whether Leo is in a movie or not. So when each friend asks IMDB a question, only a random subset of the possible questions is allowed (i.e., when you’re building a decision tree, at each node you use some randomness in selecting the attribute to split on, say by randomly selecting an attribute or by selecting an attribute from a random subset). This means your friends aren’t allowed to ask whether Leonardo DiCaprio is in the movie whenever they want. So whereas previously you injected randomness at the data level, by perturbing your movie preferences slightly, now you’re injecting randomness at the model level, by making your friends ask different questions at different times.
And so your friends now form a random forest.

Moving on, I essentially trained scikit-learn’s classifiers on an equal split of true and false edges (sampled from the output of my pruning step, in order to match the distribution I’d get when applying my algorithm to the official test set), and compared performance on the validation set I made, with a small amount of parameter tuning:

Random Forest

########################################
  # STEP 1: Read in the training examples.
  ########################################
  truths = [] # A truth is 1 (for a known true edge) or 0 (for a false edge).
  training_examples = [] # Each training example is an array of features.
  for line in open(TRAINING_SET_WITH_FEATURES_FILENAME):
    values = [float(x) for x in line.split(",")]
    truth = values[0]
    training_example_features = values[1:]
  
    truths.append(truth)
    training_examples.append(training_example_features)
  
  #############################
  # STEP 2: Train a classifier.
  #############################
  rf = RandomForestClassifier(n_estimators = 500, compute_importances = True, oob_score = True)
  rf = rf.fit(training_examples, truths)
  

So let’s look at the variable importance scores as determined by one of my random forest models, which (unsurprisingly) consistently outperformed logistic regression.

The random forest classifier here is one of my earlier models (using a slightly smaller subset of my full suite of features), where the targeting step consisted of taking the top 25 nodes with the highest propagation scores.

We can see that the most important variables are:

Personalized PageRank scores. (I put in both normalized and unnormalized versions, where the normalized versions consisted of taking all the candidates for a particular source node, and scaling them so that the maximum personalized PageRank score was 1.)
Whether the destination node already follows the source.
How similar the source node is to the people the destination node is following, when each node is represented as a set of followers. (Note that this is more or less measuring how likely the destination is to follow the source, which we already saw is a good predictor of whether the source is likely to follow the destination.) Plus several variations on this theme (e.g., how similar the destination node is to the source node’s followers, when each node is represented as a set of followees).

Model Comparison

How do all of these models compare to each other? Is the random forest model universally better than the logistic regression model, or are there some sets of users for which the logistic regression model actually performs better?

To enable these kinds of comparisons, I made a small module that allows you to select two models and then visualize their sliced performance.

(Go ahead, play around.)

Above, I bucketed all test nodes into buckets based on (the logarithm of) their number of followers, and compared the mean average precision of two algorithms: one that recommends nodes to follow using a personalized PageRank alone, and one that recommends nodes that are following the source user but are not followed back in return.

We see that except for the case of 0 followers (where the “is followed by” algorithm can do nothing), the personalized PageRank algorithm gets increasingly better in comparison: at first, the two algorithms have roughly equal performance, but as the source node gets more followers, the personalized PageRank algorithm dominates.

And here’s an embedded version you can interact with directly:

Admittedly, building a slicer like this is probably overkill for a Kaggle competition, where the set of variables is fairly limited. But imagine having something similar for a real world model, where new algorithms are tried out every week and we can slice the performance by almost any dimension we can imagine (by geography, to make sure we don’t improve Australia at the expense of the UK; by user interests, to see where we could improve the performance of topic inference; by number of user logins, to make sure we don’t sacrifice the performance on new users for the gain of the core).

Mathematicians do it with Matrices

Let’s switch directions slightly and think about how we could rewrite our computations in a different, matrix-oriented style. (I didn’t do this in the competition – this is more a preview of another post I’m writing.)

Personalized PageRank in Scalding

Personalized PageRank, for example, is an obvious fit for a matrix rewrite. Here’s how it would look in Scalding’s new Matrix library:

(For those who don’t know, Scalding is a Hadoop framework that Twitter released at the beginning of the year; see my post on building a big data recommendation engine in Scalding for an introduction.)

Personalized PageRank, Matrix Style

// ***********************************************
  // STEP 1. Load the adjacency graph into a matrix.
  // ***********************************************
  
  val following = Tsv(GraphFilename, ('user1, 'user2, 'weight))
  
  // Binary matrix where cell (u1, u2) means that u1 follows u2.
  val followingMatrix =
    following.toMatrix[Int,Int,Double]('user1, 'user2, 'weight)
  
  // Binary matrix where cell (u1, u2) means that u1 is followed by u2.  
  val followerMatrix = followingMatrix.transpose
  
  // Note: we could also form this adjacency matrix differently, by placing
  // different weights on the following vs. follower edges.
  val undirectedAdjacencyMatrix =
    (followingMatrix + followerMatrix).rowL1Normalize
  
  // Create a diagonal users matrix (to be used in the "teleportation back
  // home" step).
  val usersMatrix =
    following.unique('user1)
             .map('user1 -> ('user2, 'weight)) { user1: Int => (user1, 1) }
             .toMatrix[Int, Int, Double]('user1, 'user2, 'weight)
  
  // ***************************************************
  // STEP 2. Compute the personalized PageRank scores.
  // See http://nlp.stanford.edu/projects/pagerank.shtml
  // for more information on personalized PageRank.
  // ***************************************************
  
  // Compute personalized PageRank by running for three iterations,
  // and output the top candidates.
  val pprScores = personalizedPageRank(usersMatrix, undirectedAdjacencyMatrix, usersMatrix, 0.5, 3)
  pprScores.topRowElems(numCandidates).write(Tsv(OutputFilename))
  
  /**
   * Performs a personalized PageRank iteration. The ith row contains the
   * personalized PageRank probabilities around node i.
   *
   * Note the interpretation: 
   *   - with probability 1 - alpha, we go back to where we started.
   *   - with probability alpha, we go to a neighbor.
   *
   * Parameters:
   *   
   *   startMatrix - a (usually diagonal) matrix, where the ith row specifies
   *                 where the ith node teleports back to.
   *   adjacencyMatrix
   *   prevMatrix - a matrix whose ith row contains the personalized PageRank
   *                probabilities around the ith node.
   *   alpha - the probability of moving to a neighbor (as opposed to
   *           teleporting back to the start).
   *   numIterations - the number of personalized PageRank iterations to run. 
   */
  def personalizedPageRank(startMatrix: Matrix[Int, Int, Double],
                           adjacencyMatrix: Matrix[Int, Int, Double],
                           prevMatrix: Matrix[Int, Int, Double],
                           alpha: Double,
                           numIterations: Int): Matrix[Int, Int, Double] = {
      if (numIterations <= 0) {
        prevMatrix
      } else {
        val updatedMatrix = startMatrix * (1 - alpha) +
                            (prevMatrix * adjacencyMatrix) * alpha
        personalizedPageRank(startMatrix, adjacencyMatrix, updatedMatrix, alpha, numIterations - 1)
      }
  }
  

Not only is this matrix formulation a more natural way of expressing the algorithm, but since Scalding (by way of Cascading) supports both local and distributed modes, this code runs just as easily on a Hadoop cluster of thousands of machines (assuming our social network is orders of magnitude larger than the one in the contest) as on a sample of data in a laptop. Big data, big matrix style, BOOM.

Cosine Similarity as L2-Normalized Multiplication

Here’s another example. Calculating cosine similarity between all users is a natural fit for a matrix formulation since, after all, the cosine similarity between two vectors is just their L2-normalized dot product:

Cosine Similarity, Matrix Style

// A matrix where the cell (i, j) is 1 iff user i is followed by user j.
  val followerMatrix = ...
  
  // A matrix where cell (i, j) holds the cosine similarity between
  // user i and user j, when both are represented as sets of their followers.
  val followerBasedSimilarityMatrix =
    followerMatrix.rowL2Normalize * followerMatrix.rowL2Normalize.transpose
  

A Similarity Extension

But let’s go one step further.

To change examples for ease of exposition: suppose you’ve bought a bunch of books on Amazon, and Amazon wants to recommend a new book you’ll like. Since Amazon knows similarities between all pairs of books, one natural way to generate this recommendation is to:

Take every book B.
Calculate the similarity between B and each book you bought.
Sum up all these similarities to get your recommendation score for B.

In other words, the recommendation score for book B on user U is:

DidUserBuy(U, Book 1) * SimilarityBetween(Book B, Book 1) + DidUserBuy(U, Book 2) * SimilarityBetween(Book B, Book2) + … + DidUserBuy(U, Book n) * SimilarityBetween(Book B, Book n)

This, too, is a dot product! So it can also be rewritten as a matrix multiplication:

// A matrix where cell (i, j) holds the similarity between books i and j.
  val bookSimilarityMatrix = ...
  
  // A matrix where cell (i, j) is 1 if user i has bought book j, 
  // and 0 otherwise.
  val userPurchaseMatrix = ...
  
  // A matrix where cell (i, j) holds the recommendation score of
  // book j to user i.
  val recommendationMatrix = userPurchaseMatrix * bookSimilarityMatrix
  

Of course, there’s a natural analogy between this score and the feature I described a while back above, where I compute a similarity score between a destination node and a source node’s followees (when all nodes are represented as sets of followers):

/**
   * Iterate over each of user1's followings, compute their similarity
   * with user2 when both are represented as sets of followers, and return
   * the sum of these similarities.
   */
  def followerBasedSimilarityToFollowings(user1: Int, user2: Int)
      (implicit similarity: SimilarityMetric[Int]): Double = {
    getFollowingsWithout(user1, user2)
                        .map { similarityByFollowers(_, user2)(similarity) }
                        .sum
  }
  
  /**
   * The matrix version of the above function.
   *
   * Why are these the same? Note that the above function simply computes:
   *   DoesUserFollow(User A, User 1) * Similarity(User 1, User B) + 
   *     DoesUserFollow(User A, User 2) * Similarity(User 2, User B) + ... + 
   *     DoesUserFollow(User A, User n) * Similarity(User n, User B)
   */
  val followingMatrix = ...
  val followerBasedSimilarityMatrix =
    followerMatrix.rowL2Normalize * followerMatrix.rowL2Normalize.transpose
  
  val followerBasedSimilarityToFollowingsMatrix =
    followingMatrix * followerBasedSimilarityMatrix
  

For people comfortable expressing their computations in a vector manner, writing your computations as matrix manipulations often makes experimenting with different algorithms much more fluid. Imagine, for example, that you want to switch from L1 normalization to L2 normalization, or that you want to express your objects as binary sets rather than weighted vectors. Both of these become simple one-line changes when you have vectors and matrices as first-class objects, but are much more tedious (especially in a MapReduce land where this matrix library was designed to be applied!) when you don’t.

Finish Line

By now, I think I’ve spent more time writing this post than on the contest itself, so let’s wrap up.

I often get asked what kinds of tools I like to use, so for this competition my kit consisted of:

Scala, for code that needed to be fast (e.g., extracting features) or that I was going to run repeatedly (e.g., scoring my validation set).
Python, for my machine learning models, because scikit-learn is awesome.
Ruby, for quick one-off scripts.
R, for some data analysis and simple plotting.
Coffeescript and d3, for the interactive visualizations.

Finally, I put up a Github repository containing some code, and here are a couple other posts I’ve written that people who like this entry might also enjoy:

Information transmission in a social network, a case study in how information propagates through a social graph.
Movie recommendations in Scalding, Twitter’s Scala-based Hadoop framework built on top of Cascading.
A summary of the algorithms behind the Netflix Prize, another crowdsourced recommendation contest for predicting movie ratings.

Soda vs. Pop with Twitter

2012-07-06T04:15:00+02:00

One of the great things about Twitter is that it’s a global conversation anyone can join anytime. Eavesdropping on the world, what what!

Of course, it gets even better when you can mine all this chatter to study the way humans live and interact.

For example, how do people in New York City differ from those in Silicon Valley? We tend to think they’re more financially driven and restless with the world – is this true, and if so, how much more?

Or how does language change as you travel to different regions? Recall the classic soda vs. pop. vs. coke question: some people use the word “soda” to describe their soft drinks, others use “pop”, and still others use “coke”. Who says what where?

Let’s take a look.

To make this map, I sampled geo-tagged tweets containing the words “soda”, “pop”, or “coke”, performed some state-of-the-art NLP technology to ensure the tweets were soft drink related (e.g., the tweets had to contain “drink soda” or “drink a pop”), and tried to filter out coke tweets that were specifically about the Coke brand (e.g., Coke Zero).

It’s a little cluttered, though, so let’s clean it up by aggregating nearby tweets.

Here, I bucketed all tweets within a 0.333 latitude/longitude radius, calculated the term distribution within each bucket, and colored each bucket with the word furthest from its overall mean. I also sized each point according to the (log-transformed) number of tweets in the bucket.

We can see that:

The South is pretty Coke-heavy.
Soda belongs to the Northeast and far West.
Pop gets the mid-West, except for some interesting spots of blue around Wisconsin and the Illinois-Missouri border.

For comparison, here’s another map based on a survey at popvssoda.com.

We can see similar patterns, though interestingly, our map has less Coke in the Southeast and less pop in the Northwest.

Finally, here’s a world map of the terms, bucketed again. Notice that “pop” seems to be prevalent only in parts of the United States and Canada.

As some astute readers noted, though, the seeming dominance of coke is probably due to the difficulty in distinguishing the generic use of coke for soft drinks in general from the particular use of coke for referring to the Coca-Cola brand.

So let’s instead look at a world map of a couple other soft drink terms (“fizzy drink”, “mineral”, and “tonic”):

Notice that:

“Fizzy drink” shows up for the UK, New Zealand, and Maine.
“Tonic” appears in Massachusetts.
While South Africa gets “fizzy drink”, Nigeria gets “mineral”.

I’ve been getting a lot of questions lately about interesting things you can do with the Twitter API, so this was just one small project I’ve worked on to illustrate. This paper contains another awesome application of Twitter data to geographic language variation, and just for fun, here are a few other cute mini-projects:

What do people eat during the Super Bowl? (wings and beer, apparently)

What do people want for Christmas, compared to what they actually get?

What do guys and girls really say?

When were people losing and gaining power during Hurricane Sandy? (click the image to interact)

How does information of a geographic-specific nature spread? (click the image to see a dynamic visualization of when and where tweets related to surviving Hurricane Sandy were shared)

Can we use Twitter to measure presidential votes? (yes!)

Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process

2012-03-20T04:15:00+01:00

Imagine you’re a budding chef. A data-curious one, of course, so you start by taking a set of foods (pizza, salad, spaghetti, etc.) and ask 10 friends how much of each they ate in the past day.

Your goal: to find natural groups of foodies, so that you can better cater to each cluster’s tastes. For example, your fratboy friends might love wings and beer, your anime friends might love soba and sushi, your hipster friends probably dig tofu, and so on.

So how can you use the data you’ve gathered to discover different kinds of groups?

One way is to use a standard clustering algorithm like k-means or Gaussian mixture modeling (see this previous post for a brief introduction). The problem is that these both assume a fixed number of clusters, which they need to be told to find. There are a couple methods for selecting the number of clusters to learn (e.g., the gap and prediction strength statistics), but the problem is a more fundamental one: most real-world data simply doesn’t have a fixed number of clusters.

That is, suppose we’ve asked 10 of our friends what they ate in the past day, and we want to find groups of eating preferences. There’s really an infinite number of foodie types (carnivore, vegan, snacker, Italian, healthy, fast food, heavy eaters, light eaters, and so on), but with only 10 friends, we simply don’t have enough data to detect them all. (Indeed, we’re limited to 10 clusters!) So whereas k-means starts with the incorrect assumption that there’s a fixed, finite number of clusters that our points come from, no matter if we feed it more data, what we’d really like is a method positing an infinite number of hidden clusters that naturally arise as we ask more friends about their food habits. (For example, with only 2 data points, we might not be able to tell the difference between vegans and vegetarians, but with 200 data points, we probably could.)

Luckily for us, this is precisely the purview of nonparametric Bayes.*

*Nonparametric Bayes refers to a class of techniques that allow some parameters to change with the data. In our case, for example, instead of fixing the number of clusters to be discovered, we allow it to grow as more data comes in.

A Generative Story

Let’s describe a generative model for finding clusters in any set of data. We assume an infinite set of latent groups, where each group is described by some set of parameters. For example, each group could be a Gaussian with a specified mean $\mu_i$ and standard deviation $\sigma_i$, and these group parameters themselves are assumed to come from some base distribution $G_0$. Data is then generated in the following manner:

Select a cluster.
Sample from that cluster to generate a new point.

(Note the resemblance to a finite mixture model.)

For example, suppose we ask 10 friends how many calories of pizza, salad, and rice they ate yesterday. Our groups could be:

A Gaussian centered at (pizza = 5000, salad = 100, rice = 500) (i.e., a pizza lovers group).
A Gaussian centered at (pizza = 100, salad = 3000, rice = 1000) (maybe a vegan group).
A Gaussian centered at (pizza = 100, salad = 100, rice = 10000) (definitely Asian).
…

When deciding what to eat when she woke up yesterday, Alice could have thought girl, I’m in the mood for pizza and her food consumption yesterday would have been a sample from the pizza Gaussian. Similarly, Bob could have spent the day in Chinatown, thereby sampling from the Asian Gaussian for his day’s meals. And so on.

The big question, then, is: how do we assign each friend to a group?

Assigning Groups

Chinese Restaurant Process

One way to assign friends to groups is to use a Chinese Restaurant Process. This works as follows: Imagine a restaurant where all your friends went to eat yesterday…

Initially the restaurant is empty.
The first person to enter (Alice) sits down at a table (selects a group). She then orders food for the table (i.e., she selects parameters for the group); everyone else who joins the table will then be limited to eating from the food she ordered.
The second person to enter (Bob) sits down at a table. Which table does he sit at? With probability $\alpha / (1 + \alpha)$ he sits down at a new table (i.e., selects a new group) and orders food for the table; with probability $1 / (1 + \alpha)$ he sits with Alice and eats from the food she’s already ordered (i.e., he’s in the same group as Alice).
…
The (n+1)-st person sits down at a new table with probability $\alpha / (n + \alpha)$, and at table k with probability $n_k / (n + \alpha)$, where $n_k$ is the number of people currently sitting at table k.

Note a couple things:

The more people (data points) there are at a table (cluster), the more likely it is that people (new data points) will join it. In other words, our groups satisfy a rich get richer property.
There’s always a small probability that someone joins an entirely new table (i.e., a new group is formed).
The probability of a new group depends on $\alpha$. So we can think of $\alpha$ as a dispersion parameter that affects the dispersion of our datapoints. The lower alpha is, the more tightly clustered our data points; the higher it is, the more clusters we have in any finite set of points.

(Also notice the resemblance between table selection probabilities and a Dirichlet distribution…)

Just to summarize, given n data points, the Chinese Restaurant Process specifies a distribution over partitions (table assignments) of these points. We can also generate parameters for each partition/table from a base distribution $G_0$ (for example, each table could represent a Gaussian whose mean and standard deviation are sampled from $G_0$), though to be clear, this is not part of the CRP itself.

Code

Since code makes everything better, here’s some Ruby to simulate a CRP:

Chinese Restaurant Process chinese_restaurant_process.rb

# Generate table assignments for `num_customers` customers, according to
  # a Chinese Restaurant Process with dispersion parameter `alpha`.
  #
  # returns an array of integer table assignments
  def chinese_restaurant_process(num_customers, alpha)
   return [] if num_customers <= 0
  
   table_assignments = [1] # first customer sits at table 1
   next_open_table = 2 # index of the next empty table
  
   # Now generate table assignments for the rest of the customers.
   1.upto(num_customers - 1) do |i|
     if rand < alpha.to_f / (alpha + i)
       # Customer sits at new table.
       table_assignments << next_open_table
       next_open_table += 1
     else
       # Customer sits at an existing table.
       # He chooses which table to sit at by giving equal weight to each
       # customer already sitting at a table. 
       which_table = table_assignments[rand(table_assignments.size)]
       table_assignments << which_table
     end
   end
  
   table_assignments
  end
  

And here’s some sample output:

Chinese Restaurant Process chinese_restaurant_process.rb

> chinese_restaurant_process(num_customers = 10, alpha = 1)
  1, 2, 3, 4, 3, 3, 2, 1, 4, 3 # table assignments from run 1
  1, 1, 1, 1, 1, 1, 2, 2, 1, 3 # table assignments from run 2
  1, 2, 2, 1, 3, 3, 2, 1, 3, 4 # table assignments from run 3
  
  > chinese_restaurant_process(num_customers = 10, alpha = 3)
  1, 2, 1, 1, 3, 1, 2, 3, 4, 5
  1, 2, 3, 3, 4, 3, 4, 4, 5, 5
  1, 1, 2, 3, 1, 4, 4, 3, 1, 1
  
  > chinese_restaurant_process(num_customers = 10, alpha = 5)
  1, 2, 1, 3, 4, 5, 6, 7, 1, 8
  1, 2, 3, 3, 4, 5, 6, 5, 6, 7
  1, 2, 3, 4, 5, 6, 2, 7, 2, 1
  

Notice that as we increase $\alpha$, so too does the number of distinct tables increase.

Polya Urn Model

Another method for assigning friends to groups is to follow the Polya Urn Model. This is basically the same model as the Chinese Restaurant Process, just with a different metaphor.

We start with an urn containing $\alpha G_0(x)$ balls of “color” $x$, for each possible value of $x$. ($G_0$ is our base distribution, and $G_0(x)$ is the probability of sampling $x$ from $G_0$). Note that these are possibly fractional balls.
At each time step, draw a ball from the urn, note its color, and then drop both the original ball plus a new ball of the same color back into the urn.

Note the connection between this process and the CRP: balls correspond to people (i.e., data points), colors correspond to table assignments (i.e., clusters), alpha is again a dispersion parameter (put differently, a prior), colors satisfy a rich-get-richer property (since colors with many balls are more likely to get drawn), and so on. (Again, there’s also a connection between this urn model and the urn model for the (finite) Dirichlet distribution…)

To be precise, the difference between the CRP and the Polya Urn Model is that the CRP specifies only a distribution over partitions (i.e., table assignments), but doesn’t assign parameters to each group, whereas the Polya Urn Model does both.

Code

Again, here’s some code for simulating a Polya Urn Model:

Polya Urn Model polya_urn_model.rb

# Draw `num_balls` colored balls according to a Polya Urn Model
  # with a specified base color distribution and dispersion parameter
  # `alpha`.
  #
  # returns an array of ball colors
  def polya_urn_model(base_color_distribution, num_balls, alpha)
    return [] if num_balls <= 0
  
    balls_in_urn = []
    0.upto(num_balls - 1) do |i|
      if rand < alpha.to_f / (alpha + balls_in_urn.size)
        # Draw a new color, put a ball of this color in the urn.
        new_color = base_color_distribution.call
        balls_in_urn << new_color
      else
        # Draw a ball from the urn, add another ball of the same color.
        ball = balls_in_urn[rand(balls_in_urn.size)]
        balls_in_urn << ball
      end
    end
  
    balls_in_urn
  end
  

And here’s some sample output, using a uniform distribution over the unit interval as the color distribution to sample from:

Polya Urn Model polya_urn_model.rb

> unit_uniform = lambda { (rand * 100).to_i / 100.0 }
  
  > polya_urn_model(unit_uniform, num_balls = 10, alpha = 1)
  0.27, 0.89, 0.89, 0.89, 0.73, 0.98, 0.43, 0.98, 0.89, 0.53 # colors in the urn from run 1
  0.26, 0.26, 0.46, 0.26, 0.26, 0.26, 0.26, 0.26, 0.26, 0.85 # colors in the urn from run 2
  0.96, 0.87, 0.96, 0.87, 0.96, 0.96, 0.87, 0.96, 0.96, 0.96 # colors in the urn from run 3
  

Code, Take 2

Here’s the same code for a Polya Urn Model, but in R:

Polya Urn Model polya_urn_model.R

# Return a vector of `num_balls` ball colors according to a Polya Urn Model
  # with dispersion `alpha`, sampling from a specified base color distribution.
  polya_urn_model = function(base_color_distribution, num_balls, alpha) {
    balls = c()
  
    for (i in 1:num_balls) {
      if (runif(1) < alpha / (alpha + length(balls))) {
        # Add a new ball color.
        new_color = base_color_distribution()
        balls = c(balls, new_color)
      } else {
        # Pick out a ball from the urn, and add back a
        # ball of the same color.
        ball = balls[sample(1:length(balls), 1)]
        balls = c(balls, ball)
      }
    }
  
    balls
  }
  

Here are some sample density plots of the colors in the urn, when using a unit normal as the base color distribution:

Notice that as alpha increases (i.e., we sample more new ball colors from our base; i.e., as we place more weight on our prior), the colors in the urn tend to a unit normal (our base color distribution).

And here are some sample plots of points generated by the urn, for varying values of alpha:

Each color in the urn is sampled from a uniform distribution over [0,10]x[0,10] (i.e., a [0, 10] square).
Each group is a Gaussian with standard deviation 0.1 and mean equal to its associated color, and these Gaussian groups generate points.

Notice that the points clump together in fewer clusters for low values of alpha, but become more dispersed as alpha increases.

Stick-Breaking Process

Imagine running either the Chinese Restaurant Process or the Polya Urn Model without stop. For each group $i$, this gives a proportion $w_i$ of points that fall into group $i$.

So instead of running the CRP or Polya Urn model to figure out these proportions, can we simply generate them directly?

This is exactly what the Stick-Breaking Process does:

Start with a stick of length one.
Generate a random variable $\beta_1 \sim Beta(1, \alpha)$. By the definition of the Beta distribution, this will be a real number between 0 and 1, with expected value $1 / (1 + \alpha)$. Break off the stick at $\beta_1$; $w_1$ is then the length of the stick on the left.
Now take the stick to the right, and generate $\beta_2 \sim Beta(1, \alpha)$. Break off the stick $\beta_2$ into the stick. Again, $w_2$ is the length of the stick to the left, i.e., $w_2 = (1 - \beta_1) \beta_2$.
And so on.

Thus, the Stick-Breaking process is simply the CRP or Polya Urn Model from a different point of view. For example, assigning customers to table 1 according to the Chinese Restaurant Process is equivalent to assigning customers to table 1 with probability $w_1$.

Code

Here’s some R code for simulating a Stick-Breaking process:

Stick-Breaking Process stick_breaking_process.R

# Return a vector of weights drawn from a stick-breaking process
  # with dispersion `alpha`.
  #
  # Recall that the kth weight is
  #   \beta_k = (1 - \beta_1) * (1 - \beta_2) * ... * (1 - \beta_{k-1}) * beta_k
  # where each $\\beta\_i$ is drawn from a Beta distribution
  #   \beta_i ~ Beta(1, \alpha)
  stick_breaking_process = function(num_weights, alpha) {
    betas = rbeta(num_weights, 1, alpha)
    remaining_stick_lengths = c(1, cumprod(1 - betas))[1:num_weights]
    weights = remaining_stick_lengths * betas
    weights
  }
  

And here’s some sample output:

Notice that for low values of alpha, the stick weights are concentrated on the first few weights (meaning our data points are concentrated on a few clusters), while the weights become more evenly dispersed as we increase alpha (meaning we posit more clusters in our data points).

Dirichlet Process

Suppose we run a Polya Urn Model several times, where we sample colors from a base distribution $G_0$. Each run produces a distribution of colors in the urn (say, 5% blue balls, 3% red balls, 2% pink balls, etc.), and the distribution will be different each time (for example, 5% blue balls in run 1, but 1% blue balls in run 2).

For example, let’s look again at the plots from above, where I generated samples from a Polya Urn Model with the standard unit normal as the base distribution:

Each run of the Polya Urn Model produces a slighly different distribution, though each is “centered” in some fashion around the standard Gaussian I used as base. In other words, the Polya Urn Model gives us a distribution over distributions (we get a distribution of ball colors, and this distribution of colors changes each time) – and so we finally get to the Dirichlet Process.

Formally, given a base distribution $G_0$ and a dispersion parameter $\alpha$, a sample from the Dirichlet Process $DP(G_0, \alpha)$ is a distribution $G \sim DP(G_0, \alpha)$. This sample $G$ can be thought of as a distribution of colors in a single simulation of the Polya Urn Model; sampling from $G$ gives us the balls in the urn.

So here’s the connection between the Chinese Restaurant Process, the Polya Urn Model, the Stick-Breaking Process, and the Dirichlet Process:

Dirichlet Process: Suppose we want samples $x_i \sim G$, where $G$ is a distribution sampled from the Dirichlet Process $G \sim DP(G_0, \alpha)$.
Polya Urn Model: One way to generate these values $x_i$ would be to take a Polya Urn Model with color distribution $G_0$ and dispersion $\alpha$. ($x_i$ would be the color of the ith ball in the urn.)
Chinese Restaurant Process: Another way to generate $x_i$ would be to first assign tables to customers according to a Chinese Restaurant Process with dispersion $\alpha$. Every customer at the nth table would then be given the same value (color) sampled from $G_0$. ($x_i$ would be the value given to the ith customer; $x_i$ can also be thought of as the food at table $i$, or as the parameters of table $i$.)
Stick-Breaking Process: Finally, we could generate weights $w_k$ according to a Stick-Breaking Process with dispersion $\alpha$. Next, we would give each weight $w_k$ a value (or color) $v_k$ sampled from $G_0$. Finally, we would assign $x_i$ to value (color) $v_k$ with probability $w_k$.

Recap

Let’s summarize what we’ve discussed so far.

We have a bunch of data points $p_i$ that we want to cluster, and we’ve described four essentially equivalent generative models that allow us to describe how each cluster and point could have arisen.

In the Chinese Restaurant Process:

We generate table assignments $g_1, \ldots, g_n \sim CRP(\alpha)$ according to a Chinese Restaurant Process. ($g_i$ is the table assigned to datapoint $i$.)
We generate table parameters $\phi_1, \ldots, \phi_m \sim G_0$ according to the base distribution $G_0$, where $\phi_k$ is the parameter for the kth distinct group.
Given table assignments and table parameters, we generate each datapoint $p_i \sim F(\phi_{g_i})$ from a distribution $F$ with the specified table parameters. (For example, $F$ could be a Gaussian, and $\phi_i$ could be a parameter vector specifying the mean and standard deviation).

In the Polya Urn Model:

We generate colors $\phi_1, \ldots, \phi_n \sim Polya(G_0, \alpha)$ according to a Polya Urn Model. ($\phi_i$ is the color of the ith ball.)
Given ball colors, we generate each datapoint $p_i \sim F(\phi_i)$.

In the Stick-Breaking Process:

We generate group probabilities (stick lengths) $w_1, \ldots, w_{\infty} \sim Stick(\alpha)$ according to a Stick-Breaking process.
We generate group parameters $\phi_1, \ldots, \phi_{\infty} \sim G_0$ from $G_0$, where $\phi_k$ is the parameter for the kth distinct group.
We generate group assignments $g_1, \ldots, g_n \sim Multinomial(w_1, \ldots, w_{\infty})$ for each datapoint.
Given group assignments and group parameters, we generate each datapoint $p_i \sim F(\phi_{g_i})$.

In the Dirichlet Process:

We generate a distribution $G \sim DP(G_0, \alpha)$ from a Dirichlet Process with base distribution $G_0$ and dispersion parameter $\alpha$.
We generate group-level parameters $x_i \sim G$ from $G$, where $x_i$ is the group parameter for the ith datapoint. (Note: this is not the same as $\phi_i$. $x_i$ is the parameter associated to the group that the ith datapoint belongs to, whereas $\phi_k$ is the parameter of the kth distinct group.)
Given group-level parameters $x_i$, we generate each datapoint $p_i \sim F(x_i)$.

Also, remember that each model naturally allows the number of clusters to grow as more points come in.

Inference in the Dirichlet Process Mixture

So we’ve described a generative model that allows us to calculate the probability of any particular set of group assignments to data points, but we haven’t described how to actually learn a good set of group assignments.

Let’s briefly do this now. Very roughly, the Gibbs sampling approach works as follows:

Take the set of data points, and randomly initialize group assignments.
Pick a point. Fix the group assignments of all the other points, and assign the chosen point a new group (which can be either an existing cluster or a new cluster) with a CRP-ish probability (as described in the models above) that depends on the group assignments and values of all the other points.
We will eventually converge on a good set of group assignments, so repeat the previous step until happy.

For more details, this paper provides a good description. Philip Resnick and Eric Hardisty also have a friendlier, more general description of Gibbs sampling (plus an application to naive Bayes) here.

Fast Food Application: Clustering the McDonald’s Menu

Finally, let’s show an application of the Dirichlet Process Mixture. Unfortunately, I didn’t have a data set of people’s food habits offhand, so instead I took this list of McDonald’s foods and nutrition facts.

After normalizing each item to have an equal number of calories, and representing each item as a vector of (total fat, cholesterol, sodium, dietary fiber, sugars, protein, vitamin A, vitamin C, calcium, iron, calories from fat, satured fat, trans fat, carbohydrates), I ran scikit-learn’s Dirichlet Process Gaussian Mixture Model to cluster McDonald’s menu based on nutritional value.

First, how does the number of clusters inferred by the Dirichlet Process mixture vary as we feed in more (randomly ordered) points?

As expected, the Dirichlet Process model discovers more and more clusters as more and more food items arrive. (And indeed, the number of clusters appears to grow logarithmically, which can in fact be proved.)

How many clusters does the mixture model infer from the entire dataset? Running the Gibbs sampler several times, we find that the number of clusters tends around 11:

Let’s dive into one of these clusterings.

Cluster 1 (Desserts)

Looking at a sample of foods from the first cluster, we find a lot of desserts and dessert-y drinks:

Caramel Mocha
Frappe Caramel
Iced Hazelnut Latte
Iced Coffee
Strawberry Triple Thick Shake
Snack Size McFlurry
Hot Caramel Sundae
Baked Hot Apple Pie
Cinnamon Melts
Kiddie Cone
Strawberry Sundae

We can also look at the nutritional profile of some foods from this cluster (after z-scaling each nutrition dimension to have mean 0 and standard deviation 1):

We see that foods in this cluster tend to be high in trans fat and low in vitamins, protein, fiber, and sodium.

Cluster 2 (Sauces)

Here’s a sample from the second cluster, which contains a lot of sauces:

Hot Mustard Sauce
Spicy Buffalo Sauce
Newman’s Own Low Fat Balsamic Vinaigrette

And looking at the nutritional profile of points in this cluster, we see that it’s heavy in sodium and fat:

Cluster 3 (Burgers, Crispy Foods, High-Cholesterol)

The third cluster is very burgery:

Hamburger
Cheeseburger
Filet-O-Fish
Quarter Pounder with Cheese
Premium Grilled Chicken Club Sandwich
Ranch Snack Wrap
Premium Asian Salad with Crispy Chicken
Butter Garlic Croutons
Sausage McMuffin
Sausage McGriddles

It’s also high in fat and sodium, and low in carbs and sugar

Cluster 4 (Creamy Sauces)

Interestingly, even though we already found a cluster of sauces above, we discover another one as well. These sauces appear to be much more cream-based:

Creamy Ranch Sauce
Newman’s Own Creamy Caesar Dressing
Coffee Cream
Iced Coffee with Sugar Free Vanilla Syrup

Nutritionally, these sauces are higher in calories from fat, and much lower in sodium:

Cluster 5 (Salads)

Here’s a salad cluster. A lot of salads also appeared in the third cluster (along with hamburgers and McMuffins), but that’s because those salads also all contained crispy chicken. The salads in this cluster are either crisp-free or have their chicken grilled instead:

Premium Southwest Salad with Grilled Chicken
Premium Caesar Salad with Grilled Chicken
Side Salad
Premium Asian Salad without Chicken
Premium Bacon Ranch Salad without Chicken

This is reflected in the higher content of iron, vitamin A, and fiber:

Cluster 6 (More Sauces)

Again, we find another cluster of sauces:

Ketchup Packet
Barbeque Sauce
Chipotle Barbeque Sauce

These are still high in sodium, but much lower in fat compared to the other sauce clusters:

Cluster 7 (Fruit and Maple Oatmeal)

Amusingly, fruit and maple oatmeal is in a cluster by itself:

Fruit & Maple Oatmeal

Cluster 8 (Sugary Drinks)

We also get a cluster of sugary drinks:

Strawberry Banana Smoothie
Wild Berry Smoothie
Iced Nonfat Vanilla Latte
Nonfat Hazelnut
Nonfat Vanilla Cappuccino
Nonfat Caramel Cappuccino
Sweet Tea
Frozen Strawberry Lemonade
Coca-Cola
Minute Maid Orange Juice

In addition to high sugar content, this cluster is also high in carbohydrates and calcium, and low in fat.

Cluster 9 (Breakfast Foods)

Here’s a cluster of high-cholesterol breakfast foods:

Sausage McMuffin with Egg
Sausage Burrito
Egg McMuffin
Bacon, Egg & Cheese Biscuit
McSkillet Burrito with Sausage
Big Breakfast with Hotcakes

Cluster 10 (Coffee Drinks)

We find a group of coffee drinks next:

Nonfat Cappuccino
Nonfat Latte
Nonfat Latte with Sugar Free Vanilla Syrup
Iced Nonfat Latte

These are much higher in calcium and protein, and lower in sugar, than the other drink cluster above:

Cluster 11 (Apples)

Here’s a cluster of apples:

Apple Dippers with Low Fat Caramel Dip
Apple Slices

Vitamin C, check.

And finally, here’s an overview of all the clusters at once (using a different clustering run):

No More!

I’ll end with a couple notes:

Kevin Knight has a hilarious introduction to Bayesian inference that describes some applications of nonparametric Bayesian techniques to computational linguistics (though I don’t think he ever quite says “nonparametric Bayes” directly).
In the Chinese Restaurant Process, each customer sits at a single table. The Indian Buffet Process is an extension that allows customers to sample food from multiple tables (i.e., belong to multiple clusters).
The Chinese Restaurant Process, the Polya Urn Model, and the Stick-Breaking Process are all sequential models for generating groups: to figure out table parameters in the CRP, for example, you wait for customer 1 to come in, then customer 2, then customer 3, and so on. The equivalent Dirichlet Process, on the other hand, is a parallel model for generating groups: just sample $G \sim DP(G_0, alpha)$, and then all your group parameters can be independently generated by sampling from $G$ at once. This duality is an instance of a more general phenomenon known as de Finetti’s theorem.

And that’s it.

Instant Interactive Visualization with d3 + ggplot2

2012-03-05T04:15:00+01:00

It’s often easier to understand a chart than a table. So why is it still so hard to make a simple data graphic, and why am I still bombarded by mind-numbing reams of raw numbers?

(Yeah, I love ggplot2 to death. But sometimes I want a little more interaction, and sometimes all I want is to drag-and-drop and be done.)

So I’ve been experimenting with a small, ggplot2-inspired d3 app.

Simply drop a file, and bam! Instant scatterplot:

But wait – that’s only 2 dimensions. You can add some more through color, size, and groups:

(Click here to play with the data yourself.)

And you can easily switch which variables are getting plotted, and see all the information associated with each point.

(Same dataset, different aesthetic assignments.)

I’m thinking of adding more kinds of charts, support for categorical variables, more interactivity (sliders to interact with other dimensions?!), and making the UI even easier (e.g., simplify column naming). In the meantime, the code is here on Github, and tips and suggestions are welcome!

Movie Recommendations and More via MapReduce and Scalding

2012-02-09T04:15:00+01:00

Scalding is an in-house MapReduce framework that Twitter recently open-sourced. Like Pig, it provides an abstraction on top of MapReduce that makes it easy to write big data jobs in a syntax that’s simple and concise. Unlike Pig, Scalding is written in pure Scala – which means all the power of Scala and the JVM is already built-in. No more UDFs, folks!

This is going to be an in-your-face introduction to Scalding, Twitter’s (Scala + Cascading) MapReduce framework.

In 140: instead of forcing you to write raw map and reduce functions, Scalding allows you to write natural code like

Not much different from the Ruby you’d write to compute tweet distributions over small data? Exactly.

Two notes before we begin:

This Github repository contains all the code used.
For a gentler introduction to Scalding, see this Getting Started guide on the Scalding wiki.

Movie Similarities

Imagine you run an online movie business, and you want to generate movie recommendations. You have a rating system (people can rate movies with 1 to 5 stars), and we’ll assume for simplicity that all of the ratings are stored in a TSV file somewhere.

Let’s start by reading the ratings into a Scalding job.

You want to calculate how similar pairs of movies are, so that if someone watches The Lion King, you can recommend films like Toy Story. So how should you define the similarity between two movies?

One way is to use their correlation:

For every pair of movies A and B, find all the people who rated both A and B.
Use these ratings to form a Movie A vector and a Movie B vector.
Calculate the correlation between these two vectors.
Whenever someone watches a movie, you can then recommend the movies most correlated with it.

Let’s start with the first two steps.

Before using these rating pairs to calculate correlation, let’s stop for a bit.

Since we’re explicitly thinking of movies as vectors of ratings, it’s natural to compute some very vector-y things like norms and dot products, as well as the length of each vector and the sum over all elements in each vector. So let’s compute these:

To summarize, each row in vectorCalcs now contains the following fields:

movie, movie2
numRaters, numRaters2: the total number of people who rated each movie
size: the number of people who rated both movie and movie2
dotProduct: dot product between the movie vector (a vector of ratings) and the movie2 vector (also a vector of ratings)
ratingSum, rating2sum: sum over all elements in each ratings vector
ratingNormSq, rating2Normsq: squared norm of each vector

So let’s go back to calculating the correlation between movie and movie2. We could, of course, calculate correlation in the standard way: find the covariance between the movie and movie2 ratings, and divide by their standard deviations.

But recall that we can also write correlation in the following form:

$Corr(X, Y) = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - (\sum x)^2} \sqrt{n \sum y^2 - (\sum y)^2}}$

(See the Wikipedia page on correlation.)

Notice that every one of the elements in this formula is a field in vectorCalcs! So instead of using the standard calculation, we can use this form instead:

And that’s it! To see the full code, check out the Github repository here.

Book Similarities

Let’s run this code over some real data. Unfortunately, I didn’t have a clean source of movie ratings available, so instead I used this dataset of 1 million book ratings.

I ran a quick command, using the handy scald.rb script that Scalding provides…

…and here’s a sample of the top output I got:

As we’d expect, we see that

Harry Potter books are similar to other Harry Potter books
Lord of the Rings books are similar to other Lord of the Rings books
Tom Clancy is similar to John Grisham
Chick lit (Summer Sisters, by Judy Blume) is similar to chick lit (Bridget Jones)

Just for fun, let’s also look at books similar to The Great Gatsby:

(Schoolboy memories, exactly.)

More Similarity Measures

Of course, there are lots of other similarity measures we could use besides correlation.

Cosine Similarity

Cosine similarity is a another common vector-based similarity measure.

Correlation, Take II

We can also also add a regularized correlation, by (say) adding N virtual movie pairs that have zero correlation. This helps avoid noise if some movie pairs have very few raters in common (for example, The Great Gatsby had an unlikely raw correlation of 1 with many other books, due simply to the fact that those book pairs had very few ratings).

Jaccard Similarity

Recall that one of the lessons of the Netflix prize was that implicit data can be quite useful – the mere fact that you rate a James Bond movie, even if you rate it quite horribly, suggests that you’d probably be interested in similar action films. So we can also ignore the value itself of each rating and use a set-based similarity measure like Jaccard similarity.

Incorporation

Finally, let’s add all these similarity measures to our output.

Book Similarities Revisited

Let’s take another look at the book similarities above, now that we have these new fields.

Here are some of the top Book-Crossing pairs, sorted by their shrunk correlation:

Notice how regularization affects things: the Dark Tower pair has a pretty high raw correlation, but relatively few ratings (reducing our confidence in the raw correlation), so it ends up below the others.

And here are books similar to The Great Gatsby, this time ordered by cosine similarity:

Input Abstraction

So our code right now is tied to our specific ratings.tsv input. But what if we change the way we store our ratings, or what if we want to generate similarities for something entirely different?

Let’s abstract away our input. We’ll create a VectorSimilarities class that represents input data in the following format:

Whenever we want to define a new input format, we simply subclass VectorSimilarities and provide a concrete implementation of the input method.

Book-Crossings

For example, here’s a class I could have used to generate the book recommendations above:

The input method simply reads from a TSV file and lets the VectorSimilarities superclass do all the work. Instant recommendations, BOOM.

Song Similarities with Twitter + iTunes

But why limit ourselves to books? We do, after all, have Twitter at our fingertips…

rated Born This Way by Lady GaGa 5 stars itun.es/iSg92N #iTunes
— gggf (@GalMusic92) February 8, 2012

Since iTunes lets you send a tweet whenever you rate a song, we can use these to generate music recommendations!

Again, we create a new class that overrides the abstract input defined in VectorSimilarities…

…and snap! Here are some songs you might like if you recently listened to Beyoncé:

And some recommended songs if you like Lady Gaga:

GG Pandora.

Location Similarities with Foursquare Check-ins

But what if we don’t have explicit ratings? For example, we could be a news site that wants to generate article recommendations, and maybe we only have user visits on each story.

Or what if we want to generate restaurant or tourist recommendations, when all we know is who visits each location?

I’m at Empire State Building (350 5th Ave., btwn 33rd & 34th St., New York) 4sq.com/zZ5xGd
— Simon Ackerman (@SimonAckerman) February 8, 2012

Let’s finally make Foursquare check-ins useful. (I kid, I kid.)

Instead of using an explicit rating given to us, we can simply generate a dummy rating of 1 for each check-in. Correlation doesn’t make sense any more, but we can still pay attention to a measure like Jaccard simiilarity.

So we simply create a new class that scrapes tweets for Foursquare check-in information…

…and bam! Here are locations similar to the Empire State Building:

Here are places you might want to check out, if you check-in at Bergdorf Goodman:

And here’s where to go after the Statue of Liberty:

Power of Twitter, yo.

RottenTomatoes Similarities

UPDATE: I found some movie data after all…

My review for ‘How to Train Your Dragon’ on Rotten Tomatoes: 4 1/2 stars >bit.ly/xtw3d3
— Benjamin West (@BenTheWest) February 21, 2012

So let’s use RottenTomatoes tweets to recommend movies! Here’s the code for a class that searches for RottenTomatoes tweets:

And here are the most similar movies discovered:

We see that

Lord of the Rings, Harry Potter, and Star Wars movies are similar to other Lord of the Rings, Harry Potter, and Star Wars movies
Big science fiction blockbusters (Avatar) are similar to big science fiction blockbusters (Inception)
People who like one Justin Timberlake movie (Bad Teacher) also like other Justin Timberlake Movies (In Time). Similarly with Michael Fassbender (A Dangerous Method, Shame)
Art house movies (The Tree of Life) stick together (Tinker Tailor Soldier Spy)

Let’s also look at the movies with the most negative correlation:

(The more you like loud and dirty popcorn movies (Thor) and vamp romance (Twilight), the less you like arthouse? SGTM.)

Next Steps

Hopefully I gave you a taste of the awesomeness of Scalding. To learn even more:

Check out Scalding on Github.
Read this Getting Started Guide on the Scalding wiki.
Run through this code-based introduction, complete with Scalding jobs that you can run in local mode.
Browse the API reference, which also contains many code snippets illustrating different Scalding functions (e.g., map, filter, flatMap, groupBy, count, join).
And all the code for this post is here.

Watch out for more documentation soon, and you should most definitely follow @Scalding on Twitter for updates or to ask any questions.

Mad Props

And finally, a huge shoutout to Argyris Zymnis, Avi Bryant, and Oscar Boykin, the mastermind hackers who have spent (and continue spending!) unimaginable hours making Scalding a joy to use.

@argyris, @avibryant, @posco: Thanks for it all. #awesomejobguys #loveit

Quick Introduction to ggplot2

2012-01-17T04:15:00+01:00

This is a bare-bones introduction to ggplot2, a visualization package in R. It assumes no knowledge of R.

For a better-looking version of this post, see this Github repository, which also contains some of the example datasets I use and a literate programming version of this tutorial.

Preview

Let’s start with a preview of what ggplot2 can do.

Given Fisher’s iris data set and one simple command…

qplot(Sepal.Length, Petal.Length, data = iris, color = Species)
  

…we can produce this plot of sepal length vs. petal length, colored by species.

Installation

You can download R here. After installation, you can launch R in interactive mode by either typing R on the command line or opening the standard GUI (which should have been included in the download).

R Basics

Vectors

Vectors are a core data structure in R, and are created with c(). Elements in a vector must be of the same type.

numbers = c(23, 13, 5, 7, 31)
  names = c("edwin", "alice", "bob")
  

Elements are indexed starting at 1, and are accessed with [] notation.

numbers[1] # 23
  names[1] # edwin
  

Data frames

Data frames are like matrices, but with named columns of different types (similar to database tables).

books = data.frame(
    title = c("harry potter", "war and peace", "lord of the rings"), # column named "title"
    author = c("rowling", "tolstoy", "tolkien"),
    num_pages = c("350", "875", "500")
  )
  

You can access columns of a data frame with $.

books$title # c("harry potter", "war and peace", "lord of the rings")
  books$author[1] # "rowling"
  

You can also create new columns with $.

books$num_bought_today = c(10, 5, 8)
  books$num_bought_yesterday = c(18, 13, 20)
    
  books$total\_num\_bought = books$num_bought_today + books$num_bought_yesterday
  

read.table

Suppose you want to import a TSV file into R as a data frame.

tsv file without header

For example, consider the data/students.tsv file (with columns describing each student’s age, test score, and name).

 100 alice
 95  bob
 82  eve
  

We can import this file into R using read.table().

students = read.table("data/students.tsv",
    header = F, # file does not contain a header (`F` is short for `FALSE`), so we must manually specify column names                    
    sep = "\t", # file is tab-delimited        
    col.names = c("age", "score", "name") # column names
  )
  

We can now access the different columns in the data frame with students$age, students$score, and students$name.

csv file with header

For an example of a file in a different format, look at the data/studentsWithHeader.tsv file.

age,score,name
  13,100,alice
  14,95,bob
  13,82,eve
  

Here we have the same data, but now the file is comma-delimited and contains a header. We can import this file with

students = read.table("data/students.tsv",
    sep = ",",
    header = T  # first line contains column names, so we can immediately call `students$age`        
  )
  

(Note: there is also a read.csv function that uses sep = "," by default.)

help

There are many more options that read.table can take. For a list of these, just type help(read.table) (or ?read.table) at the prompt to access documentation.

# These work for other functions as well.
  help(read.table)
  ?read.table
  

ggplot2

With these R basics in place, let’s dive into the ggplot2 package.

Installation

One of R’s greatest strengths is its excellent set of packages. To install a package, you can use the install.packages() function.

install.packages("ggplot2")
  

To load a package into your current R session, use library().

library(ggplot2)
  

Scatterplots with qplot()

Let’s look at how to create a scatterplot in ggplot2. We’ll use the iris data frame that’s automatically loaded into R.

What does the data frame contain? We can use the head function to look at the first few rows.

head(iris) # by default, head displays the first 6 rows. see `?head`
  head(iris, n = 10) # we can also explicitly set the number of rows to display
  
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
           5.1         3.5          1.4         0.2  setosa
           4.9         3.0          1.4         0.2  setosa
           4.7         3.2          1.3         0.2  setosa
           4.6         3.1          1.5         0.2  setosa
           5.0         3.6          1.4         0.2  setosa
           5.4         3.9          1.7         0.4  setosa
  

(The data frame actually contains three types of species: setosa, versicolor, and virginica.)

Let’s plot Sepal.Length against Petal.Length using ggplot2’s qplot() function.

qplot(Sepal.Length, Petal.Length, data = iris)
  # Plot Sepal.Length vs. Petal.Length, using data from the `iris` data frame.
  # * First argument `Sepal.Length` goes on the x-axis.
  # * Second argument `Petal.Length` goes on the y-axis.
  # * `data = iris` means to look for this data in the `iris` data frame.    
  

To see where each species is located in this graph, we can color each point by adding a color = Species argument.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species) # dude!
  

Similarly, we can let the size of each point denote sepal width, by adding a size = Sepal.Width argument.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width)
  # We see that Iris setosa flowers have the narrowest petals.
  

qplot(Sepal.Length, Petal.Length, data = iris, color = Species, size = Petal.Width, alpha = I(0.7))
  # By setting the alpha of each point to 0.7, we reduce the effects of overplotting.
  

Finally, let’s fix the axis labels and add a title to the plot.

qplot(Sepal.Length, Petal.Length, data = iris, color = Species,
    xlab = "Sepal Length", ylab = "Petal Length",
    main = "Sepal vs. Petal Length in Fisher's Iris data")
  

Other common geoms

In the scatterplot examples above, we implicitly used a point geom, the default when you supply two arguments to qplot().

# These two invocations are equivalent.
  qplot(Sepal.Length, Petal.Length, data = iris, geom = "point")
  qplot(Sepal.Length, Petal.Length, data = iris)
  

But we can also easily use other types of geoms to create more kinds of plots.

Barcharts: geom = “bar”

movies = data.frame(
    director = c("spielberg", "spielberg", "spielberg", "jackson", "jackson"),
    movie = c("jaws", "avatar", "schindler's list", "lotr", "king kong"),
    minutes = c(124, 163, 195, 600, 187)
  )
  
  # Plot the number of movies each director has.
  qplot(director, data = movies, geom = "bar", ylab = "# movies")
  # By default, the height of each bar is simply a count.
  

# But we can also supply a different weight.
  # Here the height of each bar is the total running time of the director's movies.
  qplot(director, weight = minutes, data = movies, geom = "bar", ylab = "total length (min.)")
  

Line charts: geom = “line”

qplot(Sepal.Length, Petal.Length, data = iris, geom = "line", color = Species)
  # Using a line geom doesn't really make sense here, but hey.
  

# `Orange` is another built-in data frame that describes the growth of orange trees.
  qplot(age, circumference, data = Orange, geom = "line",
    colour = Tree,
    main = "How does orange tree circumference vary with age?")
  

# We can also plot both points and lines.
  qplot(age, circumference, data = Orange, geom = c("point", "line"), colour = Tree)
  

And that’s it with what I’ll cover.

Next Steps

I skipped over a lot of aspects of R and ggplot2 in this intro.

For example,

There are many geoms (and other functionalities) in ggplot2 that I didn’t cover, e.g., boxplots and histograms.
I didn’t talk about ggplot2’s layering system, or the grammar of graphics it’s based on.

So I’ll end with some additional resources on R and ggplot2.

I don’t use it myself, but RStudio is a popular IDE for R.
The official ggplot2 documentation is great and has lots of examples. There’s also an excellent book.
plyr is another fantastic R package that’s also by Hadley Wickham (the author of ggplot2).
The official R introduction is okay, but definitely not great. I haven’t found any R tutorials I really like, but I’ve heard good things about The Art of R Programming.

Introduction to Conditional Random Fields

2012-01-03T00:00:00+01:00

Imagine you have a sequence of snapshots from a day in Justin Bieber’s life, and you want to label each image with the activity it represents (eating, sleeping, driving, etc.). How can you do this?

One way is to ignore the sequential nature of the snapshots, and build a per-image classifier. For example, given a month’s worth of labeled snapshots, you might learn that dark images taken at 6am tend to be about sleeping, images with lots of bright colors tend to be about dancing, images of cars are about driving, and so on.

By ignoring this sequential aspect, however, you lose a lot of information. For example, what happens if you see a close-up picture of a mouth – is it about singing or eating? If you know that the previous image is a picture of Justin Bieber eating or cooking, then it’s more likely this picture is about eating; if, however, the previous image contains Justin Bieber singing or dancing, then this one probably shows him singing as well.

Thus, to increase the accuracy of our labeler, we should incorporate the labels of nearby photos, and this is precisely what a conditional random field does.

Part-of-Speech Tagging

Let’s go into some more detail, using the more common example of part-of-speech tagging.

In POS tagging, the goal is to label a sentence (a sequence of words or tokens) with tags like ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE.

For example, given the sentence “Bob drank coffee at Starbucks”, the labeling might be “Bob (NOUN) drank (VERB) coffee (NOUN) at (PREPOSITION) Starbucks (NOUN)”.

So let’s build a conditional random field to label sentences with their parts of speech. Just like any classifier, we’ll first need to decide on a set of feature functions $f_i$.

Feature Functions in a CRF

In a CRF, each feature function is a function that takes in as input:

a sentence s
the position i of a word in the sentence
the label $l_i$ of the current word
the label $l_{i-1}$ of the previous word

and outputs a real-valued number (though the numbers are often just either 0 or 1).

(Note: by restricting our features to depend on only the current and previous labels, rather than arbitrary labels throughout the sentence, I’m actually building the special case of a linear-chain CRF. For simplicity, I’m going to ignore general CRFs in this post.)

For example, one possible feature function could measure how much we suspect that the current word should be labeled as an adjective given that the previous word is “very”.

Features to Probabilities

Next, assign each feature function $f_j$ a weight $\lambda_j$ (I’ll talk below about how to learn these weights from the data). Given a sentence s, we can now score a labeling l of s by adding up the weighted features over all words in the sentence:

$score(l | s) = \sum_{j = 1}^m \sum_{i = 1}^n \lambda_j f_j(s, i, l_i, l_{i-1})$

(The first sum runs over each feature function $j$, and the inner sum runs over each position $i$ of the sentence.)

Finally, we can transform these scores into probabilities $p(l | s)$ between 0 and 1 by exponentiating and normalizing:

$p(l | s) = \frac{exp[score(l|s)]}{\sum_{l’} exp[score(l’|s)]} = \frac{exp[\sum_{j = 1}^m \sum_{i = 1}^n \lambda_j f_j(s, i, l_i, l_{i-1})]}{\sum_{l’} exp[\sum_{j = 1}^m \sum_{i = 1}^n \lambda_j f_j(s, i, l’_i, l’_{i-1})]}$

Example Feature Functions

So what do these feature functions look like? Examples of POS tagging features could include:

$f_1(s, i, l_i, l_{i-1}) = 1$ if $l_i =$ ADVERB and the ith word ends in “-ly”; 0 otherwise. ** If the weight $\lambda_1$ associated with this feature is large and positive, then this feature is essentially saying that we prefer labelings where words ending in -ly get labeled as ADVERB.
$f_2(s, i, l_i, l_{i-1}) = 1$ if $i = 1$, $l_i =$ VERB, and the sentence ends in a question mark; 0 otherwise. ** Again, if the weight $\lambda_2$ associated with this feature is large and positive, then labelings that assign VERB to the first word in a question (e.g., “Is this a sentence beginning with a verb?”) are preferred.
$f_3(s, i, l_i, l_{i-1}) = 1$ if $l_{i-1} =$ ADJECTIVE and $l_i =$ NOUN; 0 otherwise. ** Again, a positive weight for this feature means that adjectives tend to be followed by nouns.
$f_4(s, i, l_i, l_{i-1}) = 1$ if $l_{i-1} =$ PREPOSITION and $l_i =$ PREPOSITION. ** A negative weight $\lambda_4$ for this function would mean that prepositions don’t tend to follow prepositions, so we should avoid labelings where this happens.

And that’s it! To sum up: to build a conditional random field, you just define a bunch of feature functions (which can depend on the entire sentence, a current position, and nearby labels), assign them weights, and add them all together, transforming at the end to a probability if necessary.

Now let’s step back and compare CRFs to some other common machine learning techniques.

Smells like Logistic Regression…

The form of the CRF probabilities $p(l | s) = \frac{exp[\sum_{j = 1}^m \sum_{i = 1}^n f_j(s, i, l_i, l_{i-1})]}{\sum_{l’} exp[\sum_{j = 1}^m \sum_{i = 1}^n f_j(s, i, l’_i, l’_{i-1})]}$ might look familiar.

That’s because CRFs are indeed basically the sequential version of logistic regression: whereas logistic regression is a log-linear model for classification, CRFs are a log-linear model for sequential labels.

Looks like HMMs…

Recall that Hidden Markov Models are another model for part-of-speech tagging (and sequential labeling in general). Whereas CRFs throw any bunch of functions together to get a label score, HMMs take a generative approach to labeling, defining

$p(l,s) = p(l_1) \prod_i p(l_i | l_{i-1}) p(w_i | l_i)$

where

$p(l_i | l_{i-1})$ are transition probabilities (e.g., the probability that a preposition is followed by a noun);
$p(w_i | l_i)$ are emission probabilities (e.g., the probability that a noun emits the word “dad”).

So how do HMMs compare to CRFs? CRFs are more powerful – they can model everything HMMs can and more. One way of seeing this is as follows.

Note that the log of the HMM probability is $\log p(l,s) = \log p(l_0) + \sum_i \log p(l_i | l_{i-1}) + \sum_i \log p(w_i | l_i)$. This has exactly the log-linear form of a CRF if we consider these log-probabilities to be the weights associated to binary transition and emission indicator features.

That is, we can build a CRF equivalent to any HMM by…

For each HMM transition probability $p(l_i = y | l_{i-1} = x)$, define a set of CRF transition features of the form $f_{x,y}(s, i, l_i, l_{i-1}) = 1$ if $l_i = y$ and $l_{i-1} = x$. Give each feature a weight of $w_{x,y} = \log p(l_i = y | l_{i-1} = x)$.
Similarly, for each HMM emission probability $p(w_i = z | l_{i} = x)$, define a set of CRF emission features of the form $g_{x,y}(s, i, l_i, l_{i-1}) = 1$ if $w_i = z$ and $l_i = x$. Give each feature a weight of $w_{x,z} = \log p(w_i = z | l_i = x)$.

Thus, the score $p(l|s)$ computed by a CRF using these feature functions is precisely proportional to the score computed by the associated HMM, and so every HMM is equivalent to some CRF.

However, CRFs can model a much richer set of label distributions as well, for two main reasons:

CRFs can define a much larger set of features. Whereas HMMs are necessarily local in nature (because they’re constrained to binary transition and emission feature functions, which force each word to depend only on the current label and each label to depend only on the previous label), CRFs can use more global features. For example, one of the features in our POS tagger above increased the probability of labelings that tagged the first word of a sentence as a VERB if the end of the sentence contained a question mark.
CRFs can have arbitrary weights. Whereas the probabilities of an HMM must satisfy certain constraints (e.g., $0 <= p(w_i | l_i) <= 1, \sum_w p(w_i = w | l_1) = 1)$, the weights of a CRF are unrestricted (e.g., $\log p(w_i | l_i)$ can be anything it wants).

Learning Weights

Let’s go back to the question of how to learn the feature weights in a CRF. One way, unsurprisingly, is to use gradient descent.

Assume we have a bunch of training examples (sentences and associated part-of-speech labels). Randomly initialize the weights of our CRF model. To shift these randomly initialized weights to the correct ones, for each training example…

Go through each feature function $f_i$, and calculate the gradient of the log probability of the training example with respect to $\lambda_i$: $\frac{\partial}{\partial w_j} \log p(l | s) = \sum_{j = 1}^m f_i(s, j, l_j, l_{j-1}) - \sum_{l’} p(l’ | s) \sum_{j = 1}^m f_i(s, j, l’_j, l’_{j-1})$
Note that the first term in the gradient is the contribution of feature $f_i$ under the true label, and the second term in the gradient is the expected contribution of feature $f_i$ under the current model. This is exactly the form you’d expect gradient ascent to take.
Move $\lambda_i$ in the direction of the gradient: $\lambda_i = \lambda_i + \alpha [\sum_{j = 1}^m f_i(s, j, l_j, l_{j-1}) - \sum_{l’} p(l’ | s) \sum_{j = 1}^m f_i(s, j, l’_j, l’_{j-1})]$ where $\alpha$ is some learning rate.
Repeat the previous steps until some stopping condition is reached (e.g., the updates fall below some threshold).

In other words, every step takes the difference between what we want the model to learn and the model’s current state, and moves $\lambda_i$ in the direction of this difference.

Finding the Optimal Labeling

Suppose we’ve trained our CRF model, and now a new sentence comes in. How do we do label it?

The naive way is to calculate $p(l | s)$ for every possible labeling l, and then choose the label that maximizes this probability. However, since there are $k^m$ possible labels for a tag set of size k and a sentence of length m, this approach would have to check an exponential number of labels.

A better way is to realize that (linear-chain) CRFs satisfy an optimal substructure property that allows us to use a (polynomial-time) dynamic programming algorithm to find the optimal label, similar to the Viterbi algorithm for HMMs.

A More Interesting Application

Okay, so part-of-speech tagging is kind of boring, and there are plenty of existing POS taggers out there. When might you use a CRF in real life?

Suppose you want to mine Twitter for the types of presents people received for Christmas:

What people on Twitter wanted for Christmas, and what they got: twitter.com/edchedch/statu…
— Edwin Chen (@echen) January 2, 2012

How can you figure out which words refer to gifts?

To gather data for the graphs above, I simply looked for phrases of the form “I want XXX for Christmas” and “I got XXX for Christmas”. However, a more sophisticated CRF variant could use a GIFT part-of-speech-like tag (even adding other tags like GIFT-GIVER and GIFT-RECEIVER, to get even more information on who got what from whom) and treat this like a POS tagging problem. Features could be based around things like “this word is a GIFT if the previous word was a GIFT-RECEIVER and the word before that was ‘gave’” or “this word is a GIFT if the next two words are ‘for Christmas’”.

Winning the Netflix Prize: A Summary

2011-10-24T04:15:00+02:00

How was the Netflix Prize won? I went through a lot of the Netflix Prize papers a couple years ago, so I’ll try to give an overview of the techniques that went into the winning solution here.

Normalization of Global Effects

Suppose Alice rates Inception 4 stars. We can think of this rating as composed of several parts:

A baseline rating (e.g., maybe the mean over all user-movie ratings is 3.1 stars).
An Alice-specific effect (e.g., maybe Alice tends to rate movies lower than the average user, so her ratings are -0.5 stars lower than we normally expect).
An Inception-specific effect (e.g., Inception is a pretty awesome movie, so its ratings are 0.7 stars higher than we normally expect).
A less predictable effect based on the specific interaction between Alice and Inception that accounts for the remainder of the stars (e.g., Alice really liked Inception because of its particular combination of Leonardo DiCaprio and neuroscience, so this rating gets an additional 0.7 stars).

In other words, we’ve decomposed the 4-star rating into: 4 = [3.1 (the baseline rating) - 0.5 (the Alice effect) + 0.7 (the Inception effect)] + 0.7 (the specific interaction)

So instead of having our models predict the 4-star rating itself, we could first try to remove the effect of the baseline predictors (the first three components) and have them predict the specific 0.7 stars. (I guess you can also think of this as a simple kind of boosting.)

More generally, additional baseline predictors include:

A factor that allows Alice’s rating to (linearly) depend on the (square root of the) number of days since her first rating. (For example, have you ever noticed that you become a harsher critic over time?)
A factor that allows Alice’s rating to depend on the number of days since the movie’s first rating by anyone. (If you’re one of the first people to watch it, maybe it’s because you’re a huge fan and really excited to see it on DVD, so you’ll tend to rate it higher.)
A factor that allows Alice’s rating to depend on the number of people who have rated Inception. (Maybe Alice is a hipster who hates being part of the crowd.)
A factor that allows Alice’s rating to depend on the movie’s overall rating.
(Plus a bunch of others.)

And, in fact, modeling these biases turned out to be fairly important: in their paper describing their final solution to the Netflix Prize, Bell and Koren write that

Of the numerous new algorithmic contributions, I would like to highlight one – those humble baseline predictors (or biases), which capture main effects in the data. While the literature mostly concentrates on the more sophisticated algorithmic aspects, we have learned that an accurate treatment of main effects is probably at least as signficant as coming up with modeling breakthroughs.

(For a perhaps more concrete example of why removing these biases is useful, suppose you know that Bob likes the same kinds of movies that Alice does. To predict Bob’s rating of Inception, instead of simply predicting the same 4 stars that Alice rated, if we know that Bob tends to rate movies 0.3 stars higher than average, then we could first remove Alice’s bias and then add in Bob’s: 4 + 0.5 + 0.3 = 4.8.)

Neighborhood Models

Let’s now look at some slightly more sophisticated models. As alluded to in the section above, one of the standard approaches to collaborative filtering is to use neighborhood models.

Briefly, a neighborhood model works as follows. To predict Alice’s rating of Titanic, you could do two things:

Item-item approach: find a set of items similar to Titanic that Alice has also rated, and take the (weighted) mean of Alice’s ratings on them.
User-user approach: find a set of users similar to Alice who rated Titanic, and again take the mean of their ratings of Titanic.

(See also my post on item-to-item collaborative filtering on Amazon.)

The main questions, then, are (let’s stick to the item-item approach for simplicity):

How do we find the set of similar items?
How do we weight these items when taking their mean?

The standard approach is to take some similarity metric (e.g., correlation or a Jaccard index) to define similarities between pairs of movies, take the K most similar movies under this metric (where K is perhaps chosen via cross-validation), and then use the same similarity metric when computing the weighted mean.

This has a couple problems:

Neighbors aren’t independent, so using a standard similarity metric to define a weighted mean overcounts information. For example, suppose you ask five friends where you should eat tonight. Three of them went to Mexico last week and are sick of burritos, so they strongly recommend against a taqueria. Thus, your friends’ recommendations have a stronger bias than what you’d get if you asked five friends who didn’t know each other at all. (Compare with the situation where all three Lord of the Rings Movies are neighbors of Harry Potter.)
Different movies should perhaps be using different numbers of neighbors. Some movies may be predicted well by only one neighbor (e.g., Harry Potter 2 could be predicted well by Harry Potter 1 alone), some movies may require more, and some movies may have no good neighbors (so you should ignore your neighborhood algorithms entirely and let your other ratings models stand on their own).

So another approach is the following:

You can still use a similarity metric like correlation or cosine similarity to choose the set of similar items.
But instead of using the similarity metric to define the interpolation weights in the mean calculations, you essentially perform a (sparse) linear regression to find the weights that minimize the squared error between an item’s rating and a linear combination of the ratings of its neighbors. Note that these weights are no longer constrained, so that if all neighbors are weak, then their weights will be close to zero and the neighborhood model will have a low effect.

(A slightly more complicated user-user approach, similar to this item-item neighborhood approach, is also useful.)

Implicit Data

Adding on to the neighborhood approach, we can also let implicit data influence our predictions. The mere fact that a user rated lots of science fiction movies but no westerns, suggests that the user likes science fiction better than cowboys. So using a similar framework as in the neighborhood ratings model, we can learn for Inception a set of offset weights associated to Inception’s movie neighbors.

Whenever we want to predict how Bob rates Inception, we look at whether Bob rated each of Inception’s neighbors. If he did, we add in the corresponding offset; if not, then we add nothing (and, thus, Bob’s rating is implicitly penalized by the missing weight).

Matrix Factorization

Complementing the neighborhood approach to collaborative filtering is the matrix factorization approach. Whereas the neighborhood approach takes a very local approach to ratings (if you liked Harry Potter 1, then you’ll like Harry Potter 2!), the factorization approach takes a more global view (we know that you like fantasy movies and that Harry Potter has a strong fantasy element, so we think that you’ll like Harry Potter) that decomposes users and movies into a set of latent factors (which we can think of as categories like “fantasy” or “violence”).

In fact, matrix factorization methods were probably the most important class of techniques for winning the Netflix Prize. In their 2008 Progress Prize paper, Bell and Koren write

It seems that models based on matrix-factorization were found to be most accurate (and thus popular), as evident by recent publications and discussions on the Netflix Prize forum. We definitely agree to that, and would like to add that those matrix-factorization models also offer the important flexibility needed for modeling temporal effects and the binary view. Nonetheless, neighborhood models, which have been dominating most of the collaborative filtering literature, are still expected to be popular due to their practical characteristics - being able to handle new users/ratings without re-training and offering direct explanations to the recommendations.

The typical way to perform matrix factorizations is to perform a singular value decomposition on the (sparse) ratings matrix (using stochastic gradient descent and regularizing the weights of the factors, possibly constraining the weights to be positive to get a type of non-negative matrix factorization). (Note that this “SVD” is a little different from the standard SVD learned in linear algebra, since not every user has rated every movie and so the ratings matrix contains many missing elements that we don’t want to simply treat as 0.)

Some SVD-inspired methods used in the Netflix Prize include:

Standard SVD: Once you’ve represented users and movies as factor vectors, you can dot product Alice’s vector with Inception’s vector to get Alice’s predicted rating of Inception.
Asymmetric SVD: Instead of users having their own notion of factor vectors, we can represent users as a bag of items they have rated (or provided implicit feedback for). So Alice is now represented as a (possibly weighted) sum of the factor vectors of the items she has rated, and to get her predicted rating of Titanic, we can dot product this representation with the factor vector of Titanic. From a practical perspective, this model has an added benefit in that no user parameterizations are needed, so we can use this approach to generate recommendations as soon as a user provides some feedback (which could just be views or clicks on an item, and not necessarily ratings), without needing to retrain the model to factorize the user.
SVD++: Incorporate both the standard SVD and the asymmetric SVD model by representing users both by their own factor representation and as a bag of item vectors.

Regression

Some regression models were also used in the predictions. The models are fairly standard, I think, so I won’t spend too long here. Basically, just as with the neighborhood models, we can take a user-centric approach and a movie-centric approach to regression:

User-centric approach: We learn a regression model for each user, using all the movies that the user rated as the dataset. The response is the movie’s rating, and the predictor variables are attributes associated to that movie (which can be derived from, say, PCA, MDS, or an SVD).
Movie-centric approach: Similarly, we can learn a regression model for each movie, using all the users that rated the movie as the dataset.

Restricted Boltzmann Machines

Restricted Boltzmann Machines provide another kind of latent factor approach that can be used. See this paper for a description of how to apply them to the Netflix Prize. (In case the paper’s a little difficult to read, I wrote an introduction to RBMs a little while ago.)

Temporal Effects

Many of the models incorporate temporal effects. For example, when describing the baseline predictors above, we used a few temporal predictors that allowed a user’s rating to (linearly) depend on the time since the first rating he ever made and on the time since a movie’s first rating. We can also get more fine-grained temporal effects by, say, binning items into a couple months’ worth of ratings at a time, and allowing movie biases to change within each bin. (For example, maybe in May 2006, Time Magazine nominated Titanic as the best movie ever made, which caused a spurt in glowing ratings around that time.)

In the matrix factorization approach, user factors were also allowed to be time-dependent (e.g., maybe Bob comes to like comedy movies more and more over time). We can also give more weight to recent user actions.

Regularization

Regularization was also applied throughout pretty much all the models learned, to prevent overfitting on the dataset. Ridge regression was heavily used in the factorization models to penalize large weights, and lasso regression (though less effective) was useful as well. Many other parameters (e.g., the baseline predictors, similarity weights and interpolation weights in the neighborhood models) were also estimated using fairly standard shrinkage techniques.

Ensemble Methods

Finally, let’s talk about how all of these different algorithms were combined to provide a single rating that exploits the strengths of each model. (Note that, as mentioned above, many of these models were not trained on the raw ratings data directly, but rather on the residuals of other models.)

In the paper detailing their final solution, the winners describe using gradient boosted decision trees to combine over 500 models; previous solutions used instead a linear regression to combine the predictors.

Briefly, gradient boosted decision trees work by sequentially fitting a series of decision trees to the data; each tree is asked to predict the error made by the previous trees, and is often trained on slightly perturbed versions of the data. (For a longer description of a similar technique, see my introduction to random forests.)

Since GBDTs have a built-in ability to apply different methods to different slices of the data, we can add in some predictors that help the trees make useful clusterings:

Number of movies each user rated
Number of users that rated each movie
Factor vectors of users and movies
Hidden units of a restricted Boltzmann Machine

(For example, one thing that Bell and Koren found (when using an earlier ensemble method) was that RBMs are more useful when the movie or the user has a low number of ratings, and that matrix factorization methods are more useful when the movie or user has a high number of ratings.)

Here’s a graph of the effect of ensemble size from early on in the competition (in 2007), and the authors’ take on it:

However, we would like to stress that it is not necessary to have such a large number of models to do well. The plot below shows RMSE as a function of the number of methods used. One can achieve our winning score (RMSE=0.8712) with less than 50 methods, using the best 3 methods can yield RMSE < 0.8800, which would land in the top 10. Even just using our single best method puts us on the leaderboard with an RMSE of 0.8890. The lesson here is that having lots of models is useful for the incremental results needed to win competitions, but practically, excellent systems can be built with just a few well-selected models.

Stuff Harvard People Like

2011-09-29T04:15:00+02:00

What types of students go to which schools? There are, of course, the classic stereotypes:

MIT has the hacker engineers.
Stanford has the laid-back, social folks.
Harvard has the prestigious leaders of the world.
Berkeley has the activist hippies.
Caltech has the hardcore science nerds.

But how well do these perceptions match reality? What are students at Stanford, Harvard, MIT, Caltech, and Berkeley really interested in? Following the path of my previous data-driven post on differences between Silicon Valley and NYC, I scraped the Quora profiles of a couple hundred followers of each school to find out.

Topics

So let’s look at what kinds of topics followers of each school are interested in*. (Skip past the lists for a discussion.)

MIT

Topics are followed by p(school = MIT|topic).

MIT Media Lab 0.893
Ksplice 0.69
Lisp (programming language) 0.677
Nokia 0.659
Public Speaking 0.65
Data Storage 0.65
Google Voice 0.609
Hacking 0.602
Startups in Europe 0.597
Startup Names 0.572
Mechanical Engineering 0.563
Engineering 0.563
Distributed Databases 0.544
StackOverflow 0.536
Boston 0.513
Learning 0.507
Open Source 0.498
Cambridge 0.496
Public Relations 0.493
Visualization 0.492
Semantic Web 0.486
Andreessen-Horowitz 0.483
Nature 0.475
Cryptography 0.474
Startups in Boston 0.452
Adobe Photoshop 0.451
Computer Security 0.447
Sachin Tendulkar 0.443
Hacker News 0.442
Games 0.429
Android Applications 0.428
Best Engineers and Programmers 0.427
College Admissions & Getting Into College 0.422
Co-Founders 0.419
Big Data 0.41
System Administration 0.4
Biotechnology 0.398
Higher Education 0.394
NoSQL 0.387
User Experience 0.386
Career Advice 0.377
Artificial Intelligence 0.375
Scalability 0.37
Taylor Swift 0.368
Google Search 0.368
Functional Programming 0.365
Bing 0.363
Bioinformatics 0.361
How I Met Your Mother (TV series) 0.361
Operating Systems 0.356
Compilers 0.355
Google Chrome 0.354
Management & Organizational Leadership 0.35
Literary Fiction 0.35
Intelligence 0.348
Fight Club (1999 movie) 0.344
Hip Hop Music 0.34
UX Design 0.337
Web Application Frameworks 0.336
Startups in New York City 0.333
Book Recommendations 0.33
Engineering Recruiting 0.33
Search Engines 0.329
Social Search 0.329
Data Science 0.328
History 0.328
Interaction Design 0.326
Classification (machine learning) 0.322
Startup Incubators and Seed Programs 0.321
Graphic Design 0.321
Product Design (software) 0.319
The College Experience 0.319
Writing 0.319
MapReduce 0.318
Database Systems 0.315
User Interfaces 0.314
Literature 0.314
C (programming language) 0.314
Television 0.314
Reading 0.313
Usability 0.312
Books 0.312
Computers 0.311
Stealth Startups 0.311
Daft Punk 0.31
Healthy Eating 0.309
Innovation 0.309
Skiing 0.305
JavaScript 0.304
Rock Music 0.304
Mozilla Firefox 0.304
Self-Improvement 0.303
McKinsey & Company 0.302
AngelList 0.301
Data Visualization 0.301
Cassandra (database) 0.301

Stanford

Topics are followed by p(school = Stanford|topic).

Stanford Computer Science 0.951
Stanford Graduate School of Business 0.939
Stanford 0.896
Stanford Football 0.896
Stanford Cardinal 0.896
Social Dance 0.847
Stanford University Courses 0.847
Romance 0.769
Instagram 0.745
College Football 0.665
Mobile Location Applications 0.634
Online Communities 0.621
Interpersonal Relationships 0.585
Food & Restaurants in Palo Alto 0.572
Your 20s 0.566
Men’s Fashion 0.548
Flipboard 0.537
Inception (2010 movie) 0.535
Tumblr 0.531
People Skills 0.522
Exercise 0.52
Joel Spolsky 0.516
Valuations 0.515
The Social Network (2010 movie) 0.513
LeBron James 0.506
Northern California 0.506
Evernote 0.5
Quora Community 0.5
Blogging 0.49
Downtown Palo Alto 0.487
The College Experience 0.485
Consumer Internet 0.477
Restaurants in San Francisco 0.477
Chad Hurley 0.47
Meditation 0.468
Yishan Wong 0.466
Arrested Development (TV series) 0.463
fbFund 0.457
Best Engineers at X Company 0.451
Language 0.45
Words 0.448
Happiness 0.447
Path (company) 0.446
Color Labs (startup) 0.446
Palo Alto 0.445
Woot.com 0.442
Beer 0.442
PayPal 0.441
Women in Startups 0.438
Techmeme 0.433
Women in Engineering 0.428
The Mission (San Francisco neighborhood) 0.427
iPhone Applications 0.416
Asana 0.413
Monetization 0.412
Repetitive Strain Injury (RSI) 0.4
IDEO 0.398
Spotify 0.397
San Francisco Giants 0.396
Fortune Magazine 0.389
Love 0.387
Human-Computer Interaction 0.382
Hip Hop Music 0.378
Self-Improvement 0.378
Food in San Francisco 0.375
Quora (company) 0.374
Quora Infrastructure 0.373
iPhone 0.371
Square (company) 0.369
Social Psychology 0.369
Network Effects 0.366
Chris Sacca 0.365
Walt Mossberg 0.364
Salesforce.com 0.362
Sex 0.361
Etiquette 0.361
David Pogue 0.361
Gowalla 0.36
iOS Development 0.354
Palantir Technologies 0.353
Mobile Computing 0.347
Sports 0.346
Video Games 0.345
Burning Man 0.345
Engineering Management 0.343
Cognitive Science 0.342
Dating & Relationships 0.341
Fred Wilson (venture investor) 0.337
Taiwan 0.333
Natural Language Processing 0.33
Eric Schmidt 0.329
Social Advice 0.329
Engineering Recruiting 0.328
Job Interviews 0.325
Mobile Phones 0.324
Twitter Inc. (company) 0.321
Engineering in Silicon Valley 0.321
San Francisco Bay Area 0.321
Google Analytics 0.32
Fashion 0.315
Interaction Design 0.314
Open Graph 0.313
Drugs & Pharmaceuticals 0.312
Electronic Music 0.312
Facebook Inc. (company) 0.309
Fitness 0.309
YouTube 0.308
TED Talks 0.308
Freakonomics (2005 Book) 0.307
Jack Dorsey 0.306
Nutrition 0.305
Puzzles 0.305
Silicon Valley Mergers & Acquisitions 0.304
Viral Growth & Analytics 0.304
Amazon Web Services 0.304
StumbleUpon 0.303
Exceptional Comment Threads 0.303

Harvard

Harvard Business School 0.968
Harvard Business Review 0.922
Harvard Square 0.912
Harvard Law School 0.912
Jimmy Fallon 0.899
Boston Red Sox 0.658
Klout 0.644
Oprah Winfrey 0.596
Ivanka Trump 0.587
Dalai Lama 0.569
Food in New York City 0.565
U2 0.562
TwitPic 0.534
37signals 0.522
David Lynch (director) 0.512
Al Gore 0.508
TechStars 0.49
Baseball 0.487
Private Equity 0.471
Classical Music 0.46
Startups in New York City 0.458
HootSuite 0.449
Kiva 0.442
Ultimate Frisbee 0.441
Huffington Post 0.436
New York City 0.433
Charlie Cheever 0.433
The New York Times 0.431
Technology Journalism 0.431
McKinsey & Company 0.427
TweetDeck 0.422
How Does X Work? 0.417
Ashton Kutcher 0.414
Coldplay 0.402
Conan O’Brien 0.397
Fast Company 0.397
WikiLeaks 0.394
Michael Jackson 0.389
Guy Kawasaki 0.389
Journalism 0.384
Wall Street Journal 0.384
Cambridge 0.371
Seattle 0.37
Cities & Metro Areas 0.357
Boston 0.353
Tim Ferriss (author) 0.35
The New Yorker 0.343
Law 0.34
Mashable 0.338
Politics 0.335
The Economist 0.334
Barack Obama 0.333
Skiing 0.329
McKinsey Quarterly 0.325
Wired (magazine) 0.316
Bill Gates 0.31
Mad Men (TV series) 0.308
India 0.306
TED Talks 0.306
Netflix 0.304
Wine 0.303
Angel Investors 0.302
Facebook Ads 0.301

UC Berkeley

Berkeley 0.978
California Golden Bears 0.91
Internships 0.717
Web Marketing 0.484
Google Social Strategy 0.453
Southwest Airlines 0.451
WordPress 0.429
Stock Market 0.429
BMW (automobile) 0.428
Web Applications 0.423
Flickr 0.422
Snowboarding 0.42
Electronic Music 0.404
MySQL 0.401
Internet Advertising 0.399
Search Engine Optimization (SEO) 0.398
Yelp 0.396
Groupon 0.393
In-N-Out Burger 0.391
The Matrix (1999 movie) 0.389
Trading (finance) 0.385
jQuery 0.381
Hedge Funds 0.378
Social Media Marketing 0.377
San Francisco 0.376
Stealth Startups 0.362
Yahoo! 0.36
Cascading Style Sheets 0.359
Angel Investors 0.355
UX Design 0.35
StarCraft 0.348
Los Angeles Lakers 0.347
Mountain View 0.345
How I Met Your Mother (TV series) 0.338
Google+ 0.337
Ruby on Rails 0.333
Reading 0.333
Social Media 0.326
China 0.322
Palantir Technologies 0.319
Facebook Platform 0.315
Basketball 0.315
Education 0.314
Business Development 0.312
Online & Mobile Payments 0.305
Restaurants in San Francisco 0.302
Technology Companies 0.302
Seth Godin 0.3

Caltech

Pasadena 0.969
Chess 0.748
Table Tennis 0.671
UCLA 0.67
MacBook Pro 0.618
Physics 0.618
Haskell 0.582
Los Angeles 0.58
Electrical Engineering 0.567
Star Trek (movie 0.561
Disruptive Technology 0.545
Science 0.53
Biology 0.526
Quantum Mechanics 0.521
LaTeX 0.514
Mathematics 0.488
xkcd 0.488
Genetics & Heredity 0.487
Chemistry 0.47
Medicine & Healthcare 0.448
Poker 0.445
C++ (programming language) 0.442
Data Structures 0.434
Emacs 0.428
MongoDB 0.423
Neuroscience 0.404
Science Fiction 0.4
Mac OS X 0.394
Board Games 0.387
Computers 0.386
Research 0.385
Finance 0.385
The Future 0.379
Linux 0.378
The Colbert Report 0.376
The Beatles 0.374
The Onion 0.365
Ruby 0.363
Cars & Automobiles 0.361
Quantitative Finance 0.359
Academia 0.359
Law 0.355
Cooking 0.354
Psychology 0.349
Eminem 0.347
Football (Soccer) 0.346
Computer Programming 0.343
Algorithms 0.343
Evolutionary Biology 0.337
Behavioral Economics 0.335
California 0.329
Machine Learning 0.326
Futurama 0.324
Social Advice 0.324
StarCraft II 0.319
Job Interview Questions 0.318
Game Theory 0.316
This American Life 0.315
Economics 0.314
Vim 0.31
Graduate School 0.309
Git (revision control) 0.306
Computer Science 0.303

What do we see?

First, in a nice validation of this approach, we find that each school is interested in exactly the locations we’d expect: Caltech is interested in Pasadena and Los Angeles; MIT and Harvard are both interested in Boston and Cambridge (Harvard is interested in New York City as well); Stanford is interested in Palo Alto, Northern California, and San Francisco Bay Area; and Berkeley is interested in Berkeley, San Francisco, and Mountain View.
More interestingly, let’s look at where each school likes to eat. Stereotypically, we expect Harvard, Stanford, and Berkeley students to be more outgoing and social, and MIT and Caltech students to be more introverted. This is indeed what we find:
- Harvard follows Food in New York City; Stanford follows Food & Restaurants in Palo Alto, Restaurants in San Francisco, and Food in San Francisco; and Berkeley follows Restaurants in San Francisco and In-N-Out Burger. In other words, Harvard, Stanford, and Berkeley love eating out.
- Caltech, on the other hand, loves Cooking, and MIT loves Healthy Eating – both signs, perhaps, of a preference for eating in.
And what does each university use to quench their thirst? Harvard students like to drink wine (classy!), while Stanford students prefer beer (the social drink of choice).
What about sports teams? MIT and Caltech couldn’t care less, though Harvard follows the Boston Red Sox, Stanford follows the San Francisco Giants (as well as their own Stanford Football and Stanford Cardinal), and Berkeley follows the Los Angeles Lakers (and the California Golden Bears).
For sports themselves, MIT students like skiing; Stanford students like general exercise, fitness, and sports; Harvard students like baseball, ultimate frisbee, and skiing; and Berkeley students like snowboarding. Caltech, in a league of its own, enjoys table tennis and chess.
What does each school think of social? Caltech students look for Social Advice. Berkeley students are interested in Social Media and Social Media Marketing. MIT, on the more technical side, wants Social Search. Stanford students, predictably, love the whole spectrum of social offerings, from Social Dance and The Social Network, to Social Psychology and Social Advice. (Interestingly, Caltech and Stanford are both interested in Social Advice, though I wonder if it’s for slightly different reasons.)
What’s each school’s relationship with computers? Caltech students are interested in Computer Science, MIT hackers are interested in Computer Security, and Stanford students are interested in Human-Computer Interaction.
Digging into the MIT vs. Caltech divide a little, we see that Caltech students really are more interested in the pure sciences (Physics, Science, Biology, Quantum Mechanics, Mathematics, Chemistry, etc.), while MIT students are more on the applied and engineering sides (Mechanical Engineering, Engineering, Distributed Databases, Cryptography, Computer Security, Biotechnology, Operating Systems, Compilers, etc.).
Regarding programming languages, Caltech students love Haskell (hardcore purity!), while MIT students love Lisp.
What does each school like to read, both offline and online? Caltech loves science fiction, xkcd, and The Onion; MIT likes Hacker News; Harvard loves journals, newspapers, and magazines (Huffington Post, the New York Times, Fortune, Wall Street Journal, the New Yorker, the Economist, and so on); and Stanford likes TechMeme.
What movies and television shows does each school like to watch? Caltech likes Star Trek, the Colbert Report, and Futurama. MIT likes Fight Club (I don’t know what this has to do with MIT, though I will note that on my first day as a freshman in a new dorm, Fight Club was precisely the movie we all went to a lecture hall to see). Stanford likes The Social Network and Inception. Harvard, rather fittingly, likes Mad Men and Ted Talks.
Let’s look at the startups each school follows. MIT, of course, likes Ksplice. Berkeley likes Yelp and Groupon. Stanford likes just about every startup under the sun (Instagram, Flipboard, Tumblr, Path, Color Labs, etc.). And Harvard, that bastion of hard-won influence and prestige? To the surprise of precisely no one, Harvard enjoys Klout.

Let’s end with a summarized view of each school:

Caltech is very much into the sciences (Physics, Biology, Quantum Mechanics, Mathematics, etc.), as well as many pretty nerdy topics (Star Trek, Science Fiction, xkcd, Futurama, Starcraft II, etc.).
MIT is dominated by everything engineering and tech.
Stanford loves relationships (interpersonal relationships, people skills, love, network effects, sex, etiquette, dating and relationships, romance), health and appearance (fashion, fitness, nutrition, happiness), and startups (Instagram, Flipboard, Path, Color Labs, etc.).
Berkeley, sadly, is perhaps too large and diverse for an overall characterization.
Harvard students are fascinated by famous figures (Jimmy Fallon, Oprah Winfrey, Invaka Trump, Dalai Lama, David Lynch, Al Gore, Bill Gates, Barack Obama), and by prestigious newspapers, journals, and magazines (Fortune, the New York Times, the Wall Street Journal, the Economist, and so on). Other very fitting interests include Kiva, classical music, and Coldplay.

*I pulled about 400 followers from each school, and added a couple filters, to try to ensure that followers were actual attendees of the schools rather than general people simply interested in them. Topics are sorted using a naive Bayes score and filtered to have at least 5 counts. Also, a word of warning: my dataset was fairly small and users on Quora are almost certainly not representative of their schools as a whole (though I tried to be rigorous with what I had).

Information Transmission in a Social Network: Dissecting the Spread of a Quora Post

2011-09-07T04:15:00+02:00

tl;dr See this movie visualization for a case study on how a post propagates through Quora.

How does information spread through a network? Much of Quora’s appeal, after all, lies in its social graph – and when you’ve got a network of users, all broadcasting their activities to their neighbors, information can cascade in multiple ways. How do these social designs affect which users see what?

Think, for example, of what happens when your kid learns a new slang word at school. He doesn’t confine his use of the word to McKinley Elementary’s particular boundaries, between the times of 9-3pm – he introduces it to his friends from other schools at soccer practice as well. A couple months later, he even says it at home for the first time; you like the word so much, you then start using it at work. Eventually, Justin Bieber uses the word in a song, at which point the word’s popularity really starts to explode.

So how does information propagate through a social network? What types of people does an answer on Quora reach, and how does it reach them? (Do users discover new answers individually, or are hubs of connectors more key?) How does the activity of a post on Quora rise and fall? (Submissions on other sites have limited lifetimes, fading into obscurity soon after an initial spike; how does that change when users are connected and every upvote can revive a post for someone else’s eyes?)

(I looked at Quora since I had some data from there already available, but I hope the lessons should be fairly applicable in general, to other social networks like Facebook, Twitter, and LinkedIn as well.)

To give an initial answer to some of these questions, I dug into one of my more popular posts, on a layman’s introduction to random forests.

Users, Topics

Before looking deeper into the voting dynamics of the post, let’s first get some background on what kinds of users the answer reached.

Here’s a graph of the topics that question upvoters follow. (Each node is a topic, and every time upvoter X follows both topics A and B, I add an edge between A and B.)

We can see from the graph that upvoters tend to be interested in three kinds of topics:

Machine learning and other technical matters (the green cluster): Classification, Data Mining, Big Data, Information Retrieval, Analytics, Probability, Support Vector Machines, R, Data Science, …
Startups/Silicon Valley (the red cluster): Facebook, Lean Startups, Investing, Seed Funding, Angel Investing, Technology Trends, Product Managment, Silicon Valley Mergers and Acquisitions, Asana, Social Games, Quora, Mark Zuckerberg, User Experience, Founders and Entrepreneurs, …
General Intellectual Topics (the purple cluster): TED, Science, Book Recommendations, Philosophy, Politics, Self-Improvement, Travel, Life Hacks, …

Also, here’s the network of the upvoters themselves (there’s an edge between users A and B if A follows B):

We can see three main clusters of users:

A large group in green centered around a lot of power users and Quora employees.
A machine learning group of folks in orange centered around people like Oliver Grisel, Christian Langreiter, and Joseph Turian.
A group of people following me, in purple.
Plus some smaller clusters in blue and yellow. (There were also a bunch of isolated users, connected to no one, that I filtered out of the picture.)

Digging into how these topic and user graphs are related:

The orange cluster of users is more heavily into machine learning: 79% of users in that cluster follow more green topics (machine learning and technical topics) than red and purple topics (startups and general intellectual matters).
The green cluster of users is reversed: 77% of users follow more of the red and purple clusters of topics (on startups and general intellectual matters) than machine learning and technical topics.

More interestingly, though, we can ask: how do the connections between upvoters relate to the way the post spread?

Social Voting Dynamics

So let’s take a look. Here’s a visualization I made of upvotes on my answer across time (click here for a larger view).

To represent the social dynamics of these upvotes, I drew an edge from user A to user B if user A transmitted the post to user B through an upvote. (Specifically, I drew an edge from Alice to Bob if Bob follows Alice and Bob’s upvote appeared within five days of Alice’s upvote; this is meant to simulate the idea that Alice was the key intermediary between my post and Bob.)

Also,

Green nodes are users with at least one upvote edge.
Blue nodes are users who follow at least one of the topics the post is categorized under (i.e., users who probably discovered the answer by themselves).
Red nodes are users with no connections and who do not follow any of the post’s topics (i.e, users whose path to the post remain mysterious).
Users increase in size when they produce more connections.

Here’s a play-by-play of the video:

On Feb 14 (the day I wrote the answer), there’s a flurry of activity.
A couple of days later, Tracy Chou gives an upvote, leading to another spike in activity.
Then all’s quiet until… bam! Alex Kamil leads to a surge of upvotes, and his upvote finds Ludi Rehak, who starts a small surge of her own. They’re quickly followed by Christian Langreiter, who starts a small revolution among a bunch of machine learning folks a couple days later.
Then all is pretty calm again, until a couple months later when… bam! Aditya Sengupta brings in a smashing of his own followers, and his upvote makes its way to Marc Bodnick, who sets off a veritable storm of activity.

(Already we can see some relationships between the graph of user connections and the way the post propagated. Many of the users from the orange cluster, for example, come from Alex Kamil and Christian Langreiter’s upvotes, and many of the users from the green cluster come from Aditya Sengupta and Marc Bodnick’s upvotes. What’s interesting, though, is, why didn’t the cluster of green users appear all at once, like the orange cluster did? People like Kah Seng Tay, Tracy Chou, Venkatesh Rao, and Chad Little upvoted the answer pretty early on, but it wasn’t until Aditya Sengupta’s upvote a couple months later that people like Marc Bodnick, Edmond Lau, and many of the other green users (who do indeed follow that first set of folks) discovered the answer. Did the post simply get lost in users’ feeds the first time around? Was the post perhaps ignored until it received enough upvotes to be considered worth reading? Are some users’ upvotes just trusted more than others’?)

For another view of the upvote dynamics, here’s a static visualization, where we can again easily see the clusters of activity:

Fin

There are still many questions it would be interesting to look at; for example,

What differentiates users who sparked spikes of activity from users who didn’t? I don’t believe it’s simply number of followers, as many well-connected upvoters did not lead to cascades of shares. Does authority matter?
How far can a post reach? Clearly, the post reached people more than one degree of separation away from me (where one degree of separation is a follower); what does the distribution of degrees look like? Is there any relationship between degree of separation and time of upvote?
What can we say about the people who started following me after reading my answer? Are they fewer degrees of separation away? Are they more interested in machine learning? Have they upvoted any of my answers before? (Perhaps there’s a certain “threshold” of interestingness people need to overflow before they’re considered acceptable followees.)

But to summarize a bit what we’ve seen so far, here are some statistics on the role the social graph played in spreading the post:

There are 5 clusters of activity after the initial post, sparked both by power users and less-connected folks. In an interesting cascade of information, some of these sparks led to further spikes in activity as well (as when Aditya Sengupta’s upvote found its way to Marc Bodnick, who set off even more activity).
35% of users made their way to my answer because of someone else’s upvote.
Through these connections, the post reached a fair variety of users: 32% of upvoters don’t even follow any of the post’s topics.
77% of upvotes came from users over two weeks after my answer appeared.
If we look only at the upvoters who follow at least one of the post’s topics, 33% didn’t see my answer until someone else showed it to them. In other words, a full one-third of people who presumably would have been interested in my post anyways only found it because of their social network.

So it looks like the social graph played quite a large part in the post’s propagation, and I’ll end with a big shoutout to Stormy Shippy, who provided an awesome set of scripts I used to collect a lot of this data.

Introduction to Latent Dirichlet Allocation

2011-08-22T04:15:00+02:00

Introduction

Suppose you have the following set of sentences:

I like to eat broccoli and bananas.
I ate a banana and spinach smoothie for breakfast.
Chinchillas and kittens are cute.
My sister adopted a kitten yesterday.
Look at this cute hamster munching on a piece of broccoli.

What is latent Dirichlet allocation? It’s a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like

Sentences 1 and 2: 100% Topic A
Sentences 3 and 4: 100% Topic B
Sentence 5: 60% Topic A, 40% Topic B
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)

The question, of course, is: how does LDA perform this discovery?

LDA Model

In more detail, LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you

Decide on the number of words N the document will have (say, according to a Poisson distribution).
Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.
Generate each word w_i in the document by:
- First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
- Using the topic to generate the word itself (according to the topic’s multinomial distribution). For example, if we selected the food topic, we might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.

Example

Let’s make an example. According to the above process, when generating some particular document D, you might

Pick 5 to be the number of words in D.
Decide that D will be 1/2 about food and 1/2 about cute animals.
Pick the first word to come from the food topic, which then gives you the word “broccoli”.
Pick the second word to come from the cute animals topic, which gives you “panda”.
Pick the third word to come from the cute animals topic, giving you “adorable”.
Pick the fourth word to come from the food topic, giving you “cherries”.
Pick the fifth word to come from the food topic, giving you “eating”.

So the document generated under the LDA model will be “broccoli panda adorable cherries eating” (note that LDA is a bag-of-words model).

Learning

So now suppose you have a set of documents. You’ve chosen some fixed number of K topics to discover, and want to use LDA to learn the topic representation of each document and the words associated to each topic. How do you do this? One way (known as collapsed Gibbs sampling) is the following:

Go through each document, and randomly assign each word in the document to one of the K topics.
Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
So to improve on them, for each document d…
- Go through each word w in d…
  - And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where we choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability). (Also, I’m glossing over a couple of things here, in particular the use of priors/pseudocounts in these probabilities.)
  - In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).

Layman’s Explanation

In case the discussion above was a little eye-glazing, here’s another way to look at LDA in a different domain.

Suppose you’ve just moved to a new city. You’re a hipster and an anime fan, so you want to know where the other hipsters and anime geeks tend to hang out. Of course, as a hipster, you know you can’t just ask, so what do you do?

Here’s the scenario: you scope out a bunch of different establishments (documents) across town, making note of the people (words) hanging out in each of them (e.g., Alice hangs out at the mall and at the park, Bob hangs out at the movie theater and the park, and so on). Crucially, you don’t know the typical interest groups (topics) of each establishment, nor do you know the different interests of each person.

So you pick some number K of categories to learn (i.e., you want to learn the K most important kinds of categories people fall into), and start by making a guess as to why you see people where you do. For example, you initially guess that Alice is at the mall because people with interests in X like to hang out there; when you see her at the park, you guess it’s because her friends with interests in Y like to hang out there; when you see Bob at the movie theater, you randomly guess it’s because the Z people in this city really like to watch movies; and so on.

Of course, your random guesses are very likely to be incorrect (they’re random guesses, after all!), so you want to improve on them. One way of doing so is to:

Pick a place and a person (e.g., Alice at the mall).
Why is Alice likely to be at the mall? Probably because other people at the mall with the same interests sent her a message telling her to come.
In other words, the more people with interests in X there are at the mall and the stronger Alice is associated with interest X (at all the other places she goes to), the more likely it is that Alice is at the mall because of interest X.
So make a new guess as to why Alice is at the mall, choosing an interest with some probability according to how likely you think it is.

Go through each place and person over and over again. Your guesses keep getting better and better (after all, if you notice that lots of geeks hang out at the bookstore, and you suspect that Alice is pretty geeky herself, then it’s a good bet that Alice is at the bookstore because her geek friends told her to go there; and now that you have a better idea of why Alice is probably at the bookstore, you can use this knowledge in turn to improve your guesses as to why everyone else is where they are), and eventually you can stop updating. Then take a snapshot (or multiple snapshots) of your guesses, and use it to get all the information you want:

For each category, you can count the people assigned to that category to figure out what people have this particular interest. By looking at the people themselves, you can interpret the category as well (e.g., if category X contains lots of tall people wearing jerseys and carrying around basketballs, you might interpret X as the “basketball players” group).
For each place P and interest category C, you can compute the proportions of people at P because of C (under the current set of assignments), and these give you a representation of P. For example, you might learn that the people who hang out at Barnes & Noble consist of 10% hipsters, 50% anime fans, 10% jocks, and 30% college students.

Real-World Example

Finally, I applied LDA to a set of Sarah Palin’s emails a little while ago (see here for the blog post, or here for an app that allows you to browse through the emails by the LDA-learned categories), so let’s give a brief recap. Here are some of the topics that the algorithm learned:

Trig/Family/Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, …
Wildlife/BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, denby, fishing, …
Energy/Fuel/Oil/Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, …
Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, …
Education/Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, …
Presidential Campaign/Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, …

Here’s an example of an email which fell 99% into the Trig/Family/Inspiration category (particularly representative words are highlighted in blue):

And here’s an excerpt from an email which fell 10% into the Presidential Campaign/Election category (in red) and 90% into the Wildlife/BP Corrosion category (in green):

Introduction to Restricted Boltzmann Machines

2011-07-18T00:00:00+02:00

Suppose you ask a bunch of users to rate a set of movies on a 0-100 scale. In classical factor analysis, you could then try to explain each movie and user in terms of a set of latent factors. For example, movies like Star Wars and Lord of the Rings might have strong associations with a latent science fiction and fantasy factor, and users who like Wall-E and Toy Story might have strong associations with a latent Pixar factor.

Restricted Boltzmann Machines essentially perform a binary version of factor analysis. (This is one way of thinking about RBMs; there are, of course, others, and lots of different ways to use RBMs, but I’ll adopt this approach for this post.) Instead of users rating a set of movies on a continuous scale, they simply tell you whether they like a movie or not, and the RBM will try to discover latent factors that can explain the activation of these movie choices.

More technically, a Restricted Boltzmann Machine is a stochastic neural network (neural network meaning we have neuron-like units whose binary activations depend on the neighbors they’re connected to; stochastic meaning these activations have a probabilistic element) consisting of:

One layer of visible units (users’ movie preferences whose states we know and set);
One layer of hidden units (the latent factors we try to learn); and
A bias unit (whose state is always on, and is a way of adjusting for the different inherent popularities of each movie).

Furthermore, each visible unit is connected to all the hidden units (this connection is undirected, so each hidden unit is also connected to all the visible units), and the bias unit is connected to all the visible units and all the hidden units. To make learning easier, we restrict the network so that no visible unit is connected to any other visible unit and no hidden unit is connected to any other hidden unit.

For example, suppose we have a set of six movies (Harry Potter, Avatar, LOTR 3, Gladiator, Titanic, and Glitter) and we ask users to tell us which ones they want to watch. If we want to learn two latent units underlying movie preferences – for example, two natural groups in our set of six movies appear to be SF/fantasy (containing Harry Potter, Avatar, and LOTR 3) and Oscar winners (containing LOTR 3, Gladiator, and Titanic), so we might hope that our latent units will correspond to these categories – then our RBM would look like the following:

(Note the resemblance to a factor analysis graphical model.)

State Activation

Restricted Boltzmann Machines, and neural networks in general, work by updating the states of some neurons given the states of others, so let’s talk about how the states of individual units change. Assuming we know the connection weights in our RBM (we’ll explain how to learn these below), to update the state of unit $i$:

Compute the activation energy $a_i = \sum_j w_{ij} x_j$ of unit $i$, where the sum runs over all units $j$ that unit $i$ is connected to, $w_{ij}$ is the weight of the connection between $i$ and $j$, and $x_j$ is the 0 or 1 state of unit $j$. In other words, all of unit $i$’s neighbors send it a message, and we compute the sum of all these messages.
Let $p_i = \sigma(a_i)$, where $\sigma(x) = 1/(1 + exp(-x))$ is the logistic function. Note that $p_i$ is close to 1 for large positive activation energies, and $p_i$ is close to 0 for negative activation energies.
We then turn unit $i$ on with probability $p_i$, and turn it off with probability $1 - p_i$.
(In layman’s terms, units that are positively connected to each other try to get each other to share the same state (i.e., be both on or off), while units that are negatively connected to each other are enemies that prefer to be in different states.)

For example, let’s suppose our two hidden units really do correspond to SF/fantasy and Oscar winners.

If Alice has told us her six binary preferences on our set of movies, we could then ask our RBM which of the hidden units her preferences activate (i.e., ask the RBM to explain her preferences in terms of latent factors). So the six movies send messages to the hidden units, telling them to update themselves. (Note that even if Alice has declared she wants to watch Harry Potter, Avatar, and LOTR 3, this doesn’t guarantee that the SF/fantasy hidden unit will turn on, but only that it will turn on with high probability. This makes a bit of sense: in the real world, Alice wanting to watch all three of those movies makes us highly suspect she likes SF/fantasy in general, but there’s a small chance she wants to watch them for other reasons. Thus, the RBM allows us to generate models of people in the messy, real world.)
Conversely, if we know that one person likes SF/fantasy (so that the SF/fantasy unit is on), we can then ask the RBM which of the movie units that hidden unit turns on (i.e., ask the RBM to generate a set of movie recommendations). So the hidden units send messages to the movie units, telling them to update their states. (Again, note that the SF/fantasy unit being on doesn’t guarantee that we’ll always recommend all three of Harry Potter, Avatar, and LOTR 3 because, hey, not everyone who likes science fiction liked Avatar.)

Learning Weights

So how do we learn the connection weights in our network? Suppose we have a bunch of training examples, where each training example is a binary vector with six elements corresponding to a user’s movie preferences. Then for each epoch, do the following:

Take a training example (a set of six movie preferences). Set the states of the visible units to these preferences.
Next, update the states of the hidden units using the logistic activation rule described above: for the $j$th hidden unit, compute its activation energy $a_j = \sum_i w_{ij} x_i$, and set $x_j$ to 1 with probability $\sigma(a_j)$ and to 0 with probability $1 - \sigma(a_j)$. Then for each edge $e_{ij}$, compute $Positive(e_{ij}) = x_i * x_j$ (i.e., for each pair of units, measure whether they’re both on).
Now reconstruct the visible units in a similar manner: for each visible unit, compute its activation energy $a_i$, and update its state. (Note that this reconstruction may not match the original preferences.) Then update the hidden units again, and compute $Negative(e_{ij}) = x_i * x_j$ for each edge.
Update the weight of each edge $e_{ij}$ by setting $w_{ij} = w_{ij} + L * (Positive(e_{ij}) - Negative(e_{ij}))$, where $L$ is a learning rate.
Repeat over all training examples.

Continue until the network converges (i.e., the error between the training examples and their reconstructions falls below some threshold) or we reach some maximum number of epochs.

Why does this update rule make sense? Note that

In the first phase, $Positive(e_{ij})$ measures the association between the $i$th and $j$th unit that we want the network to learn from our training examples;
In the “reconstruction” phase, where the RBM generates the states of visible units based on its hypotheses about the hidden units alone, $Negative(e_{ij})$ measures the association that the network itself generates (or “daydreams” about) when no units are fixed to training data.

So by adding $Positive(e_{ij}) - Negative(e_{ij})$ to each edge weight, we’re helping the network’s daydreams better match the reality of our training examples.

(You may hear this update rule called contrastive divergence, which is basically a fancy term for “approximate gradient descent”.)

Examples

I wrote a simple RBM implementation in Python (the code is heavily commented, so take a look if you’re still a little fuzzy on how everything works), so let’s use it to walk through some examples.

First, I trained the RBM using some fake data.

Alice: (Harry Potter = 1, Avatar = 1, LOTR 3 = 1, Gladiator = 0, Titanic = 0, Glitter = 0). Big SF/fantasy fan.
Bob: (Harry Potter = 1, Avatar = 0, LOTR 3 = 1, Gladiator = 0, Titanic = 0, Glitter = 0). SF/fantasy fan, but doesn’t like Avatar.
Carol: (Harry Potter = 1, Avatar = 1, LOTR 3 = 1, Gladiator = 0, Titanic = 0, Glitter = 0). Big SF/fantasy fan.
David: (Harry Potter = 0, Avatar = 0, LOTR 3 = 1, Gladiator = 1, Titanic = 1, Glitter = 0). Big Oscar winners fan.
Eric: (Harry Potter = 0, Avatar = 0, LOTR 3 = 1, Gladiator = 1, Titanic = 1, Glitter = 0). Oscar winners fan, except for Titanic.
Fred: (Harry Potter = 0, Avatar = 0, LOTR 3 = 1, Gladiator = 1, Titanic = 1, Glitter = 0). Big Oscar winners fan.

The network learned the following weights:

Note that the first hidden unit seems to correspond to the Oscar winners, and the second hidden unit seems to correspond to the SF/fantasy movies, just as we were hoping.

What happens if we give the RBM a new user, George, who has (Harry Potter = 0, Avatar = 0, LOTR 3 = 0, Gladiator = 1, Titanic = 1, Glitter = 0) as his preferences? It turns the Oscar winners unit on (but not the SF/fantasy unit), correctly guessing that George probably likes movies that are Oscar winners.

What happens if we activate only the SF/fantasy unit, and run the RBM a bunch of different times? In my trials, it turned on Harry Potter, Avatar, and LOTR 3 three times; it turned on Avatar and LOTR 3, but not Harry Potter, once; and it turned on Harry Potter and LOTR 3, but not Avatar, twice. Note that, based on our training examples, these generated preferences do indeed match what we might expect real SF/fantasy fans want to watch.

Modifications

I tried to keep the connection-learning algorithm I described above pretty simple, so here are some modifications that often appear in practice:

Above, $Negative(e_{ij})$ was determined by taking the product of the $i$th and $j$th units after reconstructing the visible units once and then updating the hidden units again. We could also take the product after some larger number of reconstructions (i.e., repeat updating the visible units, then the hidden units, then the visible units again, and so on); this is slower, but describes the network’s daydreams more accurately.
Instead of using $Positive(e_{ij})=x_i * x_j$, where $x_i$ and $x_j$ are binary 0 or 1 states, we could also let $x_i$ and/or $x_j$ be activation probabilities. Similarly for $Negative(e_{ij})$.
We could penalize larger edge weights, in order to get a sparser or more regularized model.
When updating edge weights, we could use a momentum factor: we would add to each edge a weighted sum of the current step as described above (i.e., $L * (Positive(e_{ij}) - Negative(e_{ij})$) and the step previously taken.
Instead of using only one training example in each epoch, we could use batches of examples in each epoch, and only update the network’s weights after passing through all the examples in the batch. This can speed up the learning by taking advantage of fast matrix-multiplication algorithms.

Topic Modeling the Sarah Palin Emails

2011-06-27T04:15:00+02:00

LDA-based Email Browser

Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I’ve been working on some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups.

I threw up a simple demo app to view the organized documents here.

What is Latent Dirichlet Allocation?

Briefly, given a set of documents, LDA tries to learn the latent topics underlying the set. It represents each document as a mixture of topics (generated from a Dirichlet distribution), each of which emits words with a certain probability.

For example, given the sentence “I listened to Justin Bieber and Lady Gaga on the radio while driving around in my car”, an LDA model might represent this sentence as 75% about music (a topic which, say, emits the words Bieber with 10% probability, Gaga with 5% probability, radio with 1% probability, and so on) and 25% about cars (which might emit driving with 15% probability and cars with 10% probability).

If you’re familiar with latent semantic analysis, you can think of LDA as a generative version. (For a more in-depth explanation, I wrote an introduction to LDA here.)

Sarah Palin Email Topics

Here’s a sample of the topics learnt by the model, as well as the top words for each topic. (Names, of course, are based on my own interpretation.)

Wildlife/BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, denby, fishing, …
Energy/Fuel/Oil/Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, …
Trig/Family/Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, …
Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, …
Education/Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, …
Presidential Campaign/Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, …

Here’s a sample email from the wildlife topic:

I also thought the classification for this email was really neat: the LDA model labeled it as 10% in the Presidential Campaign/Elections topic and 90% in the Wildlife topic, and it’s precisely a wildlife-based protest against Palin as a choice for VP:

Future Analysis

In a future post, I’ll perhaps see if we can glean any interesting patterns from the email topics. For example, for a quick graph now, if we look at the percentage of emails in the Trig/Family/Inspiration topic across time, we see that there’s a spike in April 2008 – exactly (and unsurprisingly) the month in which Trig was born.

Filtering for English Tweets: Unsupervised Language Detection on Twitter

2011-05-01T04:15:00+02:00

(See a demo here.)

While working on a Twitter sentiment analysis project, I ran into the problem of needing to filter out all non-English tweets. (Asking the Twitter API for English-only tweets doesn’t seem to work, as it nonetheless returns tweets in Spanish, Portuguese, Dutch, Russian, and a couple other languages.)

Since I didn’t have any labeled data, I thought it would be fun to build an unsupervised language classifier. In particular, using an EM algorithm to build a naive Bayes model of English vs. non-English n-gram probabilities turned out to work quite well, so here’s a description.

EM Algorithm

Let’s recall the naive Bayes algorithm: given a tweet (a set of character n-grams), we estimate its language to be the language $L$ that maximizes

$$P(language = L | ngrams) \propto P(ngrams | language = L) P(language = L)$$

Thus, we need to estimate $P(ngram | language = L)$ and $P(language = L)$.

This would be easy if we knew the language of each tweet, since we could estimate

$P(xyz| language = English)$ as #(number of times “xyz” is a trigram in the English tweets) / #(total trigrams in the English tweets)
$P(language = English)$ as the proportion of English tweets.

Or, it would also be easy if we knew the n-gram probabilities for each language, since we could use Bayes’ theorem to compute the language probabilities for each tweet, and then take a weighted variant of the previous paragraph.

The problem is that we know neither of these. So what the EM algorithm says is that that we can simply guess:

Pretend we know the language of each tweet (by randomly assigning them at the beginning).
Using this guess, we can compute the n-gram probabilities for each language.
Using the n-gram probabilities for each language, we can recompute the language probabilities of each tweet.
Using these recomputed language probabilities, we can recompute the n-gram probabilities.
And so on, recomputing the language probabilities and n-gram probabilities over and over. While our guesses will be off in the beginning, the probabilities will eventually converge to (locally) minimize the likelihood. (In my tests, my language detector would sometimes correctly converge to an English detector, and sometimes it would converge to an English-and-Dutch detector.)

EM Analogy for the Layman

Why does this work? Suppose you suddenly move to New York, and you want a way to differentiate between tourists and New Yorkers based on their activities. Initially, you don’t know who’s a tourist and who’s a New Yorker, and you don’t know which are touristy activities and which are not. So you randomly place people into two groups A and B. (You randomly assign all tweets to a language)

Now, given all the people in group A, you notice that a large number of them visit the Statue of Liberty; similarly, you notice that a large number of people in group B walk really quickly. (You notice that one set of words often has the n-gram “ing”, and that another set of words often has the n-gram “ias”; that is, you fix the language probabilities for each tweet, and recompute the n-gram probabilities for each language.)

So you start to put people visiting the Statue of Liberty in group A, and you start to put fast walkers in group B. (You fix the n-gram probabilities for each language, and recompute the language probabilities for each tweet.)

With your new A and B groups, you notice more differentiating factors: group A people tend to carry along cameras, and group B people tend to be more finance-savvy.

So you start to put camera-carrying folks in group A, and finance-savvy folks in group B.

And so on. Eventually, you settle on two groups of people and differentiating activities: people who walk slowly and visit the Statue of Liberty, and busy-looking people who walk fast and don’t visit. Assuming there are more native New Yorkers than tourists, you can then guess that the natives are the larger group.

Results

I wrote some Ruby code to implement the above algorithm, and trained it on half a million tweets, using English and “not English” as my two languages. The results looked surprisingly good from just eyeballing:

But in order to get some hard metrics and to tune parameters (e.g., n-gram size), I needed a labeled dataset. So I pulled a set of English-language and Spanish-language documents from Project Gutenberg, and split them to form training and test sets (the training set consisted of 2000 lines of English and 1000 lines of Spanish, and 1000 lines of English and 1000 lines of Spanish for the test set).

Trained on bigrams, the detector resulted in:

991 true positives (English lines correctly classified as English)
9 false negatives (English lines incorrectly classified as Spanish
11 false positives (Spanish lines incorrectly classified as English)
989 true negatives (Spanish lines correctly classified as English)

for a precision of 0.989 and a recall of 0.991.

Trained on trigrams, the detector resulted in:

992 true positives
8 false negatives
10 false positives
990 true negatives

for a precision of 0.990 and a recall of 0.992.

Also, when I looked at the sentences the detector was making errors on, I saw that they almost always consisted of only one or two words (e.g., the incorrectly classified sentences were lines like “inmortal”, “autumn”, and “salir”). So the detector pretty much never made a mistake on a normal sentence!

Code/Demo

I put the code on my Github account, and a quick demo app, trained on trigrams from tweets with lang=”en” according to the Twitter API, is here.

Choosing a Machine Learning Classifier

2011-04-27T04:15:00+02:00

How do you know what machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you’re simply looking for a “good enough” algorithm for your problem, or a place to start, here are some general guidelines I’ve found to work well over the years.

How large is your training set?

If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren’t powerful enough to provide accurate models.

You can also think of this as a generative model vs. discriminative model distinction.

Advantages of some particular algorithms

Advantages of Naive Bayes: Super simple, you’re just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn’t hold, a NB classifier still often does a great job in practice. A good bet if want something fast and easy that performs pretty well. Its main disadvantage is that it can’t learn interactions between features (e.g., it can’t learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they’re together).

Advantages of Logistic Regression: Lots of ways to regularize your model, and you don’t have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

Advantages of Decision Trees: Easy to interpret and explain (for some people – I’m not sure I fall into this camp). They easily handle feature interactions and they’re non-parametric, so you don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don’t support online learning, so you have to rebuild your tree when new examples come on. Another disadvantage is that they easily overfit, but that’s where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they’re fast and scalable, and you don’t have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.

Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if you’re data isn’t linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive, hard to interpret, and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.

But…

Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, then whichever classification algorithm you use might not matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).

And to reiterate what I said above, if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize (and Middle Earth), just use an ensemble method to choose them all.

Kickstarter Data Analysis: Success and Pricing

2011-04-25T04:15:00+02:00

Kickstarter is an online crowdfunding platform for launching creative projects. When starting a new project, project owners specify a deadline and the minimum amount of money they need to raise. They receive the money (less a transaction fee) only if they reach or exceed that minimum; otherwise, no money changes hands.

What’s particularly fun about Kickstarter is that in contrast to that other microfinance site, Kickstarter projects don’t ask for loans; instead, patrons receive pre-specified rewards unique to each project. For example, someone donating money to help an artist record an album might receive a digital copy of the album if they donate 20 dollars, or a digital copy plus a signed physical cd if they donate 50 dollars.

There are a bunch of neat projects, and I’m tempted to put one of my own on there soon, so I thought it would be fun to gather some data from the site and see what makes a project successful.

Ending Soon

The categories section really only provides a history of successful projects, though, so to get some data on unsuccessful projects as well, I took a look at the Ending Soon section of projects whose deadlines are about to pass.

It looks like about 50% of all Kickstarter projects get successfully funded by the deadline:

Interestingly, most of the final funding seems to happen in the final few days: with just 5 days left, only about 20% of all projects have been funded. (In other words, with just 5 days left, 60% of the projects that will eventually be successful are still unfunded.) So the approaching deadline seems to really spur people to donate. I wonder if it’s because of increased publicity in the final few days (the project owners begging everyone for help!) or if it’s simply procrastination in action (perhaps people want to wait to see if their donation is really necessary)?

Lesson: if you’re still not fully funded with only a couple days remaining, don’t despair.

Success vs. Failure

What factors lead a project to succeed? Are there any quantitative differences between projects that eventually get funded and those that don’t?

Two simple (if kind of obvious) things I noticed are that unsuccessful projects tend to require a larger amount of money:

and unsuccessful projects also tend to raise less money in absolute terms (i.e., it’s not just that they ask for too much money to reach their goal – they’re simply not receiving enough money as well):

Not terribly surprising, but it’s good to confirm (and I’m still working on finding other predictors).

Pledge Rewards

There’s a lot of interesting work in behavioral economics on pricing and choice – for example, the anchoring effect suggests that when building a menu, you should include an expensive item to make other menu items look reasonably priced in comparison, and the paradox of choice suggests that too many choices lead to a decision freeze – so one aspect of the Kickstarter data I was especially interested in was how pricing of rewards affects donations. For example, does pricing the lowest reward at 25 dollars lead to more money donated (people don’t lowball at 5 dollars instead) or less money donated (25 dollars is more money than most people are willing to give)? And what happens if a new reward at 5 dollars is added – again, does it lead to more money (now people can donate something they can afford) or less money (the people that would have paid 25 dollars switch to a 5 dollar donation)?

First, here’s a look at the total number of pledges at each price. (More accurately, it’s the number of claimed rewards at each price.) [Update: the original version of this graph was wrong, but I’ve since fixed it.]

Surprisingly, 5 dollar and 1 dollar donations are actually not the most common contribution.

To investigate pricing effects, I started by looking at all (successful) projects that had a reward priced at 1 dollar, and compared the number of donations at 1 dollar with the number of donations at the next lowest reward.

Up to about 15-20 dollars, there’s a steady increase in the proportion of people who choose the second reward over the first reward, but after that, the proportion decreases.

So this perhaps suggests that if you’re going to price your lowest reward at 1 dollar, your next reward should cost roughly 20 dollars (or slightly more, to maximize your total revenue). Pricing above 20 dollars is a little too expensive for the folks who want to support you, but aren’t rich enough to throw gads of money; maybe rewards below 20 dollars aren’t good enough to merit the higher donation.

Next, I’m planning on digging a little deeper into pricing effects and what makes a project successful, so I’ll hopefully have some more Kickstarter analysis in a future post. In the meantime, in case anyone else wants to take a look, I put the data onto my Github account.

A Mathematical Introduction to Least Angle Regression

2011-04-21T04:15:00+02:00

(For a layman’s introduction, see here.)

Least Angle Regression (aka LARS) is a model selection method for linear regression (when you’re worried about overfitting or want your model to be easily interpretable). To motivate it, let’s consider some other model selection methods:

Forward selection starts with no variables in the model, and at each step it adds to the model the variable with the most explanatory power, stopping if the explanatory power falls below some threshold. This is a fast and simple method, but it can also be too greedy: we fully add variables at each step, so correlated predictors don’t get much of a chance to be included in the model. (For example, suppose we want to build a model for the deliciousness of a PB&J sandwich, and two of our variables are the amount of peanut butter and the amount of jelly. We’d like both variables to appear in our model, but since amount of peanut butter is (let’s assume) strongly correlated with the amount of jelly, once we fully add peanut butter to our model, jelly doesn’t add much explanatory power anymore, and so it’s unlikely to be added.)
Forward stagewise regression tries to remedy the greediness of forward selection by only partially adding variables. Whereas forward selection finds the variable with the most explanatory power and goes all out in adding it to the model, forward stagewise finds the variable with the most explanatory power and updates its weight by only epsilon in the correct direction. (So we might first increase the weight of peanut butter a little bit, then increase the weight of peanut butter again, then increase the weight of jelly, then increase the weight of bread, and then increase the weight of peanut butter once more.) The problem now is that we have to make a ton of updates, so forward stagewise can be very inefficient.

LARS, then, is essentially forward stagewise made fast. Instead of making tiny hops in the direction of one variable at a time, LARS makes optimally-sized leaps in optimal directions. These directions are chosen to make equal angles (equal correlations) with each of the variables currently in our model. (We like peanut butter best, so we start eating it first; as we eat more, we get a little sick of it, so jelly starts looking equally appetizing, and we start eating peanut butter and jelly simultaneously; later, we add bread to the mix, etc.)

In more detail, LARS works as follows:

Assume for simplicity that we’ve standardized our explanatory variables to have zero mean and unit variance, and that our response variable also has zero mean.
Start with no variables in your model.
Find the variable $ x_1 $ most correlated with the residual. (Note that the variable most correlated with the residual is equivalently the one that makes the least angle with the residual, whence the name.)
Move in the direction of this variable until some other variable $ x_2 $ is just as correlated.
At this point, start moving in a direction such that the residual stays equally correlated with $ x_1 $ and $ x_2 $ (i.e., so that the residual makes equal angles with both variables), and keep moving until some variable $ x_3 $ becomes equally correlated with our residual.
And so on, stopping when we’ve decided our model is big enough.

For example, consider the following image (slightly simplified from the original LARS paper; $x_1, x_2$ are our variables, and $y$ is our response):

Our model starts at $ \hat{\mu_0} $.

The residual (the green line) makes a smaller angle with $ x_1 $ than with $ x_2 $, so we start moving in the direction of $ x_1 $. At $ \hat{\mu_1} $, the residual now makes equal angles with $ x_1, x_2 $, and so we start moving in a new direction that preserves this equiangularity/equicorrelation.
If there were more variables, we’d change directions again once a new variable made equal angles with our residual, and so on.

So when should you use LARS, as opposed to some other regularization method like lasso? There’s not really a clear-cut answer, but LARS tends to give very similar results as both lasso and forward stagewise (in fact, slight modifications to LARS give you lasso and forward stagewise), so I tend to just use lasso when I do these kinds of things, since the justifications for lasso make a little more sense to me. In fact, I don’t usually even think of LARS as a model selection method in its own right, but rather as a way to efficiently implement lasso (especially if you want to compute the full regularization path).

Introduction to Cointegration and Pairs Trading

2011-04-16T04:15:00+02:00

Introduction

Suppose you see two drunks (i.e., two random walks) wandering around. The drunks don’t know each other (they’re independent), so there’s no meaningful relationship between their paths.

But suppose instead you have a drunk walking with her dog. This time there is a connection. What’s the nature of this connection? Notice that although each path individually is still an unpredictable random walk, given the location of one of the drunk or dog, we have a pretty good idea of where the other is; that is, the distance between the two is fairly predictable. (For example, if the dog wanders too far away from his owner, she’ll tend to move in his direction to avoid losing him, so the two stay close together despite a tendency to wander around on their own.) We describe this relationship by saying that the drunk and her dog form a cointegrating pair.

In more technical terms, if we have two non-stationary time series X and Y that become stationary when differenced (these are called integrated of order one series, or I(1) series; random walks are one example) such that some linear combination of X and Y is stationary (aka, I(0)), then we say that X and Y are cointegrated. In other words, while neither X nor Y alone hovers around a constant value, some combination of them does, so we can think of cointegration as describing a particular kind of long-run equilibrium relationship. (The definition of cointegration can be extended to multiple time series, with higher orders of integration.)

Other examples of cointegrated pairs:

Income and consumption: as income increases/decreases, so too does consumption.
Size of police force and amount of criminal activity
A book and its movie adaptation: while the book and the movie may differ in small details, the overall plot will remain the same.
Number of patients entering or leaving a hospital

An application

So why do we care about cointegration? In quantitative finance, cointegration forms the basis of the pairs trading strategy: suppose we have two cointegrated stocks X and Y, with the particular (for concreteness) cointegrating relationship X - 2Y = Z, where Z is a stationary series of zero mean. For example, X could be McDonald’s, Y could be Burger King, and the cointegration relationship would mean that X tends to be priced twice as high as Y, so that when X is more than twice the price of Y, we expect X to move down or Y to move up in the near future (and analogously, if X is less than twice the price of Y, we expect X to move up or Y to move down). This suggests the following trading strategy: if X - 2Y > d, for some positive threshold d, then we should sell X and buy Y (since we expect X to decrease in price and Y to increase), and similarly, if X - 2Y < -d, then we should buy X and sell Y.

Spurious regression

But why do we need the notion of cointegration at all? Why can’t we simply use, say, the R-squared between X or Y to see if X and Y have some kind of relationship? The reason is that standard regression analysis fails when dealing with non-stationary variables, leading to spurious regressions that suggest relationships even when there are none.

For example, suppose we regress two independent random walks against each other, and test for a linear relationship. A large percentage of the time, we’ll find high R-squared values and low p-values when using standard OLS statistics, even though there’s absolutely no relationship between the two random walks. As an illustration, here I simulated 1000 pairs of random walks of length 100, and found p-values less than 0.05 in 77% of the cases:

A Cointegration Test

So how do you detect cointegration? There are several different methods, but the simplest is the Engle-Granger test, which works roughly as follows:

Check that $ X_t $ and $ Y_t $ are both I(1).
Estimate the cointegrating relationship $ Y_t = aX_t + e_t $ by ordinary least squares.
Check that the cointegrating residuals $ e_t $ are stationary (say, by using a so-called unit root test, e.g., the Dickey-Fuller test).

Error-correction and Granger representation

Something else that should perhaps be mentioned is the relationship between cointegration and error-correction mechanisms: suppose we have two cointegrated series $ X_t, Y_t $, with autoregressive representations

$ X_t = a X_{t-1} + b Y_{t-1} + u_t $ $ Y_t = c X_{t-1} + d Y_{t-1} + v_t $

By the Granger representation theorem (which is actually a bit more general than this), we then have

$ \Delta X_t = \alpha_1 (Y_{t-1} - \beta X_{t-1}) + u_t $ $ \Delta Y_t = \alpha_2 (Y_{t-1} - \beta X_{t-1}) + v_t $

where $ Y_{t-1} - \beta X_{t-1} \sim I(0) $ is the cointegrating relationship. Regarding $ Y_{t-1} - \beta X_{t-1} $ as the extent of disequilibrium from the long-run relationship, and the $ \alpha_i $ as the speed (and direction) at which the time series correct themselves from this disequilibrium, we can see that this formalizes the way cointegrated variables adjust to match their long-run equilbrium.

Summary

So, just to summarize a bit, cointegration is an equilibrium relationship between time series that individually aren’t in equilbrium (you can kind of contrast this with (Pearson) correlation, which describes a linear relationship), and it’s useful because it allows us to incorporate both short-term dynamics (deviations from equilibrium) and long-run expectations (corrections to equilibrium).

Counting Clusters

2011-03-14T04:15:00+01:00

Given a set of datapoints, we often want to know how many clusters the datapoints form. The gap statistic and the prediction strength are two practical algorithms for choosing the number of clusters.

Gap Statistic

The gap statistic algorithm works as follows:

For each i from 1 up to some maximum number of clusters,

Run a k-means algorithm on the original dataset to find i clusters, and sum the distance of all points from their cluster mean. Call this sum the dispersion.
Generate a set of reference datasets (of the same size as the original). One simple way of generating a reference dataset is to sample uniformly from the original dataset’s bounding rectangle; a more sophisticated approach is take into account the original dataset’s shape by sampling, say, from a rectangle formed from the original dataset’s principal components.
Calculate the dispersion of each of these reference datasets, and take their mean.
Define the ith gap by: log(mean dispersion of reference datasets) - log(dispersion of original dataset).

Once we’ve calculated all the gaps (we can add confidence intervals as well; see the original paper for the formula), we can select the number of clusters to be the one that gives the maximum gap. (Sidenote: I view the gap statistic as a very statistical-minded algorithm, since it compares the original dataset against a set of reference “control” datasets.)

For example, here I’ve generated three Gaussian clusters:

And running the gap statistic algorithm, we see that it correctly detects the number of clusters to be three:

For a sample R implementation of the gap statistic, see the Github repository here.

Prediction Strength

Another cluster-counting algorithm is the prediction strength algorithm. In contrast to the gap statistic (which, as mentioned above, I find very statistically), I see prediction strength as taking a more machine learning viewpoint, since it’s formulated as a supervised learning problem validated against a test set.

To calculate prediction strength, for each i from 1 up to some maximum number of clusters:

Divide the dataset into two groups, a training set and a test set.
Run a k-means algorithm on each set to find i clusters.
For each test cluster, count the proportion of pairs of points in that cluster that would remain in the same cluster, if each were assigned to its closest training cluster mean.
The minimum over these proportions is the prediction strength for i clusters.

Once we’ve calculated the prediction strength for each number of clusters, we select the number of clusters to be the maximum i such that the prediction strength for i is greater than some threshold. (The paper suggests 0.8 - 0.9 as a good threshold, and I’ve seen 0.8 work well in practice.)

Here’s the prediction strength algorithm run on the same example above:

Again, check out a sample R implementation of the prediction strength here.

In practice, I tend to prefer using the gap statistic algorithm, since it’s a little easier to code and it doesn’t require selecting an arbitrary threshold like the prediction strength does. I’ve also found that it gives slightly better results (though the original prediction strength paper has the opposite finding).

Appendix

I ended up giving a brief description of two very common clustering algorithms, k-means and Gaussian mixture models in the comments, so I figured I might as well bring them up here.

k-means algorithm

Suppose we have a set of datapoints that we want to cluster. We want to learn two things:

A description of the clusters themselves (so that if new points come in, we can assign them to a cluster).
Which clusters our current points fall into.

We start by initializing k cluster centers (e.g., by randomly choosing k points among our datapoints). Then we repeatedly

Step A: Assign each datapoint to the nearest cluster center.
Step B: Update all the cluster centers: for each cluster i, take the mean over all points currently in the cluster, and update cluster center i to be this mean.
(Repeat steps A and B above until the cluster assignments stop changing.)

And that’s pretty much it for k-means.

k-means from an EM point of View

To ease the transition into Gaussian mixture models, let’s also describe the k-means algorithm using EM language.

Note that if we knew for certain either 1) the exact cluster centers or 2) the cluster each point belonged to, we could trivially solve k-means, since

If we knew the exact cluster centers, all we’d have to do is assign each point to its nearest cluster center, and we’d be done.
If we knew which cluster each point belonged to, we could pick the cluster center by simply taking the mean over all points in that cluster.

The problem is that we know neither of these, and so we alternate between making educated guesses of each one:

In A step above, we pretend that we know the cluster centers, and based off this pretense, we guess which cluster each point belongs to. (This is also known as the E step in the EM algorithm.)
In the B step above, we do the reverse: we pretend that we know which cluster each point belongs to, and then try to guess the cluster centers. (This is also known as the M step in EM.)

Our guesses keep getting better and better, and eventually we’ll converge.

Gaussian Mixture Models

k-means has a hard notion of clustering: point X either belongs to cluster C or it doesn’t. But sometimes we want a soft notion instead: point X belongs to cluster C with probability p (according to a Gaussian kernel). This is where Gaussian mixture modeling comes in.

To run a GMM, we start by initializing $k$ Gaussians (say, by randomly choosing $k$ points to be the centers of the Gaussians and by setting the variance of each Gaussians to be the overall variance), and then we repeatedly:

E Step: Pretend we know the parameters of each Gaussian cluster, and assign each datapoint to Gaussian cluster i with appropriate probability.
M Step: Pretend we know the probabilities that each point belongs to a given cluster. Using these probabilities, update the means and variances of each Gaussian cluster: the new mean for cluster i is the weighted mean over all points (where the weight of each point X is the probability that X belongs to cluster i), and similarly for the new variance.

This is exactly like k-means in the EM formulation, except we replace the binary clustering formula with Gaussian kernels.

Hacker News Analysis

2011-03-14T04:15:00+01:00

I was playing around with the Hacker News database Ronnie Roller made (thanks!), so I thought I’d post some of the things I looked at.

Activity on the Site

My first question was how activity on the site has increased over time. I looked at number of posts, points on posts, comments on posts, and number of users.

Posts

This looks like a strong linear fit, with an increase of 292 posts every month.

Comments

For comments, I fit a quadratic regression:

Points

A quadratic regression was also a better fit for points by month:

Users

And again for the number of distinct users with a submission:

Points and Comments

My next question was how points and comments related. Intuitively, posts with more points should have more comments, but it’s nice to check (maybe really good posts are kind of boring, so don’t lead to much discussion).

First, I plotted the points and comments of each individual post:

As expected, there’s an overall positive correlation between points and comments. Interestingly, there are quite a few high-points posts with no comments.

The plot’s quite noisy, though, so let’s try cleaning it up a bit, by taking the median number of comments per points level (and removing posts at the higher end, where we have little data):

We see that posts with more points do tend to have more comments. Also, variance in number of comments is indicated by size and color, so (unsurprisingly) posts with more points have larger variance in their number of comments.

Quality of Posts

Another question was whether the quality of posts has degraded over time.

First, I computed a normalized “score” for each post, where a post’s score is defined as the number of points divided by the number of distinct users who made a submission in the same month. (The denominator is a rough proxy for the number of active users, and the goal of the score is to provide a way to compare posts across time.)

While the median score has declined over time (as perhaps should be expected, since only a fixed number of items can reach the front page):

the absolute number of quality posts, defined as posts with a score greater than the (admittedly arbitrarily chosen) threshold 0.01, has increased (until possibly a dip starting in 2010):

(Of course, without some further analysis, it’s not clear how well this score measures quality of posts, so take these numbers with a grain of salt.)

Company Trends

Also, I wanted to see how certain topics have trended over time, so I looked at how mentions of some of the big-name companies (Google, Facebook, Microsoft, Yahoo, Twitter, Apple) have changed. For each company, I plotted the percentage of posts with the company’s name in the title, and also made a smoothed plot comparing all six at the end. Note that Microsoft and Yahoo seem to be trending slightly downward, and Apple seems to be trending upward.

Layman's Introduction to Measure Theory

2011-03-14T04:15:00+01:00

Measure theory studies ways of generalizing the notions of length/area/volume. Even in 2 dimensions, it might not be clear how to measure the area of the following fairly tame shape:

much less the “area” of even weirder shapes in higher dimensions or different spaces entirely.

For example, suppose you want to measure the length of a book (so that you can get a good sense of how long it takes to read). What’s a good measure? One possibility is to measure a book’s length in pages. Since books provide page counts, this is a fairly easy measure to get. However, different versions of the same book (e.g., hardcover and paperback versions) tend to have different page counts, so this page measure doesn’t satisfy the nice property of version invariance (which we would like to have, since hardcover and paperback versions of the same book take the same time to read). Also, not all books even have page counts (think Kindle books), so this measure doesn’t allow us to measure the length of all books we might want to read.

Another, possibly better measure is to measure a book’s length in terms of the number of words it contains. Now we do have version invariance (hardcover and paperback versions contain the same number of words) and we can measure the length of Kindle books as well. We can even do things like add two books together, and the measure/number of words of the concatenated books will pleasantly equal the sum of the measures/number of words of each book alone.

However, what happens when we try to measure a picture book’s length in words? We can’t – picture books are too pathological. Maybe we could say that a picture book has measure zero (since a picture book has no words), but then we get unhappy things like books of measure zero taking a really long time to read (imagine a really long picture book). So maybe a better option is to say that picture books are simply unmeasurable. Whenever someone asks for the length of a picture book, we ignore them, and this way our measure will continue to be a good approximation of reading time and we get to keep our other nice properties as well.

Similarly, measure theory asks questions like:

How do we define a measure on our space? (Jordan measure and Lebesgue measure are two different options in Euclidean space.)
What properties does our measure satisfy? (For example, does it satisfy translational invariance, rotational invariance, additivity?)
Which objects are measurable/which objects can we say it’s okay not to measure in order to preserve nice properties of our measure? (The Banach-Tarski ball can be rigidly reassembled into two copies of the same shape and size as the original, so we don’t want it to be measurable, since then we would lose additivity properties.)

And once we’ve defined a “generalized area” (our measure), we can try to generalize other mathematical concepts as well. For example, recall that the (Riemann) integral that you learn in calculus measures the area under a curve. What happens if we replace the “area” in the Riemann integral with our new, generalized measure (e.g., to get the Lebesgue integral)? Measure theory also helps make certain probability statements mathematically precise (e.g., we can say exactly what it means that a fair coin flipped infinitely often will “almost never” land heads more than 50% of the time).

Layman's Introduction to Random Forests

2011-03-14T04:15:00+01:00

Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you’ll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like “Is X a romantic movie?”, “Does Johnny Depp star in X?”, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.

Thus, Willow is a decision tree for your movie preferences.

But Willow is only human, so she doesn’t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you’d like to ask a bunch of your friends, and watch movie X if most of them say they think you’ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you’ll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).

Now you don’t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you’re not absolutely sure of your preferences yourself – you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn’t use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you really really loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don’t change your love/hate decisions, you just say you love/hate some movies a little more or less (formally, you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you tell Woody that you liked Black Swan so much you watched it twice, you disliked Avatar, and don’t mention Harry Potter at all.

By using this ensemble, you hope that while each of your friends gives somewhat idiosyncratic recommendations (Willow thinks you like vampire movies more than you do, Woody thinks you like Pixar movies, and Cartman thinks you just hate everything), the errors get canceled out in the majority. Thus, your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.

There’s still one problem with your data, however. While you loved both Titanic and Inception, it wasn’t because you like movies that star Leonardio DiCaprio. Maybe you liked both movies for other reasons. Thus, you don’t want your friends to all base their recommendations on whether Leo is in a movie or not. So when each friend asks IMDB a question, only a random subset of the possible questions is allowed (i.e., when you’re building a decision tree, at each node you use some randomness in selecting the attribute to split on, say by randomly selecting an attribute or by selecting an attribute from a random subset). This means your friends aren’t allowed to ask whether Leonardo DiCaprio is in the movie whenever they want. So whereas previously you injected randomness at the data level, by perturbing your movie preferences slightly, now you’re injecting randomness at the model level, by making your friends ask different questions at different times.

And so your friends now form a random forest.

Netflix Prize Summary: Factorization Meets the Neighborhood

2011-03-14T04:15:00+01:00

(Way back when, I went through all the Netflix prize papers. I’m now (very slowly) trying to clean up my notes and put them online. Eventually, I hope to have a more integrated tutorial, but here’s a rough draft for now.)

This is a summary of Koren’s 2008 Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model.

There are two approaches to collaborative filtering: neighborhood methods and latent factor models.

Neighborhood models are most effective at detecting very localized relationships (e.g., that people who like X-Men also like Spiderman), but poor at detecting a user’s overall signals.
Latent factor models are best at estimating overall structure (e.g., that a user likes horror movies), but are poor at detecting strong associations among small sets of closely related items.

Since the two approaches have complementary strengths and weaknesses, we should integrate the two; this integration is the focus of this paper.

Preliminaries

As mentioned in previous papers, we should normalize out common effects from movies. Throughout the rest of this paper, Koren uses a baseline estimate of overall rating mean + user deviation from average + movie deviation from average for the rating of user i on movie i; estimation of the latter two parameters are done by solving a regularized least squares problem.

Koren then describes using a binary matrix (1 for rated, 0 for not rated) as a source of implicit feedback. This is useful because the mere fact that a user rated many science fiction movies (say) suggests that the user likes science fiction movies.

A Neighborhood Model

Recall the previous paper, where we modeled each rating $r_{ui}$ as

$$r_{ui} = b_{ui}+ \sum_{N \in N(i; u)} (r_{uj} - b_{uj}) w_{ij},$$

where $N(i; u)$ is the k items most similar to i among the items user u rated, and the $w_{ij}$ are parameters to be learned by solving a regularized least squares problem.

This paper makes several enhancements to that model. First, we replace $N(i; u)$ with $R^k(i; u)$, the intersection of the k items most similar to i (among all items) intersected with the items user u rated. Also, we denote by $N^k(i; u)$ the intersection of the k items most similar to i with the items user u has provided implicit feedback for. This gives us

$$r_{ui} = b_{ui} + \sum_{j \in R^k(i; u)} (r_{uj} - b_{uj}) w_{ij} + \sum_{j \in N^k(i; u)} c_{ij},$$

where the $c_{ij}$ are another set of parameters to learn.

Notice that by taking the intersection of the k items most similar to i with the items user u rated (giving perhaps a set of size less than k), rather than taking the k items most similar to i among the items user u rated, we let our model be influenced not only by what a user rates, but also by what a user does not rate. For example, if a user does not rate LOTR 1 or LOTR 2, his predicted rating for LOTR 3 is penalized.

This implies that our current model encourages greater deviations from baseline estimates for users that provided many ratings or plenty of implicit feedback. In other words, for well-modeled users with a lot of input, we are willing to predict quirkier and less common recommendations; users we have less information about, on the other hand, receive safer, baseline estimates.

Nonetheless, this dichotomy between power users and newbie users is perhaps overemphasized by our current model, so we moderate the dichotomy by modifying our model to be

$$r_{ui} = b_{ui} + |R^k(i; u)|^{-0.5} \sum_{j \in R^k(i; u)} (r_{uj} - b_{uj}) w_{ij} + |N^k(i; u)|^{-0.5} \sum_{j \in N^k(i; u)} c_{ij}.$$

Parameters are determined by solving a regularized least squares problem.

Latent Factor Models Revisited

Typical SVD approaches are based on the following rule:

$$r_{ui} = b_{ui} + p_u^T q_i,$$

where $p_u$ is a user-factors vector and $q_i$ is an item-factors vector. We describe two enhancements.

Asymmetric-SVD

One suggestion is to replace $p_u$ with

$$|R(u)|^{-0.5} + \sum_{j \in R(u)} (r_{uj} - b_{uj}) x_j + |N(u)|^{-0.5} \sum_{j \in N(u)} y_j,$$

where $R(u)$ is the set of items user u has rated, and $N(u)$ is the set of items user u has provided implicit feedback for. In other words, this model represents users through the items they prefer, rather than expressing users in a latent feature space. This model has several advantages:

Asymmetric-SVD does not parameterize users, so we do not need to wait to retrain the model when a user comes in. Instead, we can handle new users as soon as they provide feedback.
Predictions are a direct function of past feedback, so we can easily explain predictions. (When using a pure latent feature solution, however, explainability is difficult.)

As usual, parameters are learned via a regularized least-squares minimization.

SVD++

Another approach is to continue modeling users as latent features, while adding implicit feedback. Thus, we replace $p_u$ with $p_u + |N(u)|^{-0.5} \sum_{j \in N(u)} y_j$. While we lose the easily explainability and immediate feedback of the Asymmetric-SVD model, this approach is likely more accurate.

An Integrated Model

An integrated model incorporating baseline estimates, the neighborhood approach, and the latent factor approach is as follows:

$$r_{ui} = \left[\mu + b_u + b_i\right] +\left[q_i^T \big(p_u + \sqrt{|N(u)|}\sum_{j \in N(u)} y_j \big)\right] + \left[\sqrt{|R^k(i;u)} \sum_{j \in R^k(i; u)}(r_{uj} - b_{uj})w_{ij}+\sqrt{|N^k(i;u)|} \sum_{j \in N^k(i; u)} c_{ij}\right].$$

Note that we have used $(\mu + b_u + b_i)$ as our baseline estimate. We also used the SVD++ model, but we could use the Asymmetric-SVD model instead.

This rule provides a 3-tier model for recommendations:

The first baseline group describes general properties of the item and user. For example, it may say that “The Sixth Sense” movie is known to be a good movie in general, and that Joe rates like the average user.
The next latent factor group may say that since “The Sixth Sense” and Joe rate high on the Psychological Thrillers Scale, Joe may like The Sixth Sense because he likes this genre of movies in general.
The final neighborhood tier makes fine-grained adjustments that are hard to file, such as the fact that Joe rated low the movie “Signs”, a similar psychological thriller by the same director.

As usual, model parameters are determined by minimizing the regularized squared error function through gradient descent.

Netflix Prize Summary: Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights

2011-03-14T04:15:00+01:00

This is a summary of Bell and Koren’s 2007 Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights paper.

tl;dr This paper’s main innovation is deriving neighborhood weights by solving a least squares problem, instead of using a standard similarity function to compute weights.

This paper improves upon the standard neighborhood approach to collaborative filtering in three areas: better data normalization, better neighbor weights (this is the key section), and better use of user data. I’ll first review the standard neighborhood approach, and follow with a description of these enhancements.

Background: Standard Neighborhood Approach to Collaborative Filtering

Recall that there are two types of neighborhood approaches:

User-based approaches: to predict user i’s rating of item j, take the users most similar to user i, and perform a weighted average of their ratings of item j.
Item-based approaches: to predict user i’s rating of item j, perform a weighted average of user i’s ratings of items similar to item j.

For example, to predict how you would rate the first Harry Potter movie, the user-based approach looks at how your friends rated the first Harry Potter movie, while the item-based approach looks at how you rated movies like Lord of the Rings and Twilight.

Better Data Normalization

Suppose I ask my friend Chris whether I should watch the latest Twilight movie. He tells me he would rate it 4.0/5 stars. Great, that’s a high rating, so that means I should watch it – or does it? It turns out that Chris is a super cheerful guy who’s never met a movie he didn’t like, and his average rating for a movie is actually 4.5/5 stars. So Twilight is actually less than average for him, and hence 4.0/5 stars from Chris isn’t actually that hearty a recommendation.

As another example, suppose you look at doctor ratings on Yelp. They’re abnormally high: the average is far from 3/5 stars. Why is this? Maybe it’s harder for people to change doctors than it is to go to a new restaurant, so people might not want to rate a doctor poorly when they know they’ll have to see the doctor again. Thus, an average rating of 5 stars on a McDonalds restaurant is much more impressive than an average of 5 stars on Dr. Joe.

The lesson is that when using existing ratings, we should normalize out these types of effects, so that ratings are as comparable as possible.

Another way of thinking about this is that we are simply building a regression model. That is, for each user u, we have a model $r_{ui} = (\sum \theta_u x_{ui}) + SpecificRating$, where the $x_{ui}$ are common explanatory variables and we want to estimate $\theta_u$; and similarly for each item i. Once we’ve estimated the $\theta_u$, we can use the fancier neighborhood models on the specific ratings.

For example, suppose we want to predict Bob’s rating of Titanic. We’ve built a regression model with two explanatory variables, whether the movie was Oscar-nominated (1 if so, -1 if not) and whether the movie contains Kate Winslet (1 if so, -1 if not), and we’ve determined that Bob’s weights on these two variables are -2 (Bob tends to hate Oscar movies) and +1.5 (Bob likes Kate Winslet). Similarly, his friend John has weights +1 and -0.5 for these two variables (John likes Oscars, but dislikes Kate Winslet). So if we know that John rated Titanic a 4, then we have 4 = 1(1) + -0.5(1) + (John’s specific rating), so John’s specific rating of Titanic is 3.5. If we use John’s rating alone to estimate Bob’s, we might guess that Bob would rate Titanic -2(1) + 1.5(1) + (John’s specific rating) = 3.0.

To estimate the $\theta_u$, we actually perform this estimation in sequence: each explanatory variable is used to model the residual from the previous explanatory variable. Also, instead of using the maximum-likelihood unbiased estimator $\hat{\theta_u} = \frac{\sum r_{ui} x_{ui}}{x _ {ui} ^ 2}$, we shrink the weights to prevent overfitting. From a Bayesian point of view, the shrinkage arises from a hierarchical model where the true $\theta_u \sim N(\mu, \sigma^2)$, and $\hat{\theta_u} | \theta_u \sim N(\theta_u, \sigma_u^2)$, leading to $E(\theta_u | \hat{\theta_u}) = \frac{\sigma^2 \hat{\theta_u} + \sigma_u^2 \mu}{\sigma^2 + \sigma_u^2}$.

In practice, the explanatory variables Bell and Koren found to work well included the overall mean of all ratings, each movie’s specific mean, each user’s specific mean, time since movie release, time since user join, and number of ratings for each movie.

Better Neighbor Weights

Let’s consider some deficiencies of the neighborhood approach:

Suppose I want to use the first LOTR movie to predict ratings of the first Harry Potter movie. To do this, I need to say how much weight the first LOTR movie should have in this prediction. But how do I choose this weight? Standard neighborhood approaches essentially pick arbitrary similarity functions (e.g., Pearson correlation, cosine distance) as the weight, possibly testing several similarity functions to see which gives the best performance, but is there a more principled approach to choosing weights?
The standard neighborhood approach ignores the fact that neighbors aren’t independent. For example, suppose all three LOTR movies are neighbors of the first HP movie. Since the three LOTR movies are so similar to each other, the standard approach is overcounting their information. Here’s an analogy: suppose I ask five of my friends where I should eat tonight. Three of them live together (boyfriend, girlfriend, and roommate), and they all recently took a trip together to Japan and are sick of Japanese food, so they vehemently recommend against sushi. Thus, my friends’ recommendations have a stronger bias than would appear if I asked five friends who didn’t know each other at all.

We’ll see how using an optimization method to derive weights (as opposed to deriving weights via a similarity function) overcomes these two limitations.

Recall our problem: we want to predict $r_{ui}$, user u’s rating of item i, and what we have is a set $N(i; u)$ of K neighbors of item i that user u has also rated. (These K neighbors are selected via a similarity function, as is standard.) So what we want to do is find weights $w_{ij}$ such that $r_{ui} = \sum_{j \in N(i; u) w_{ij} r_{uj}}$. A natural approach, then, is simply to choose our weights to minimize $\min_w \sum_{v \neq u} \left( r_{vi} - \sum_{j \in N(i; u)} w_{ij} r_{vj}\right)^2$.

Notice how this optimization solves our two problems above: it’s not only a more principled approach (we choose our weights by minimizing squared error), but by deriving weights simultaneously, we overcome interaction effects.

Differentiating our cost function, we find that the optimal weights satisfy the equation $Aw = b$, where A is a $K \times K$ matrix defined by $A_{jk} = \sum_{v \neq u} r_{vj} r_{vk}$ and $b$ is a vector defined by $b_j = \sum_{v \neq u} r_{vj} r_{vi}$.

However, not all users have rated every movie, so some of the ratings may be missing from the above formulas. So we should instead use an estimate of A and b, such as $\bar{A}_{jk} = \frac{\sum_{v \in U(j,k)} r_{vj} r_{vk}}{|U(j, k)|}$, where $U(j, k)$ is the set of users who rated both j and k, and similarly for b. To avoid overfitting, we should further modify by shrinking to a common mean: $\hat{A}_{jk} = \frac{|U(J,K)|\bar{A}_{jk} + \beta A_{\mu}}{|U(j,k)| + \beta}$, where $\beta$ is a shrinkage parameter and $A_{\mu}$ is the mean over all $\bar{A}$, and similarly for b.

Note that another benefit of our optimization-derived weights is that the weights of neighbors are no longer constrained to sum to 1. Thus, if an item simply has no strong neighbors, the neighbors’ prediction will have only a small effect.

Also, when engineering these methods in practice, we should precompute all item-item similarities and all entries in the matrix $A$.

Better Use of User Data

Neighborhood models typically follow the item-based approach for two reasons:

There are typically many more users than items, and new users come in much more frequently than new items, so it is easier to compute all pairs of item-item similarities.
Users have diverse tastes, so they aren’t as similar to each other. For example, Alice and Eve may both like horror movies, but disagree on comedies.

But there are various reasons we might want to use a user-based approach in addition to an item-based approach (say, a user hasn’t rated many items yet, but we can find similar users based on other types of data, such as browsing history; or, we want to predict user u’s rating on item i, but user u hasn’t rated any items similar to i), so let’s see if we can get around these limitations.

To get around the first limitation, we can project users into a lower-dimensional space (say, by using a singular value decomposition), where we can use a space-partitioning data structure (e.g., a kd-tree) or a nearest-neighbor algorithm (e.g., locality sensitive hashing) to find neighboring users.

To get around the second limitation – that a user u may be predictive of user v for some items, but less so for others – we incorporate item-item similarity into our weighting method. That is, when using the user-neighborhood model to predict user u’s rating on item i, we give higher weight to items similar to i, by choosing the weights to minimize $\min_w \sum_{j \neq i} s_{ij} \left( r_{uj} - \sum_{v \in N(u, i)} w_{uv} r_{vj} \right)^2,$ where the $s_{ij}$ are item-item similarities.

Appendix: Shrinkage

Parameter shrinkage is used a couple times in the paper, so let’s explain what it means.

Suppose that we want to estimate the probability of a coin. If we flip it once and see heads, then the maximum-likelihood estimate of heads is 1. But (as is typical for maximum-likelihood estimates), this is severe overfitting, and what we should do instead is shrink this maximum-likelihood estimate to a prior estimate of the probability of heads, say 1/2. (Note that shrinkage doesn’t necessarily mean decreasing the number, just moving it towards a prior estimate).

How should we perform this shrinkage? If our maximum-likelihood estimate of our parameter $\theta$ is $x$ and our prior mean is $\mu$, a natural estimation of $\theta$ is to use a weighted mean $\alpha x + (1 - \alpha)\mu$, where $\alpha$ is some measure of the degree of belief in our maximum likelihood estimate.

This weighted average approach has several interpretations:

We can also view it as a shrinkage of our maximum likelihood estimate to our prior mean: $\alpha x + (1 - \alpha)\mu = x + (1 - \alpha) (\mu - x)$
We can also view it as a Bayesian posterior: if we use a prior $\theta \sim N(\mu, \tau)$ (where $\tau$ is the precision of our Gaussian, not the variance) and a conditional distribution $x | \theta \sim N(\theta, \tau_x)$, then the posterior mean of $\theta$ is $\theta = \frac{\tau_x}{\tau_x + \tau}x + \frac{\tau}{\tau_x + \tau}\mu,$ which is equivalent to the form above.

Prime Numbers and the Riemann Zeta Function

2011-03-14T04:15:00+01:00

Lots of people know that the Riemann Hypothesis has something to do with prime numbers, but most introductions fail to say what or why. I’ll try to give one angle of explanation.

Layman’s Terms

Suppose you have a bunch of friends, each with an instrument that plays at a frequency equal to the imaginary part of a zero of the Riemann zeta function. If the Riemann Hypothesis holds, you can create a song that sounds exactly at the prime-powered beats, by simply telling all your friends to play at the same volume.

Mathematical Terms

Let $ \pi(x) $ denote the number of primes less than or equal to x. Recall Gauss’s approximation: $ \pi(x) \approx \int\_2\^x \frac{1}{\log t} \,dt $ (aka, the “probability that a number n is prime” is approximately $ \frac{1}{\log n} $).

Riemann improved on Gauss’s approximation by discovering an exact formula $ P(x) = A(x) - E(x) $ for counting the primes, where

$ P(x) = \sum\_{p\^k < x} \frac{1}{k} $ performs a weighted count of the prime powers less than or equal to x. [Think of this as a generalization of the prime counting function.]
$ A(x) = \int\_0\^x \frac{1}{\log t} \,dt+ \int\_x\^{\infty} \frac{1}{t(t\^2 -1) \log t} \,dt $ $ - \log 2 $ is a kind of generalization of Gauss’s approximation.
$ E(x) = \sum\_{z : \zeta(z) = 0} \int\_0\^{x\^z} \frac{1}{\log t} \,dt $ is an error-correcting factor that depends on the zeroes of the Riemann zeta function.

In other words, if we use a simple Gauss-like approximation to the distribution of the primes, the zeroes of the Riemann zeta function sweep up after our errors.

Let’s dig a little deeper. Instead of using Riemann’s formula, I’m going to use an equivalent version

$$ \psi(x) = (x + \sum\_{n = 1}\^{\infty} \frac{x\^{-2n}}{2n} - \log 2\pi) - \sum\_{z : \zeta(z) = 0} \frac{x\^z}{z} $$

where $ \psi(x) = \sum\_{p\^k \le x} \log p $. Envisioning this formula to be in the same $P(x) = A(x) - E(x)$ form as above, this time where

$ P(x) = \psi(x) = \sum\_{p\^k \le x} \log p $ is another kind of count of the primes.
$ A(x) = x + \sum\_{n = 1}\^{\infty} \frac{x\^{-2n}}{2n} - \log 2\pi $ is another kind of approximation to $P(x)$.
$ E(x) = \sum\_{z : \zeta(z) = 0} \frac{x\^z}{z} $ is another error-correction factor that depends on the zeroes of the Riemann zeta function.

we can again interpret it as an error-correcting formula for counting the primes.

Now since $ \psi(x) $ is a step function that jumps at the prime powers, its derivative $ \psi’(x) $ has spikes at the prime powers and is zero everywhere else. So consider

$$ \psi’(x) = 1 - \frac{1}{x(x\^2 - 1)} - \sum\_z x\^{z-1} $$

It’s well-known that the zeroes of the Riemann zeta function are symmetric about the real axis, so the (non-trivial) zeroes come in conjugate pairs $ z, \bar{z} $. But $ x\^{z-1} + x\^{\bar{z} - 1} $ is just a wave whose amplitude depends on the real part of z and whose frequency depends on the imaginary part (i.e., if $ z = a + bi $, then $ x\^{z-1} + x\^{\bar{z}-1} = 2x\^{a-1} cos (b \log x) $), which means $ \psi’(x) $ can be decomposed into a sum of zeta-zero waves. Note that because of the $2x\^{a-1}$ term in front, the amplitude of these waves depends only on the real part $a$ of the conjugate zeroes.

For example, here are plots of $ \psi’(x) $ using 10, 50, and 200 pairs of zeroes:

So when the Riemann Hypothesis says that all the non-trivial zeroes have real part 1/2, it’s hypothesizing that the non-trivial zeta-zero waves have equal amplitude, i.e., they make equal contributions to counting the primes.

In Fourier-poetic terms, when Flying Spaghetti Monster composed the music of the primes, he built the notes out of the zeroes of the Riemann zeta function. If the Riemann Hypothesis holds, he made all the non-trivial notes equally loud.

Topological Combinatorics and the Evasiveness Conjecture

2011-03-14T04:15:00+01:00

The Kahn, Saks, and Sturtevant approach to the Evasiveness Conjecture (see the original paper here) is an epic application of pure mathematics to computer science. I’ll give an overview of the approach here, and probably try to add some more information on the problem in other posts.

tl;dr The KSS approach provides an algebraic-topological attack to a combinatorial hypothesis, and reduces a graph complexity problem to a problem of contractibility and (not) finding fixed points.

First, the Evasiveness Conjecture states that any (non-trivial) monotone graph property is evasive. In other words, if you’re trying to figure out whether an undirected n-vertex graph satisfies a certain property (e.g., whether the graph contains a triangle or is connected), and this property is monotone (meaning that if you add more edges to the graph, then it still satisfies the property), then if all you’re allowed to do is ask questions of the form “Is edge (i, j) in the graph?”, then you need to query for every single edge before you can determine whether the graph satisfies the property or not. For example, if you want to figure out whether a graph G contains a clique of size 5, then you need to know whether each of the n(n-1)/2 possible edges is in the graph or not before you can answer for certain.

Next, given any monotone graph property on n-vertex graphs, we can associate it with a simplicial complex S (basically, an n-dimensional structure formed by gluing together a bunch of hypertriangles), by taking the complex to be the set of all n-vertex graphs that don’t satisfy the property.

Kahn, Saks, and Sturtevant then prove that if a monotone graph property is not evasive, then its associated simplicial complex is contractible, and thus (by the Lefschetz Fixed-Point theorem) any auto-simplicial map on the complex (a function from the complex to itself that preserves faces) has a fixed point.

Thus, we can prove that a monotone graph property is evasive by finding a simplicial map that has no fixed point (which we can do by showing that no orbit of the map is a face of the complex). This approach has been used to prove things like the evasiveness of graph properties when the number of vertices is prime or a prime power, and the evasiveness of all bipartite graph properties.

Item-to-Item Collaborative Filtering with Amazon's Recommendation System

2011-02-15T04:15:00+01:00

Introduction

In making its product recommendations, Amazon makes heavy use of an item-to-item collaborative filtering approach. This essentially means that for each item X, Amazon builds a neighborhood of related items S(X); whenever you buy/look at an item, Amazon then recommends you items from that item’s neighborhood. That’s why when you sign in to Amazon and look at the front page, your recommendations are mostly of the form “You viewed… Customers who viewed this also viewed…”.

Other approaches.

The item-to-item approach can be contrasted to:

A user-to-user collaborative filtering approach. This finds users similar to you (e.g., it could find users who bought a lot of items in common with you), and suggest items that they’ve bought but you haven’t.
A global, latent factorization approach. Rather than looking at individual items in isolation (in the item-to-item approach, if you and I both buy a book X, Amazon will make essentially the same recommendations based on X, regardless of what we’ve bought in the past), a global approach would look at all the items you’ve bought, and try to detect properties that characterize what you like. For example, if you buy a lot of science fiction books and also a lot of romance books, a global-approach algorithm might try to recommend you books with both science fiction and romance elements.

Pros/cons of the item-to-item approach:

Pros over the user-to-user approach: Amazon (and most applications) has many more users than items, so it’s computationally simpler to find similar items than it is to find similar users. Finding similar users is also a difficult algorithmic task, since individual users often have a very wide range of tastes, but individual items usually belong to relatively few genres.
Pros over the factorization approach: Simpler to implement. Faster to update recommendations: as soon as you buy a new book, Amazon can make a new recommendation in the item-to-item approach, whereas a factorization approach would have to wait until the factorization has been recomputed. The item-to-item approach can also be more easily leveraged in several areas, not only in the recommendations made to you, but also in the “similar items/other customers also bought” section when you look at a particular item.
Cons of the item-to-item approach: You don’t get very much diversity or surprise in item-to-item recommendations, so recommendations tend to be kind of “obvious” and boring.

How to find similar items

Since the item-to-item approach makes crucial use of similar items, here’s a high-level view of how to do it. First, associate each item with the set of users who have bought/looked at it. The similarity between any two items could then be a normalized measure of the number of users they have in common (i.e., the Jaccard index) or the cosine distance between the two items (imagine each item as a vector, with a 1 in the ith element if user i has bought it, and 0 otherwise).