Topic Modeling the Sarah Palin Emails

LDA-based Email Browser

Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I’ve been working on some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups.

I threw up a simple demo app to view the organized documents here.

What is Latent Dirichlet Allocation?

Briefly, given a set of documents, LDA tries to learn the latent topics underlying the set. It represents each document as a mixture of topics (generated from a Dirichlet distribution), each of which emits words with a certain probability.

For example, given the sentence “I listened to Justin Bieber and Lady Gaga on the radio while driving around in my car”, an LDA model might represent this sentence as 75% about music (a topic which, say, emits the words Bieber with 10% probability, Gaga with 5% probability, radio with 1% probability, and so on) and 25% about cars (which might emit driving with 15% probability and cars with 10% probability).
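To make the generative story concrete, here is a minimal sketch that samples words the way the example describes: pick a topic for each word according to the document's mixture, then pick a word from that topic. Only the probabilities quoted in the example come from the text; the `<other>` token (which soaks up the remaining probability mass), the function name, and the fixed 75/25 mixture are assumptions for illustration. A real LDA model would draw the mixture from a Dirichlet distribution and learn the topics from data.

```python
import random

# Hypothetical topic-word tables. Only the named probabilities come from the
# example sentence; "<other>" stands in for every other word in the topic.
TOPICS = {
    "music": {"Bieber": 0.10, "Gaga": 0.05, "radio": 0.01, "<other>": 0.84},
    "cars": {"driving": 0.15, "cars": 0.10, "<other>": 0.75},
}

def generate_document(topic_mixture, n_words, rng):
    """Sample n_words words: choose a topic per word, then a word from it."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[t] for t in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        vocab = list(TOPICS[topic])
        probs = [TOPICS[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return words

rng = random.Random(0)
doc = generate_document({"music": 0.75, "cars": 0.25}, n_words=20, rng=rng)
print(doc)
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic tables and the per-document mixtures that most plausibly produced them.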

If you’re familiar with latent semantic analysis, you can think of LDA as a generative version. (For a more in-depth explanation, I wrote an introduction to LDA here.)

Sarah Palin Email Topics

Here’s a sample of the topics learnt by the model, as well as the top words for each topic. (Names, of course, are based on my own interpretation.)

Wildlife/BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, denby, fishing, …
Energy/Fuel/Oil/Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, …
Trig/Family/Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, …
Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, …
Education/Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, …
Presidential Campaign/Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, …

Here’s a sample email from the wildlife topic:

I also thought the classification for this email was really neat: the LDA model labeled it as 10% in the Presidential Campaign/Elections topic and 90% in the Wildlife topic, and it’s precisely a wildlife-based protest against Palin as a choice for VP:

Future Analysis

In a future post, I’ll perhaps see if we can glean any interesting patterns from the email topics. For example, for a quick graph now, if we look at the percentage of emails in the Trig/Family/Inspiration topic across time, we see that there’s a spike in April 2008 – exactly (and unsurprisingly) the month in which Trig was born.

Filtering for English Tweets: Unsupervised Language Detection on Twitter

(See a demo here.)

While working on a Twitter sentiment analysis project, I ran into the problem of needing to filter out all non-English tweets. (Asking the Twitter API for English-only tweets doesn’t seem to work, as it nonetheless returns tweets in Spanish, Portuguese, Dutch, Russian, and a couple other languages.)

Since I didn’t have any labeled data, I thought it would be fun to build an unsupervised language classifier. In particular, using an EM algorithm to build a naive Bayes model of English vs. non-English n-gram probabilities turned out to work quite well, so here’s a description.

EM Algorithm

Let’s recall the naive Bayes algorithm: given a tweet (a set of character n-grams), we estimate its language to be the language $L$ that maximizes

$$P(language = L | ngrams) \propto P(ngrams | language = L) P(language = L)$$

Thus, we need to estimate $P(ngram | language = L)$ and $P(language = L)$.

This would be easy if we knew the language of each tweet, since we could estimate

$P(xyz| language = English)$ as #(number of times “xyz” is a trigram in the English tweets) / #(total trigrams in the English tweets)
$P(language = English)$ as the proportion of English tweets.
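If labels were available, those two estimates would just be counting. The following is a rough Python sketch of that easy case (not the Ruby code used for the results in this post); `char_ngrams` and `estimate_from_labels` are hypothetical helper names, and no smoothing is applied.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def estimate_from_labels(labeled_tweets, n=3):
    """labeled_tweets: list of (text, language) pairs.

    Returns (prior, ngram_probs), where prior[lang] is the proportion of
    tweets in that language and ngram_probs[lang][gram] is the relative
    frequency of gram among that language's n-grams.
    """
    ngram_counts = {}
    tweet_counts = Counter()
    for text, lang in labeled_tweets:
        tweet_counts[lang] += 1
        ngram_counts.setdefault(lang, Counter()).update(char_ngrams(text, n))
    total_tweets = sum(tweet_counts.values())
    prior = {lang: c / total_tweets for lang, c in tweet_counts.items()}
    ngram_probs = {
        lang: {g: c / sum(counts.values()) for g, c in counts.items()}
        for lang, counts in ngram_counts.items()
    }
    return prior, ngram_probs

prior, ngram_probs = estimate_from_labels([
    ("the cat", "en"), ("the dog", "en"), ("el gato", "es"),
])
```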

Or, it would also be easy if we knew the n-gram probabilities for each language, since we could use Bayes’ theorem to compute the language probabilities for each tweet, and then take a weighted variant of the previous paragraph.

The problem is that we know neither of these. So what the EM algorithm says is that we can simply guess:

Pretend we know the language of each tweet (by randomly assigning them at the beginning).
Using this guess, we can compute the n-gram probabilities for each language.
Using the n-gram probabilities for each language, we can recompute the language probabilities of each tweet.
Using these recomputed language probabilities, we can recompute the n-gram probabilities.
And so on, recomputing the language probabilities and n-gram probabilities over and over. While our guesses will be off in the beginning, the probabilities will eventually converge to a (local) maximum of the likelihood. (In my tests, my language detector would sometimes correctly converge to an English detector, and sometimes it would converge to an English-and-Dutch detector.)
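The loop above can be sketched in a few dozen lines. This is an illustrative Python version under my own assumptions, not the Ruby implementation behind the results: it uses two classes, additive smoothing (the `smoothing` parameter is an addition the text does not mention), log-space arithmetic to avoid underflow, and hypothetical function names. The two classes come out unlabeled, so which one is "English" still has to be decided afterwards.

```python
import math
import random
from collections import Counter

def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def em_two_languages(tweets, n=3, n_iters=20, smoothing=0.1, seed=0):
    """Return, for each tweet, [P(class 0 | tweet), P(class 1 | tweet)]."""
    rng = random.Random(seed)
    docs = [Counter(char_ngrams(t, n)) for t in tweets]
    vocab = {g for d in docs for g in d}
    # Initial guess: randomly assign every tweet to one of the two classes.
    resp = [[1.0, 0.0] if rng.random() < 0.5 else [0.0, 1.0] for _ in docs]
    for _ in range(n_iters):
        # Recompute class priors and n-gram probabilities from weighted counts.
        log_prior, log_ngram = [], []
        for k in range(2):
            weight = sum(r[k] for r in resp)
            log_prior.append(
                math.log((weight + smoothing) / (len(docs) + 2 * smoothing))
            )
            counts = dict.fromkeys(vocab, smoothing)
            for d, r in zip(docs, resp):
                for g, c in d.items():
                    counts[g] += r[k] * c
            total = sum(counts.values())
            log_ngram.append({g: math.log(c / total) for g, c in counts.items()})
        # Recompute each tweet's class probabilities with Bayes' theorem.
        resp = []
        for d in docs:
            scores = [
                log_prior[k] + sum(c * log_ngram[k][g] for g, c in d.items())
                for k in range(2)
            ]
            top = max(scores)
            unnorm = [math.exp(s - top) for s in scores]
            resp.append([u / sum(unnorm) for u in unnorm])
    return resp
```

Because the starting assignment is random, different seeds can settle on different local optima, which matches the English versus English-and-Dutch behavior mentioned above.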

EM Analogy for the Layman

Why does this work? Suppose you suddenly move to New York, and you want a way to differentiate between tourists and New Yorkers based on their activities. Initially, you don’t know who’s a tourist and who’s a New Yorker, and you don’t know which are touristy activities and which are not. So you randomly place people into two groups A and B. (You randomly assign all tweets to a language.)

Now, given all the people in group A, you notice that a large number of them visit the Statue of Liberty; similarly, you notice that a large number of people in group B walk really quickly. (You notice that one set of words often has the n-gram “ing”, and that another set of words often has the n-gram “ias”; that is, you fix the language probabilities for each tweet, and recompute the n-gram probabilities for each language.)

So you start to put people visiting the Statue of Liberty in group A, and you start to put fast walkers in group B. (You fix the n-gram probabilities for each language, and recompute the language probabilities for each tweet.)

With your new A and B groups, you notice more differentiating factors: group A people tend to carry along cameras, and group B people tend to be more finance-savvy.

So you start to put camera-carrying folks in group A, and finance-savvy folks in group B.

And so on. Eventually, you settle on two groups of people and differentiating activities: people who walk slowly and visit the Statue of Liberty, and busy-looking people who walk fast and don’t visit. Assuming there are more native New Yorkers than tourists, you can then guess that the natives are the larger group.

Results

I wrote some Ruby code to implement the above algorithm, and trained it on half a million tweets, using English and “not English” as my two languages. The results looked surprisingly good from just eyeballing:

But in order to get some hard metrics and to tune parameters (e.g., n-gram size), I needed a labeled dataset. So I pulled a set of English-language and Spanish-language documents from Project Gutenberg, and split them to form training and test sets (the training set consisted of 2000 lines of English and 1000 lines of Spanish, and the test set of 1000 lines of English and 1000 lines of Spanish).

Trained on bigrams, the detector resulted in:

991 true positives (English lines correctly classified as English)
9 false negatives (English lines incorrectly classified as Spanish)
11 false positives (Spanish lines incorrectly classified as English)
989 true negatives (Spanish lines correctly classified as Spanish)

for a precision of 0.989 and a recall of 0.991.

Trained on trigrams, the detector resulted in:

992 true positives
8 false negatives
10 false positives
990 true negatives

for a precision of 0.990 and a recall of 0.992.
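For reference, the precision and recall figures follow directly from the counts above, treating English as the positive class. A small sketch of the arithmetic (the function name is just for illustration):

```python
def precision_recall(tp, fn, fp):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# (true positives, false negatives, false positives) from the two runs.
runs = {"bigrams": (991, 9, 11), "trigrams": (992, 8, 10)}
for name, (tp, fn, fp) in runs.items():
    p, r = precision_recall(tp, fn, fp)
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```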

Also, when I looked at the sentences the detector was making errors on, I saw that they almost always consisted of only one or two words (e.g., the incorrectly classified sentences were lines like “inmortal”, “autumn”, and “salir”). So the detector pretty much never made a mistake on a normal sentence!

Code/Demo

I put the code on my Github account, and a quick demo app, trained on trigrams from tweets with lang=”en” according to the Twitter API, is here.

Choosing a Machine Learning Classifier

How do you know what machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you’re simply looking for a “good enough” algorithm for your problem, or a place to start, here are some general guidelines I’ve found to work well over the years.

How large is your training set?

If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren’t powerful enough to provide accurate models.

You can also think of this as a generative model vs. discriminative model distinction.

Advantages of some particular algorithms

Advantages of Naive Bayes: Super simple, you’re just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn’t hold, a NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well. Its main disadvantage is that it can’t learn interactions between features (e.g., it can’t learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they’re together).

Advantages of Logistic Regression: Lots of ways to regularize your model, and you don’t have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

Advantages of Decision Trees: Easy to interpret and explain (for some people – I’m not sure I fall into this camp). They easily handle feature interactions and they’re non-parametric, so you don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don’t support online learning, so you have to rebuild your tree when new examples come in. Another disadvantage is that they easily overfit, but that’s where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they’re fast and scalable, and you don’t have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.

Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn’t linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive, hard to interpret, and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.

But…

Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, then whichever classification algorithm you use might not matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).

And to reiterate what I said above, if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize (and Middle Earth), just use an ensemble method to choose them all.

Kickstarter Data Analysis: Success and Pricing

Kickstarter is an online crowdfunding platform for launching creative projects. When starting a new project, project owners specify a deadline and the minimum amount of money they need to raise. They receive the money (less a transaction fee) only if they reach or exceed that minimum; otherwise, no money changes hands.

What’s particularly fun about Kickstarter is that in contrast to that other microfinance site, Kickstarter projects don’t ask for loans; instead, patrons receive pre-specified rewards unique to each project. For example, someone donating money to help an artist record an album might receive a digital copy of the album if they donate 20 dollars, or a digital copy plus a signed physical CD if they donate 50 dollars.

There are a bunch of neat projects, and I’m tempted to put one of my own on there soon, so I thought it would be fun to gather some data from the site and see what makes a project successful.

Ending Soon

The categories section really only provides a history of successful projects, though, so to get some data on unsuccessful projects as well, I took a look at the Ending Soon section of projects whose deadlines are about to pass.

It looks like about 50% of all Kickstarter projects get successfully funded by the deadline:

Interestingly, most of the final funding seems to happen in the final few days: with just 5 days left, only about 20% of all projects have been funded. (In other words, with just 5 days left, 60% of the projects that will eventually be successful are still unfunded.) So the approaching deadline seems to really spur people to donate. I wonder if it’s because of increased publicity in the final few days (the project owners begging everyone for help!) or if it’s simply procrastination in action (perhaps people want to wait to see if their donation is really necessary)?

Lesson: if you’re still not fully funded with only a couple days remaining, don’t despair.

Success vs. Failure

What factors lead a project to succeed? Are there any quantitative differences between projects that eventually get funded and those that don’t?

Two simple (if kind of obvious) things I noticed are that unsuccessful projects tend to require a larger amount of money:

and unsuccessful projects also tend to raise less money in absolute terms (i.e., it’s not just that they ask for too much money to reach their goal – they’re simply not receiving enough money as well):

Not terribly surprising, but it’s good to confirm (and I’m still working on finding other predictors).

Pledge Rewards

There’s a lot of interesting work in behavioral economics on pricing and choice – for example, the anchoring effect suggests that when building a menu, you should include an expensive item to make other menu items look reasonably priced in comparison, and the paradox of choice suggests that too many choices lead to a decision freeze – so one aspect of the Kickstarter data I was especially interested in was how pricing of rewards affects donations. For example, does pricing the lowest reward at 25 dollars lead to more money donated (people don’t lowball at 5 dollars instead) or less money donated (25 dollars is more money than most people are willing to give)? And what happens if a new reward at 5 dollars is added – again, does it lead to more money (now people can donate something they can afford) or less money (the people that would have paid 25 dollars switch to a 5 dollar donation)?

First, here’s a look at the total number of pledges at each price. (More accurately, it’s the number of claimed rewards at each price.) [Update: the original version of this graph was wrong, but I’ve since fixed it.]

Surprisingly, 5 dollar and 1 dollar donations are actually not the most common contribution.

To investigate pricing effects, I started by looking at all (successful) projects that had a reward priced at 1 dollar, and compared the number of donations at 1 dollar with the number of donations at the next lowest reward.

Up to about 15-20 dollars, there’s a steady increase in the proportion of people who choose the second reward over the first reward, but after that, the proportion decreases.

So this perhaps suggests that if you’re going to price your lowest reward at 1 dollar, your next reward should cost roughly 20 dollars (or slightly more, to maximize your total revenue). Pricing above 20 dollars is a little too expensive for the folks who want to support you, but aren’t rich enough to throw gads of money; maybe rewards below 20 dollars aren’t good enough to merit the higher donation.

Next, I’m planning on digging a little deeper into pricing effects and what makes a project successful, so I’ll hopefully have some more Kickstarter analysis in a future post. In the meantime, in case anyone else wants to take a look, I put the data onto my Github account.

A Mathematical Introduction to Least Angle Regression

(For a layman’s introduction, see here.)

Least Angle Regression (aka LARS) is a model selection method for linear regression (when you’re worried about overfitting or want your model to be easily interpretable). To motivate it, let’s consider some other model selection methods:

Forward selection starts with no variables in the model, and at each step it adds to the model the variable with the most explanatory power, stopping if the explanatory power falls below some threshold. This is a fast and simple method, but it can also be too greedy: we fully add variables at each step, so correlated predictors don’t get much of a chance to be included in the model. (For example, suppose we want to build a model for the deliciousness of a PB&J sandwich, and two of our variables are the amount of peanut butter and the amount of jelly. We’d like both variables to appear in our model, but since amount of peanut butter is (let’s assume) strongly correlated with the amount of jelly, once we fully add peanut butter to our model, jelly doesn’t add much explanatory power anymore, and so it’s unlikely to be added.)
Forward stagewise regression tries to remedy the greediness of forward selection by only partially adding variables. Whereas forward selection finds the variable with the most explanatory power and goes all out in adding it to the model, forward stagewise finds the variable with the most explanatory power and updates its weight by only epsilon in the correct direction. (So we might first increase the weight of peanut butter a little bit, then increase the weight of peanut butter again, then increase the weight of jelly, then increase the weight of bread, and then increase the weight of peanut butter once more.) The problem now is that we have to make a ton of updates, so forward stagewise can be very inefficient.
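To see why forward stagewise needs so many updates, here is a minimal pure-Python sketch of it. It assumes the columns of `X` are already standardized and `y` is centered; the function name, `epsilon`, and `n_steps` are illustrative choices rather than anything prescribed by the method.

```python
def forward_stagewise(X, y, epsilon=0.01, n_steps=1000):
    """X: list of rows with standardized columns; y: centered responses.

    At each step, find the variable most correlated with the current
    residual and nudge its weight by epsilon in the matching direction.
    """
    n, p = len(X), len(X[0])
    weights = [0.0] * p
    residual = list(y)
    for _ in range(n_steps):
        # Inner products with the residual (proportional to correlations
        # when the columns are standardized).
        corr = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda k: abs(corr[k]))
        if abs(corr[j]) < 1e-9:
            break  # nothing left to explain
        step = epsilon if corr[j] > 0 else -epsilon
        weights[j] += step
        for i in range(n):
            residual[i] -= step * X[i][j]
    return weights

# Toy data: y = 2*x1 + 0.5*x2 with orthogonal, zero-mean columns.
X = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
y = [2.0 * a + 0.5 * b for a, b in X]
weights = forward_stagewise(X, y)
```

Even on this four-point toy problem the loop takes a couple hundred tiny steps to reach the answer, which is exactly the inefficiency described above.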

LARS, then, is essentially forward stagewise made fast. Instead of making tiny hops in the direction of one variable at a time, LARS makes optimally-sized leaps in optimal directions. These directions are chosen to make equal angles (equal correlations) with each of the variables currently in our model. (We like peanut butter best, so we start eating it first; as we eat more, we get a little sick of it, so jelly starts looking equally appetizing, and we start eating peanut butter and jelly simultaneously; later, we add bread to the mix, etc.)

In more detail, LARS works as follows:

Assume for simplicity that we’ve standardized our explanatory variables to have zero mean and unit variance, and that our response variable also has zero mean.
Start with no variables in your model.
Find the variable $ x_1 $ most correlated with the residual. (Note that the variable most correlated with the residual is equivalently the one that makes the least angle with the residual, whence the name.)
Move in the direction of this variable until some other variable $ x_2 $ is just as correlated.
At this point, start moving in a direction such that the residual stays equally correlated with $ x_1 $ and $ x_2 $ (i.e., so that the residual makes equal angles with both variables), and keep moving until some variable $ x_3 $ becomes equally correlated with our residual.
And so on, stopping when we’ve decided our model is big enough.

For example, consider the following image (slightly simplified from the original LARS paper; $x_1, x_2$ are our variables, and $y$ is our response):

Our model starts at $ \hat{\mu_0} $.

The residual (the green line) makes a smaller angle with $ x_1 $ than with $ x_2 $, so we start moving in the direction of $ x_1 $. At $ \hat{\mu_1} $, the residual now makes equal angles with $ x_1, x_2 $, and so we start moving in a new direction that preserves this equiangularity/equicorrelation.
If there were more variables, we’d change directions again once a new variable made equal angles with our residual, and so on.

So when should you use LARS, as opposed to some other regularization method like lasso? There’s not really a clear-cut answer, but LARS tends to give results very similar to both lasso and forward stagewise (in fact, slight modifications to LARS give you lasso and forward stagewise), so I tend to just use lasso when I do these kinds of things, since the justifications for lasso make a little more sense to me. In fact, I don’t usually even think of LARS as a model selection method in its own right, but rather as a way to efficiently implement lasso (especially if you want to compute the full regularization path).

Introduction to Cointegration and Pairs Trading

Introduction

Suppose you see two drunks (i.e., two random walks) wandering around. The drunks don’t know each other (they’re independent), so there’s no meaningful relationship between their paths.

But suppose instead you have a drunk walking with her dog. This time there is a connection. What’s the nature of this connection? Notice that although each path individually is still an unpredictable random walk, given the location of one of the drunk or dog, we have a pretty good idea of where the other is; that is, the distance between the two is fairly predictable. (For example, if the dog wanders too far away from his owner, she’ll tend to move in his direction to avoid losing him, so the two stay close together despite a tendency to wander around on their own.) We describe this relationship by saying that the drunk and her dog form a cointegrating pair.

In more technical terms, if we have two non-stationary time series X and Y that become stationary when differenced (these are called integrated of order one series, or I(1) series; random walks are one example) such that some linear combination of X and Y is stationary (aka, I(0)), then we say that X and Y are cointegrated. In other words, while neither X nor Y alone hovers around a constant value, some combination of them does, so we can think of cointegration as describing a particular kind of long-run equilibrium relationship. (The definition of cointegration can be extended to multiple time series, with higher orders of integration.)

Other examples of cointegrated pairs:

Income and consumption: as income increases/decreases, so too does consumption.
Size of police force and amount of criminal activity
A book and its movie adaptation: while the book and the movie may differ in small details, the overall plot will remain the same.
Number of patients entering or leaving a hospital

An application

So why do we care about cointegration? In quantitative finance, cointegration forms the basis of the pairs trading strategy: suppose we have two cointegrated stocks X and Y, with the particular (for concreteness) cointegrating relationship X - 2Y = Z, where Z is a stationary series of zero mean. For example, X could be McDonald’s, Y could be Burger King, and the cointegration relationship would mean that X tends to be priced twice as high as Y, so that when X is more than twice the price of Y, we expect X to move down or Y to move up in the near future (and analogously, if X is less than twice the price of Y, we expect X to move up or Y to move down). This suggests the following trading strategy: if X - 2Y > d, for some positive threshold d, then we should sell X and buy Y (since we expect X to decrease in price and Y to increase), and similarly, if X - 2Y < -d, then we should buy X and sell Y.
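As a toy illustration of that rule, the decision can be written as a function of the current prices. The hedge ratio of 2 comes from the example relationship X - 2Y = Z; the function name and the string labels are made up for this sketch, and it ignores transaction costs, position sizing, and exits.

```python
def pairs_signal(price_x, price_y, threshold, hedge_ratio=2.0):
    """Return a trade suggestion based on the spread X - hedge_ratio * Y."""
    spread = price_x - hedge_ratio * price_y
    if spread > threshold:
        return "sell X, buy Y"   # X looks expensive relative to Y
    if spread < -threshold:
        return "buy X, sell Y"   # X looks cheap relative to Y
    return "hold"

print(pairs_signal(105.0, 50.0, threshold=3.0))
```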

Spurious regression

But why do we need the notion of cointegration at all? Why can’t we simply use, say, the R-squared between X and Y to see if X and Y have some kind of relationship? The reason is that standard regression analysis fails when dealing with non-stationary variables, leading to spurious regressions that suggest relationships even when there are none.

For example, suppose we regress two independent random walks against each other, and test for a linear relationship. A large percentage of the time, we’ll find high R-squared values and low p-values when using standard OLS statistics, even though there’s absolutely no relationship between the two random walks. As an illustration, here I simulated 1000 pairs of random walks of length 100, and found p-values less than 0.05 in 77% of the cases:
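A simulation along these lines can be sketched with the standard library alone. This is a reconstruction under assumptions, not the code behind the figure quoted above: it uses Gaussian steps, an OLS fit with intercept, and a fixed cutoff of about 1.98 (roughly the two-sided 5% critical value of a t distribution with 98 degrees of freedom) in place of exact p-values.

```python
import math
import random

def random_walk(n, rng):
    """Cumulative sum of n independent standard normal steps."""
    pos, path = 0.0, []
    for _ in range(n):
        pos += rng.gauss(0.0, 1.0)
        path.append(pos)
    return path

def slope_t_statistic(x, y):
    """t-statistic for the slope in the OLS regression y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope / std_err

def spurious_rate(n_pairs=1000, length=100, seed=0):
    """Fraction of independent random-walk pairs that look 'significant'."""
    rng = random.Random(seed)
    critical = 1.984  # approximate two-sided 5% cutoff for 98 degrees of freedom
    hits = 0
    for _ in range(n_pairs):
        x, y = random_walk(length, rng), random_walk(length, rng)
        if abs(slope_t_statistic(x, y)) > critical:
            hits += 1
    return hits / n_pairs

rate = spurious_rate()
print(f"share of pairs with p < 0.05: {rate:.2f}")
```

If the usual OLS assumptions held, this share would be close to 5%; with random walks it comes out far higher.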

A Cointegration Test

So how do you detect cointegration? There are several different methods, but the simplest is the Engle-Granger test, which works roughly as follows:

Check that $ X_t $ and $ Y_t $ are both I(1).
Estimate the cointegrating relationship $ Y_t = aX_t + e_t $ by ordinary least squares.
Check that the cointegrating residuals $ e_t $ are stationary (say, by using a so-called unit root test, e.g., the Dickey-Fuller test).
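Steps 2 and 3 can be sketched as follows. This is a simplified illustration with several assumptions: it skips step 1, fits the regression without an intercept exactly as written above, and uses the simplest Dickey-Fuller regression (no constant, no lagged differences). It only returns the test statistic; because the residuals come from an estimated regression, the statistic should be compared against tabulated Engle-Granger critical values rather than ordinary t tables. All function names are hypothetical.

```python
import math
import random

def fit_through_origin(x, y):
    """Least-squares slope a in y = a*x + e (no intercept, as in step 2)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def dickey_fuller_stat(e):
    """t-statistic of rho in the regression: e_t - e_{t-1} = rho * e_{t-1} + u_t."""
    lagged = e[:-1]
    delta = [curr - prev for prev, curr in zip(e[:-1], e[1:])]
    sxx = sum(v * v for v in lagged)
    rho = sum(l * d for l, d in zip(lagged, delta)) / sxx
    rss = sum((d - rho * l) ** 2 for l, d in zip(lagged, delta))
    std_err = math.sqrt(rss / (len(delta) - 1) / sxx)
    return rho / std_err

def engle_granger_stat(x, y):
    """Estimate y = a*x + e by OLS, then test the residuals for a unit root."""
    a = fit_through_origin(x, y)
    residuals = [yi - a * xi for xi, yi in zip(x, y)]
    return dickey_fuller_stat(residuals)

# Toy cointegrated pair: x is a random walk, y tracks 2*x plus stationary noise.
rng = random.Random(1)
x, pos = [], 0.0
for _ in range(500):
    pos += rng.gauss(0.0, 1.0)
    x.append(pos)
y = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
stat = engle_granger_stat(x, y)
print(f"Dickey-Fuller statistic on the residuals: {stat:.2f}")
```

A strongly negative statistic is evidence that the residuals are stationary, i.e., that the two series are cointegrated.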

Error-correction and Granger representation

Something else that should perhaps be mentioned is the relationship between cointegration and error-correction mechanisms: suppose we have two cointegrated series $ X_t, Y_t $, with autoregressive representations

$ X_t = a X_{t-1} + b Y_{t-1} + u_t $
$ Y_t = c X_{t-1} + d Y_{t-1} + v_t $

By the Granger representation theorem (which is actually a bit more general than this), we then have

$ \Delta X_t = \alpha_1 (Y_{t-1} - \beta X_{t-1}) + u_t $
$ \Delta Y_t = \alpha_2 (Y_{t-1} - \beta X_{t-1}) + v_t $

where $ Y_{t-1} - \beta X_{t-1} \sim I(0) $ is the cointegrating relationship. Regarding $ Y_{t-1} - \beta X_{t-1} $ as the extent of disequilibrium from the long-run relationship, and the $ \alpha_i $ as the speed (and direction) at which the time series correct themselves from this disequilibrium, we can see that this formalizes the way cointegrated variables adjust to match their long-run equilibrium.

Summary

So, just to summarize a bit, cointegration is an equilibrium relationship between time series that individually aren’t in equilibrium (you can kind of contrast this with (Pearson) correlation, which describes a linear relationship), and it’s useful because it allows us to incorporate both short-term dynamics (deviations from equilibrium) and long-run expectations (corrections to equilibrium).

Counting Clusters

Given a set of datapoints, we often want to know how many clusters the datapoints form. The gap statistic and the prediction strength are two practical algorithms for choosing the number of clusters.

Gap Statistic

The gap statistic algorithm works as follows:

For each i from 1 up to some maximum number of clusters,

Run a k-means algorithm on the original dataset to find i clusters, and sum the distance of all points from their cluster mean. Call this sum the dispersion.
Generate a set of reference datasets (of the same size as the original). One simple way of generating a reference dataset is to sample uniformly from the original dataset’s bounding rectangle; a more sophisticated approach is to take into account the original dataset’s shape by sampling, say, from a rectangle formed from the original dataset’s principal components.
Calculate the dispersion of each of these reference datasets, and take their mean.
Define the ith gap by: log(mean dispersion of reference datasets) - log(dispersion of original dataset).

Once we’ve calculated all the gaps (we can add confidence intervals as well; see the original paper for the formula), we can select the number of clusters to be the one that gives the maximum gap. (Sidenote: I view the gap statistic as a very statistical-minded algorithm, since it compares the original dataset against a set of reference “control” datasets.)
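Here is a rough pure-Python sketch of the procedure as described above (separate from the R implementation mentioned in this post). It uses the simple bounding-rectangle reference distribution and the log-of-mean gap defined above, and leaves out confidence intervals. The random restarts inside `kmeans_dispersion` are my own addition to reduce the chance of a poor k-means local optimum; all names and defaults are illustrative.

```python
import math
import random

def kmeans_dispersion(points, k, rng, n_restarts=5, n_iters=50):
    """Run k-means from several random starts; return the smallest summed
    distance of all points to their cluster means."""
    best = float("inf")
    for _ in range(n_restarts):
        centers = rng.sample(points, k)
        for _ in range(n_iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
                clusters[nearest].append(p)
            new_centers = [
                tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
                for i, cl in enumerate(clusters)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        dispersion = sum(min(math.dist(p, c) for c in centers) for p in points)
        best = min(best, dispersion)
    return best

def gap_statistic(points, max_k, n_refs=10, seed=0):
    """Return the list of gaps for k = 1..max_k."""
    rng = random.Random(seed)
    lows = [min(coord) for coord in zip(*points)]
    highs = [max(coord) for coord in zip(*points)]
    gaps = []
    for k in range(1, max_k + 1):
        observed = kmeans_dispersion(points, k, rng)
        ref_dispersions = []
        for _ in range(n_refs):
            # Reference dataset: uniform sample from the bounding rectangle.
            reference = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_dispersions.append(kmeans_dispersion(reference, k, rng))
        mean_ref = sum(ref_dispersions) / n_refs
        gaps.append(math.log(mean_ref) - math.log(observed))
    return gaps

# Toy usage: three Gaussian blobs in the plane.
rng = random.Random(42)
blobs = [(0.0, 0.0), (10.0, 0.0), (5.0, 10.0)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for cx, cy in blobs for _ in range(20)]
gaps = gap_statistic(points, max_k=5, n_refs=5)
best_k = max(range(len(gaps)), key=lambda i: gaps[i]) + 1
print(gaps, best_k)
```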

For example, here I’ve generated three Gaussian clusters:

And running the gap statistic algorithm, we see that it correctly detects the number of clusters to be three:

For a sample R implementation of the gap statistic, see the Github repository here.

Prediction Strength

Another cluster-counting algorithm is the prediction strength algorithm. In contrast to the gap statistic (which, as mentioned above, I find very statistical), I see prediction strength as taking a more machine learning viewpoint, since it’s formulated as a supervised learning problem validated against a test set.

To calculate prediction strength, for each i from 1 up to some maximum number of clusters:

Divide the dataset into two groups, a training set and a test set.
Run a k-means algorithm on each set to find i clusters.
For each test cluster, count the proportion of pairs of points in that cluster that would remain in the same cluster, if each were assigned to its closest training cluster mean.
The minimum over these proportions is the prediction strength for i clusters.

Once we’ve calculated the prediction strength for each number of clusters, we select the number of clusters to be the maximum i such that the prediction strength for i is greater than some threshold. (The paper suggests 0.8 - 0.9 as a good threshold, and I’ve seen 0.8 work well in practice.)
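The steps above can be sketched for a single number of clusters as follows. This is an illustrative Python version under assumptions (a single random 50/50 split, plain k-means with random initialization, test clusters with fewer than two points skipped), not the R implementation mentioned in this post; the function names are hypothetical.

```python
import math
import random
from itertools import combinations

def kmeans(points, k, rng, n_iters=50):
    """Plain k-means; returns (centers, labels)."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iters):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new_centers.append(
                tuple(sum(v) / len(members) for v in zip(*members))
                if members else centers[c]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

def prediction_strength(points, k, seed=0):
    """Minimum, over test clusters, of the share of point pairs that stay
    together when assigned to their closest training cluster mean."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    train_centers, _ = kmeans(train, k, rng)
    _, test_labels = kmeans(test, k, rng)
    # Where each test point lands under the training cluster means.
    predicted = [
        min(range(k), key=lambda c: math.dist(p, train_centers[c])) for p in test
    ]
    proportions = []
    for c in range(k):
        members = [i for i, l in enumerate(test_labels) if l == c]
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # fewer than two points: no pairs to check
        same = sum(1 for i, j in pairs if predicted[i] == predicted[j])
        proportions.append(same / len(pairs))
    return min(proportions) if proportions else 0.0
```

With one cluster the prediction strength is always 1, so the interesting question is how large the number of clusters can get before the value drops below the threshold.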

Here’s the prediction strength algorithm run on the same example above:

Again, check out a sample R implementation of the prediction strength here.

In practice, I tend to prefer using the gap statistic algorithm, since it’s a little easier to code and it doesn’t require selecting an arbitrary threshold like the prediction strength does. I’ve also found that it gives slightly better results (though the original prediction strength paper has the opposite finding).

Appendix

I ended up giving a brief description of two very common clustering algorithms, k-means and Gaussian mixture models, in the comments, so I figured I might as well bring them up here.

k-means algorithm

Suppose we have a set of datapoints that we want to cluster. We want to learn two things:

A description of the clusters themselves (so that if new points come in, we can assign them to a cluster).
Which clusters our current points fall into.

We start by initializing k cluster centers (e.g., by randomly choosing k points among our datapoints). Then we repeatedly

Step A: Assign each datapoint to the nearest cluster center.
Step B: Update all the cluster centers: for each cluster i, take the mean over all points currently in the cluster, and update cluster center i to be this mean.
(Repeat steps A and B above until the cluster assignments stop changing.)

And that’s pretty much it for k-means.
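For concreteness, here is a small sketch of steps A and B in Python with NumPy. The function name and the optional `init_centers` argument are my own additions (the latter just makes the toy run reproducible); by default the centers are k randomly chosen datapoints, as described above.

```python
import numpy as np

def kmeans(points, k, init_centers=None, max_iter=100, seed=0):
    """Returns (cluster centers, cluster label of each point)."""
    points = np.asarray(points, dtype=float)
    if init_centers is None:
        # Initialize by randomly choosing k of the datapoints.
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), size=k, replace=False)]
    else:
        centers = np.asarray(init_centers, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Step A: assign each datapoint to the nearest cluster center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Step B: move each center to the mean of its assigned points
        # (a center with no points is left where it is).
        centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
    return centers, labels

toy_points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
toy_centers, toy_labels = kmeans(toy_points, k=2, init_centers=toy_points[[0, 3]])
print(toy_labels)
```

The returned centers cover the first goal (a description of the clusters, usable for new points), and the labels cover the second (which cluster each current point falls into).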

k-means from an EM Point of View

To ease the transition into Gaussian mixture models, let’s also describe the k-means algorithm using EM language.

Note that if we knew for certain either 1) the exact cluster centers or 2) the cluster each point belonged to, we could trivially solve k-means, since

If we knew the exact cluster centers, all we’d have to do is assign each point to its nearest cluster center, and we’d be done.
If we knew which cluster each point belonged to, we could pick the cluster center by simply taking the mean over all points in that cluster.

The problem is that we know neither of these, and so we alternate between making educated guesses of each one:

In step A above, we pretend that we know the cluster centers, and based on this pretense, we guess which cluster each point belongs to. (This is also known as the E step in the EM algorithm.)
In step B above, we do the reverse: we pretend that we know which cluster each point belongs to, and then try to guess the cluster centers. (This is also known as the M step in EM.)

Our guesses keep getting better and better, and eventually we’ll converge.

Gaussian Mixture Models

k-means has a hard notion of clustering: point X either belongs to cluster C or it doesn’t. But sometimes we want a soft notion instead: point X belongs to cluster C with probability p (according to a Gaussian kernel). This is where Gaussian mixture modeling comes in.

To run a GMM, we start by initializing $k$ Gaussians (say, by randomly choosing $k$ points to be the centers of the Gaussians and by setting the variance of each Gaussian to be the overall variance), and then we repeatedly:

E Step: Pretend we know the parameters of each Gaussian cluster, and assign each datapoint to Gaussian cluster i with appropriate probability.
M Step: Pretend we know the probabilities that each point belongs to a given cluster. Using these probabilities, update the means and variances of each Gaussian cluster: the new mean for cluster i is the weighted mean over all points (where the weight of each point X is the probability that X belongs to cluster i), and similarly for the new variance.

This is exactly like k-means in the EM formulation, except we replace the binary clustering formula with Gaussian kernels.
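As a rough illustration, here is a one-dimensional version of the E and M steps in Python. The function names are hypothetical, and I also track a mixing weight per cluster, which the description above leaves out; treat this as a sketch under those assumptions rather than a full GMM implementation.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def fit_gmm_1d(x, init_means, n_iter=50):
    """EM for a 1-D Gaussian mixture. Returns (means, variances, weights, resp),
    where resp[i, n] is the probability that point n belongs to cluster i."""
    x = np.asarray(x, dtype=float)
    means = np.array(init_means, dtype=float)
    k = len(means)
    variances = np.full(k, x.var())   # start every Gaussian at the overall variance
    weights = np.full(k, 1.0 / k)     # mixing weights (an addition to the text)
    for _ in range(n_iter):
        # E step: pretend the Gaussian parameters are known and compute
        # the probability that each point belongs to each cluster.
        dens = weights[:, None] * gaussian_pdf(x[None, :], means[:, None], variances[:, None])
        resp = dens / dens.sum(axis=0, keepdims=True)
        # M step: pretend the membership probabilities are known and
        # update each cluster with probability-weighted means and variances.
        totals = resp.sum(axis=1)
        means = (resp * x[None, :]).sum(axis=1) / totals
        variances = (resp * (x[None, :] - means[:, None]) ** 2).sum(axis=1) / totals
        variances = np.maximum(variances, 1e-6)  # guard against collapse
        weights = totals / len(x)
    return means, variances, weights, resp

toy_x = np.concatenate([np.linspace(-1, 1, 21), np.linspace(9, 11, 21)])
gmm_means, gmm_vars, gmm_weights, gmm_resp = fit_gmm_1d(toy_x, init_means=[1.0, 9.0])
print(gmm_means)
```

If you replace `resp` with a 0/1 indicator of the most likely cluster, the M step reduces to the k-means center update.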

Hacker News Analysis

I was playing around with the Hacker News database Ronnie Roller made (thanks!), so I thought I’d post some of the things I looked at.

Activity on the Site

My first question was how activity on the site has increased over time. I looked at number of posts, points on posts, comments on posts, and number of users.

Posts

Hacker News Posts by Month

This looks like a strong linear fit, with an increase of 292 posts every month.

Comments

For comments, I fit a quadratic regression:

Hacker News Comments by Month

Points

A quadratic regression was also a better fit for points by month:

Hacker News Points by Month

Users

And again for the number of distinct users with a submission:

Hacker News Users by Month

Points and Comments

My next question was how points and comments related. Intuitively, posts with more points should have more comments, but it’s nice to check (maybe really good posts are kind of boring, so don’t lead to much discussion).

First, I plotted the points and comments of each individual post:

All Points vs. Comments

As expected, there’s an overall positive correlation between points and comments. Interestingly, there are quite a few high-points posts with no comments.

The plot’s quite noisy, though, so let’s try cleaning it up a bit, by taking the median number of comments per points level (and removing posts at the higher end, where we have little data):

Points vs. Median Comments

We see that posts with more points do tend to have more comments. Also, variance in number of comments is indicated by size and color, so (unsurprisingly) posts with more points have larger variance in their number of comments.

Quality of Posts

Another question was whether the quality of posts has degraded over time.

First, I computed a normalized “score” for each post, where a post’s score is defined as the number of points divided by the number of distinct users who made a submission in the same month. (The denominator is a rough proxy for the number of active users, and the goal of the score is to provide a way to compare posts across time.)

While the median score has declined over time (as perhaps should be expected, since only a fixed number of items can reach the front page):

the absolute number of quality posts, defined as posts with a score greater than the (admittedly arbitrarily chosen) threshold 0.01, has increased (until possibly a dip starting in 2010):

Number of Quality Posts

(Of course, without some further analysis, it’s not clear how well this score measures quality of posts, so take these numbers with a grain of salt.)

Company Trends

Also, I wanted to see how certain topics have trended over time, so I looked at how mentions of some of the big-name companies (Google, Facebook, Microsoft, Yahoo, Twitter, Apple) have changed. For each company, I plotted the percentage of posts with the company’s name in the title, and also made a smoothed plot comparing all six at the end. Note that Microsoft and Yahoo seem to be trending slightly downward, and Apple seems to be trending upward.

Mentions of Microsoft

Mentions of Yahoo

Mentions of Google

Mentions of Facebook

Mentions of Twitter

Mentions of Apple

All Trends

Layman's Introduction to Measure Theory

Measure theory studies ways of generalizing the notions of length/area/volume. Even in 2 dimensions, it might not be clear how to measure the area of the following fairly tame shape:

What's the area of this shape?

much less the “area” of even weirder shapes in higher dimensions or different spaces entirely.

For example, suppose you want to measure the length of a book (so that you can get a good sense of how long it takes to read). What’s a good measure? One possibility is to measure a book’s length in pages. Since books provide page counts, this is a fairly easy measure to get. However, different versions of the same book (e.g., hardcover and paperback versions) tend to have different page counts, so this page measure doesn’t satisfy the nice property of version invariance (which we would like to have, since hardcover and paperback versions of the same book take the same time to read). Also, not all books even have page counts (think Kindle books), so this measure doesn’t allow us to measure the length of all books we might want to read.

Another, possibly better measure is to measure a book’s length in terms of the number of words it contains. Now we do have version invariance (hardcover and paperback versions contain the same number of words) and we can measure the length of Kindle books as well. We can even do things like add two books together, and the measure/number of words of the concatenated books will pleasantly equal the sum of the measures/number of words of each book alone.

However, what happens when we try to measure a picture book’s length in words? We can’t – picture books are too pathological. Maybe we could say that a picture book has measure zero (since a picture book has no words), but then we get unhappy things like books of measure zero taking a really long time to read (imagine a really long picture book). So maybe a better option is to say that picture books are simply unmeasurable. Whenever someone asks for the length of a picture book, we ignore them, and this way our measure will continue to be a good approximation of reading time and we get to keep our other nice properties as well.

Similarly, measure theory asks questions like:

How do we define a measure on our space? (Jordan measure and Lebesgue measure are two different options in Euclidean space.)
What properties does our measure satisfy? (For example, does it satisfy translational invariance, rotational invariance, additivity?)
Which objects are measurable/which objects can we say it’s okay not to measure in order to preserve nice properties of our measure? (The Banach-Tarski ball can be rigidly reassembled into two copies of the same shape and size as the original, so we don’t want it to be measurable, since then we would lose additivity properties.)

And once we’ve defined a “generalized area” (our measure), we can try to generalize other mathematical concepts as well. For example, recall that the (Riemann) integral that you learn in calculus measures the area under a curve. What happens if we replace the “area” in the Riemann integral with our new, generalized measure (e.g., to get the Lebesgue integral)? Measure theory also helps make certain probability statements mathematically precise (e.g., we can say exactly what it means that a fair coin flipped infinitely often will “almost never” land heads more than 50% of the time).

Layman's Introduction to Random Forests

Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you’ll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like “Is X a romantic movie?”, “Does Johnny Depp star in X?”, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.

Thus, Willow is a decision tree for your movie preferences.

But Willow is only human, so she doesn’t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you’d like to ask a bunch of your friends, and watch movie X if most of them say they think you’ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you’ll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).

Now you don’t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you’re not absolutely sure of your preferences yourself – you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn’t use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you really really loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don’t change your love/hate decisions, you just say you love/hate some movies a little more or less (formally, you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you tell Woody that you liked Black Swan so much you watched it twice, you disliked Avatar, and don’t mention Harry Potter at all.

By using this ensemble, you hope that while each of your friends gives somewhat idiosyncratic recommendations (Willow thinks you like vampire movies more than you do, Woody thinks you like Pixar movies, and Cartman thinks you just hate everything), the errors get canceled out in the majority. Thus, your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.

There’s still one problem with your data, however. While you loved both Titanic and Inception, it wasn’t because you like movies that star Leonardo DiCaprio. Maybe you liked both movies for other reasons. Thus, you don’t want your friends to all base their recommendations on whether Leo is in a movie or not. So when each friend asks IMDB a question, only a random subset of the possible questions is allowed (i.e., when you’re building a decision tree, at each node you use some randomness in selecting the attribute to split on, say by randomly selecting an attribute or by selecting an attribute from a random subset). This means your friends aren’t allowed to ask whether Leonardo DiCaprio is in the movie whenever they want. So whereas previously you injected randomness at the data level, by perturbing your movie preferences slightly, now you’re injecting randomness at the model level, by making your friends ask different questions at different times.

And so your friends now form a random forest.

Netflix Prize Summary: Factorization Meets the Neighborhood

(Way back when, I went through all the Netflix prize papers. I’m now (very slowly) trying to clean up my notes and put them online. Eventually, I hope to have a more integrated tutorial, but here’s a rough draft for now.)

This is a summary of Koren’s 2008 Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model.

There are two approaches to collaborative filtering: neighborhood methods and latent factor models.

Neighborhood models are most effective at detecting very localized relationships (e.g., that people who like X-Men also like Spiderman), but poor at detecting a user’s overall signals.
Latent factor models are best at estimating overall structure (e.g., that a user likes horror movies), but are poor at detecting strong associations among small sets of closely related items.

Since the two approaches have complementary strengths and weaknesses, we should integrate the two; this integration is the focus of this paper.

Preliminaries

As mentioned in previous papers, we should normalize out common effects from movies. Throughout the rest of this paper, Koren uses a baseline estimate of overall rating mean + user deviation from average + movie deviation from average for the rating of user u on movie i; the latter two parameters are estimated by solving a regularized least squares problem.

Koren then describes using a binary matrix (1 for rated, 0 for not rated) as a source of implicit feedback. This is useful because the mere fact that a user rated many science fiction movies (say) suggests that the user likes science fiction movies.

A Neighborhood Model

Recall the previous paper, where we modeled each rating $r_{ui}$ as

$$r_{ui} = b_{ui}+ \sum_{j \in N(i; u)} (r_{uj} - b_{uj}) w_{ij},$$

where $N(i; u)$ is the set of the k items most similar to i among the items user u rated, and the $w_{ij}$ are parameters to be learned by solving a regularized least squares problem.

This paper makes several enhancements to that model. First, we replace $N(i; u)$ with $R^k(i; u)$, the intersection of the k items most similar to i (among all items) with the items user u rated. Also, we denote by $N^k(i; u)$ the intersection of the k items most similar to i with the items user u has provided implicit feedback for. This gives us

$$r_{ui} = b_{ui} + \sum_{j \in R^k(i; u)} (r_{uj} - b_{uj}) w_{ij} + \sum_{j \in N^k(i; u)} c_{ij},$$

where the $c_{ij}$ are another set of parameters to learn.

Notice that by taking the intersection of the k items most similar to i with the items user u rated (giving perhaps a set of size less than k), rather than taking the k items most similar to i among the items user u rated, we let our model be influenced not only by what a user rates, but also by what a user does not rate. For example, if a user does not rate LOTR 1 or LOTR 2, his predicted rating for LOTR 3 is penalized.

This implies that our current model encourages greater deviations from baseline estimates for users that provided many ratings or plenty of implicit feedback. In other words, for well-modeled users with a lot of input, we are willing to predict quirkier and less common recommendations; users we have less information about, on the other hand, receive safer, baseline estimates.

Nonetheless, this dichotomy between power users and newbie users is perhaps overemphasized by our current model, so we moderate the dichotomy by modifying our model to be

$$r_{ui} = b_{ui} + |R^k(i; u)|^{-0.5} \sum_{j \in R^k(i; u)} (r_{uj} - b_{uj}) w_{ij} + |N^k(i; u)|^{-0.5} \sum_{j \in N^k(i; u)} c_{ij}.$$

Parameters are determined by solving a regularized least squares problem.
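To make the role of the intersections concrete, here is a small Python sketch of the prediction rule above. All names and numbers are made up for illustration, and the weights $w_{ij}$ and $c_{ij}$ are simply given rather than learned.

```python
import math

def neighborhood_predict(b_ui, top_k_similar, rated, implicit, residuals, w_i, c_i):
    """Predict r_ui from the normalized neighborhood model.

    top_k_similar: set of the k items most similar to i (among all items)
    rated:         set R(u) of items user u rated
    implicit:      set N(u) of items user u gave implicit feedback for
    residuals:     dict j -> r_uj - b_uj for rated items
    w_i, c_i:      dicts j -> w_ij and j -> c_ij
    """
    r_k = top_k_similar & rated      # R^k(i; u), may have fewer than k items
    n_k = top_k_similar & implicit   # N^k(i; u)
    pred = b_ui
    if r_k:
        pred += sum(residuals[j] * w_i[j] for j in r_k) / math.sqrt(len(r_k))
    if n_k:
        pred += sum(c_i[j] for j in n_k) / math.sqrt(len(n_k))
    return pred

example_pred = neighborhood_predict(
    b_ui=3.0,
    top_k_similar={1, 2, 3, 4},
    rated={1, 2, 9},
    implicit={1, 2, 3, 9},
    residuals={1: 1.0, 2: -0.5, 9: 2.0},
    w_i={1: 0.4, 2: 0.2, 3: 0.1, 4: 0.3},
    c_i={1: 0.1, 2: 0.1, 3: 0.1, 4: 0.05},
)
print(example_pred)
```

Item 9 is rated but is not among the k most similar items, so it contributes nothing; a user with empty intersections simply gets the baseline estimate.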

Latent Factor Models Revisited

Typical SVD approaches are based on the following rule:

$$r_{ui} = b_{ui} + p_u^T q_i,$$

where $p_u$ is a user-factors vector and $q_i$ is an item-factors vector. We describe two enhancements.

Asymmetric-SVD

One suggestion is to replace $p_u$ with

$$|R(u)|^{-0.5} \sum_{j \in R(u)} (r_{uj} - b_{uj}) x_j + |N(u)|^{-0.5} \sum_{j \in N(u)} y_j,$$

where $R(u)$ is the set of items user u has rated, and $N(u)$ is the set of items user u has provided implicit feedback for. In other words, this model represents users through the items they prefer, rather than expressing users in a latent feature space. This model has several advantages:

Asymmetric-SVD does not parameterize users, so we do not need to wait to retrain the model when a new user comes in. Instead, we can handle new users as soon as they provide feedback.
Predictions are a direct function of past feedback, so we can easily explain predictions. (When using a pure latent feature solution, however, explainability is difficult.)

As usual, parameters are learned via a regularized least-squares minimization.

SVD++

Another approach is to continue modeling users as latent features, while adding implicit feedback. Thus, we replace $p_u$ with $p_u + |N(u)|^{-0.5} \sum_{j \in N(u)} y_j$. While we lose the easy explainability and immediate feedback of the Asymmetric-SVD model, this approach is likely more accurate.
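As a quick sketch, the SVD++ prediction $r_{ui} = b_{ui} + q_i^T \big(p_u + |N(u)|^{-0.5} \sum_{j \in N(u)} y_j\big)$ can be written in a few lines of Python. The function name and the toy factor values are hypothetical; in practice all of the factors would be learned.

```python
import numpy as np

def svdpp_predict(b_ui, p_u, q_i, y, implicit_items):
    """SVD++ prediction for one (user, item) pair.

    b_ui:           baseline estimate for user u on item i
    p_u, q_i:       user-factor and item-factor vectors
    y:              array of shape (n_items, n_factors) of implicit-feedback factors
    implicit_items: indices of the items in N(u)
    """
    implicit_items = list(implicit_items)
    if implicit_items:
        implicit = y[implicit_items].sum(axis=0) / np.sqrt(len(implicit_items))
    else:
        implicit = np.zeros_like(p_u)
    return b_ui + q_i @ (p_u + implicit)

toy_y = np.array([[0.2, 0.0], [0.0, 0.4], [0.2, 0.4], [0.0, 0.0]])
toy_pred = svdpp_predict(
    b_ui=3.5,
    p_u=np.array([0.1, -0.2]),
    q_i=np.array([1.0, 0.5]),
    y=toy_y,
    implicit_items=[0, 1, 2, 3],
)
print(toy_pred)
```

A user with no implicit feedback falls back to the plain $b_{ui} + p_u^T q_i$ rule.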

An Integrated Model

An integrated model incorporating baseline estimates, the neighborhood approach, and the latent factor approach is as follows:

$$r_{ui} = \left[\mu + b_u + b_i\right] +\left[q_i^T \big(p_u + |N(u)|^{-0.5}\sum_{j \in N(u)} y_j \big)\right] + \left[|R^k(i;u)|^{-0.5} \sum_{j \in R^k(i; u)}(r_{uj} - b_{uj})w_{ij}+|N^k(i;u)|^{-0.5} \sum_{j \in N^k(i; u)} c_{ij}\right].$$

Note that we have used $(\mu + b_u + b_i)$ as our baseline estimate. We also used the SVD++ model, but we could use the Asymmetric-SVD model instead.

This rule provides a 3-tier model for recommendations:

The first baseline group describes general properties of the item and user. For example, it may say that “The Sixth Sense” movie is known to be a good movie in general, and that Joe rates like the average user.
The next latent factor group may say that since “The Sixth Sense” and Joe rate high on the Psychological Thrillers Scale, Joe may like The Sixth Sense because he likes this genre of movies in general.
The final neighborhood tier makes fine-grained adjustments that are hard to profile, such as the fact that Joe gave a low rating to “Signs”, a similar psychological thriller by the same director.

As usual, model parameters are determined by minimizing the regularized squared error function through gradient descent.

Netflix Prize Summary: Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights

This is a summary of Bell and Koren’s 2007 Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights paper.

tl;dr This paper’s main innovation is deriving neighborhood weights by solving a least squares problem, instead of using a standard similarity function to compute weights.

This paper improves upon the standard neighborhood approach to collaborative filtering in three areas: better data normalization, better neighbor weights (this is the key section), and better use of user data. I’ll first review the standard neighborhood approach, and follow with a description of these enhancements.

Background: Standard Neighborhood Approach to Collaborative Filtering

Recall that there are two types of neighborhood approaches:

User-based approaches: to predict user i’s rating of item j, take the users most similar to user i, and perform a weighted average of their ratings of item j.
Item-based approaches: to predict user i’s rating of item j, perform a weighted average of user i’s ratings of items similar to item j.

For example, to predict how you would rate the first Harry Potter movie, the user-based approach looks at how your friends rated the first Harry Potter movie, while the item-based approach looks at how you rated movies like Lord of the Rings and Twilight.

Better Data Normalization

Suppose I ask my friend Chris whether I should watch the latest Twilight movie. He tells me he would rate it 4.0/5 stars. Great, that’s a high rating, so that means I should watch it – or does it? It turns out that Chris is a super cheerful guy who’s never met a movie he didn’t like, and his average rating for a movie is actually 4.5/5 stars. So Twilight is actually less than average for him, and hence 4.0/5 stars from Chris isn’t actually that hearty a recommendation.

As another example, suppose you look at doctor ratings on Yelp. They’re abnormally high: the average is far from 3/5 stars. Why is this? Maybe it’s harder for people to change doctors than it is to go to a new restaurant, so people might not want to rate a doctor poorly when they know they’ll have to see the doctor again. Thus, an average rating of 5 stars on a McDonald’s restaurant is much more impressive than an average of 5 stars on Dr. Joe.

The lesson is that when using existing ratings, we should normalize out these types of effects, so that ratings are as comparable as possible.

Another way of thinking about this is that we are simply building a regression model. That is, for each user u, we have a model $r_{ui} = (\sum \theta_u x_{ui}) + SpecificRating$, where the $x_{ui}$ are common explanatory variables and we want to estimate $\theta_u$; and similarly for each item i. Once we’ve estimated the $\theta_u$, we can use the fancier neighborhood models on the specific ratings.

For example, suppose we want to predict Bob’s rating of Titanic. We’ve built a regression model with two explanatory variables, whether the movie was Oscar-nominated (1 if so, -1 if not) and whether the movie contains Kate Winslet (1 if so, -1 if not), and we’ve determined that Bob’s weights on these two variables are -2 (Bob tends to hate Oscar movies) and +1.5 (Bob likes Kate Winslet). Similarly, his friend John has weights +1 and -0.5 for these two variables (John likes Oscars, but dislikes Kate Winslet). So if we know that John rated Titanic a 4, then we have 4 = 1(1) + -0.5(1) + (John’s specific rating), so John’s specific rating of Titanic is 3.5. If we use John’s rating alone to estimate Bob’s, we might guess that Bob would rate Titanic -2(1) + 1.5(1) + (John’s specific rating) = 3.0.

To estimate the $\theta_u$, we actually perform this estimation in sequence: each explanatory variable is used to model the residual from the previous explanatory variable. Also, instead of using the maximum-likelihood unbiased estimator $\hat{\theta_u} = \frac{\sum_i r_{ui} x_{ui}}{\sum_i x_{ui}^2}$, we shrink the weights to prevent overfitting. From a Bayesian point of view, the shrinkage arises from a hierarchical model where the true $\theta_u \sim N(\mu, \sigma^2)$, and $\hat{\theta_u} | \theta_u \sim N(\theta_u, \sigma_u^2)$, leading to $E(\theta_u | \hat{\theta_u}) = \frac{\sigma^2 \hat{\theta_u} + \sigma_u^2 \mu}{\sigma^2 + \sigma_u^2}$.

In practice, the explanatory variables Bell and Koren found to work well included the overall mean of all ratings, each movie’s specific mean, each user’s specific mean, time since movie release, time since user join, and number of ratings for each movie.

Better Neighbor Weights

Let’s consider some deficiencies of the neighborhood approach:

Suppose I want to use the first LOTR movie to predict ratings of the first Harry Potter movie. To do this, I need to say how much weight the first LOTR movie should have in this prediction. But how do I choose this weight? Standard neighborhood approaches essentially pick arbitrary similarity functions (e.g., Pearson correlation, cosine similarity) as the weight, possibly testing several similarity functions to see which gives the best performance, but is there a more principled approach to choosing weights?
The standard neighborhood approach ignores the fact that neighbors aren’t independent. For example, suppose all three LOTR movies are neighbors of the first HP movie. Since the three LOTR movies are so similar to each other, the standard approach is overcounting their information. Here’s an analogy: suppose I ask five of my friends where I should eat tonight. Three of them live together (boyfriend, girlfriend, and roommate), and they all recently took a trip together to Japan and are sick of Japanese food, so they vehemently recommend against sushi. Thus, my friends’ recommendations have a stronger bias than would appear if I asked five friends who didn’t know each other at all.

We’ll see how using an optimization method to derive weights (as opposed to deriving weights via a similarity function) overcomes these two limitations.

Recall our problem: we want to predict $r_{ui}$, user u’s rating of item i, and what we have is a set $N(i; u)$ of K neighbors of item i that user u has also rated. (These K neighbors are selected via a similarity function, as is standard.) So what we want to do is find weights $w_{ij}$ such that $r_{ui} = \sum_{j \in N(i; u)} w_{ij} r_{uj}$. A natural approach, then, is simply to choose our weights to solve $\min_w \sum_{v \neq u} \left( r_{vi} - \sum_{j \in N(i; u)} w_{ij} r_{vj}\right)^2$.

Notice how this optimization solves our two problems above: it’s not only a more principled approach (we choose our weights by minimizing squared error), but by deriving weights simultaneously, we overcome interaction effects.

Differentiating our cost function, we find that the optimal weights satisfy the equation $Aw = b$, where A is a $K \times K$ matrix defined by $A_{jk} = \sum_{v \neq u} r_{vj} r_{vk}$ and $b$ is a vector defined by $b_j = \sum_{v \neq u} r_{vj} r_{vi}$.
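Here is a minimal NumPy sketch of solving $Aw = b$ for the interpolation weights. It assumes the idealized case where every other user has rated item i and all K neighbors (no missing ratings), and the function name and toy ratings are made up for illustration.

```python
import numpy as np

def interpolation_weights(neighbor_ratings, target_ratings):
    """Solve A w = b for the neighbor weights.

    neighbor_ratings: array of shape (n_users, K); entry (v, j) is r_vj, the
                      rating of other user v on neighbor item j
    target_ratings:   array of shape (n_users,); entry v is r_vi
    """
    A = neighbor_ratings.T @ neighbor_ratings   # A_jk = sum_v r_vj r_vk
    b = neighbor_ratings.T @ target_ratings     # b_j  = sum_v r_vj r_vi
    return np.linalg.solve(A, b)

# Five other users, two neighbor items. The target ratings are built so
# that the true weights are exactly 0.5 and 0.25.
toy_neighbors = np.array([[5.0, 1.0], [4.0, 2.0], [1.0, 5.0], [2.0, 4.0], [3.0, 3.0]])
toy_target = toy_neighbors @ np.array([0.5, 0.25])
toy_weights = interpolation_weights(toy_neighbors, toy_target)
print(toy_weights)
```

Because all K weights come out of one linear solve, correlated neighbors share credit instead of each being counted in full, and nothing forces the weights to sum to 1.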

However, not all users have rated every movie, so some of the ratings may be missing from the above formulas. So we should instead use an estimate of A and b, such as $\bar{A}_{jk} = \frac{\sum_{v \in U(j,k)} r_{vj} r_{vk}}{|U(j, k)|}$, where $U(j, k)$ is the set of users who rated both j and k, and similarly for b. To avoid overfitting, we should further modify by shrinking to a common mean: $\hat{A}_{jk} = \frac{|U(j,k)|\bar{A}_{jk} + \beta A_{\mu}}{|U(j,k)| + \beta}$, where $\beta$ is a shrinkage parameter and $A_{\mu}$ is the mean over all $\bar{A}$, and similarly for b.

Note that another benefit of our optimization-derived weights is that the weights of neighbors are no longer constrained to sum to 1. Thus, if an item simply has no strong neighbors, the neighbors’ prediction will have only a small effect.

Also, when engineering these methods in practice, we should precompute all item-item similarities and all entries in the matrix $A$.

Better Use of User Data

Neighborhood models typically follow the item-based approach for two reasons:

There are typically many more users than items, and new users come in much more frequently than new items, so it is easier to compute all pairs of item-item similarities.
Users have diverse tastes, so they aren’t as similar to each other. For example, Alice and Eve may both like horror movies, but disagree on comedies.

But there are various reasons we might want to use a user-based approach in addition to an item-based approach (say, a user hasn’t rated many items yet, but we can find similar users based on other types of data, such as browsing history; or, we want to predict user u’s rating on item i, but user u hasn’t rated any items similar to i), so let’s see if we can get around these limitations.

To get around the first limitation, we can project users into a lower-dimensional space (say, by using a singular value decomposition), where we can use a space-partitioning data structure (e.g., a kd-tree) or a nearest-neighbor algorithm (e.g., locality sensitive hashing) to find neighboring users.

To get around the second limitation – that a user u may be predictive of user v for some items, but less so for others – we incorporate item-item similarity into our weighting method. That is, when using the user-neighborhood model to predict user u’s rating on item i, we give higher weight to items similar to i, by choosing the weights to minimize $\min_w \sum_{j \neq i} s_{ij} \left( r_{uj} - \sum_{v \in N(u, i)} w_{uv} r_{vj} \right)^2,$ where the $s_{ij}$ are item-item similarities.

Appendix: Shrinkage

Parameter shrinkage is used a couple times in the paper, so let’s explain what it means.

Suppose that we want to estimate the probability that a coin lands heads. If we flip it once and see heads, then the maximum-likelihood estimate of that probability is 1. But (as is typical for maximum-likelihood estimates), this is severe overfitting, and what we should do instead is shrink this maximum-likelihood estimate to a prior estimate of the probability of heads, say 1/2. (Note that shrinkage doesn’t necessarily mean decreasing the number, just moving it towards a prior estimate).

How should we perform this shrinkage? If our maximum-likelihood estimate of our parameter $\theta$ is $x$ and our prior mean is $\mu$, a natural estimation of $\theta$ is to use a weighted mean $\alpha x + (1 - \alpha)\mu$, where $\alpha$ is some measure of the degree of belief in our maximum likelihood estimate.

This weighted average approach has several interpretations:

We can view it as a shrinkage of our maximum likelihood estimate to our prior mean: $\alpha x + (1 - \alpha)\mu = x + (1 - \alpha) (\mu - x)$
We can also view it as a Bayesian posterior: if we use a prior $\theta \sim N(\mu, \tau)$ (where $\tau$ is the precision of our Gaussian, not the variance) and a conditional distribution $x | \theta \sim N(\theta, \tau_x)$, then the posterior mean of $\theta$ is $\theta = \frac{\tau_x}{\tau_x + \tau}x + \frac{\tau}{\tau_x + \tau}\mu,$ which is equivalent to the form above.
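A tiny Python check of these two interpretations, using the coin example (maximum-likelihood estimate 1, prior mean 1/2). The function names and the particular precisions are made-up illustrations, not values from the paper.

```python
def shrink(x, mu, alpha):
    """Weighted mean of the maximum-likelihood estimate x and the prior mean mu."""
    return alpha * x + (1 - alpha) * mu

def posterior_mean(x, mu, tau_x, tau):
    """Gaussian posterior mean, with tau_x and tau as precisions."""
    return (tau_x * x + tau * mu) / (tau_x + tau)

mle, prior_mean = 1.0, 0.5
tau_x, tau = 1.0, 3.0              # hypothetical precisions
alpha = tau_x / (tau_x + tau)      # degree of belief in the MLE: 0.25

weighted = shrink(mle, prior_mean, alpha)
moved = mle + (1 - alpha) * (prior_mean - mle)
bayes = posterior_mean(mle, prior_mean, tau_x, tau)
print(weighted, moved, bayes)      # all three agree: 0.625
```

The more data we have (larger $\tau_x$ relative to $\tau$), the closer $\alpha$ gets to 1 and the less we shrink.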

Prime Numbers and the Riemann Zeta Function

Lots of people know that the Riemann Hypothesis has something to do with prime numbers, but most introductions fail to say what or why. I’ll try to give one angle of explanation.

Layman’s Terms

Suppose you have a bunch of friends, each with an instrument that plays at a frequency equal to the imaginary part of a zero of the Riemann zeta function. If the Riemann Hypothesis holds, you can create a song that sounds exactly at the prime-powered beats, by simply telling all your friends to play at the same volume.

Mathematical Terms

Let $ \pi(x) $ denote the number of primes less than or equal to x. Recall Gauss’s approximation: $ \pi(x) \approx \int_2^x \frac{1}{\log t} \,dt $ (aka, the “probability that a number n is prime” is approximately $ \frac{1}{\log n} $).

Riemann improved on Gauss’s approximation by discovering an exact formula $ P(x) = A(x) - E(x) $ for counting the primes, where

$ P(x) = \sum_{p^k \le x} \frac{1}{k} $ performs a weighted count of the prime powers less than or equal to x. [Think of this as a generalization of the prime counting function.]
$ A(x) = \int_0^x \frac{1}{\log t} \,dt + \int_x^{\infty} \frac{1}{t(t^2 -1) \log t} \,dt - \log 2 $ is a kind of generalization of Gauss’s approximation.
$ E(x) = \sum_{z : \zeta(z) = 0} \int_0^{x^z} \frac{1}{\log t} \,dt $ is an error-correcting factor that depends on the zeroes of the Riemann zeta function.

In other words, if we use a simple Gauss-like approximation to the distribution of the primes, the zeroes of the Riemann zeta function sweep up after our errors.

Let’s dig a little deeper. Instead of using Riemann’s formula, I’m going to use an equivalent version

$$ \psi(x) = \left(x + \sum_{n = 1}^{\infty} \frac{x^{-2n}}{2n} - \log 2\pi\right) - \sum_{z : \zeta(z) = 0} \frac{x^z}{z} $$

where $ \psi(x) = \sum_{p^k \le x} \log p $. This formula has the same $P(x) = A(x) - E(x)$ form as above, this time with:

$ P(x) = \psi(x) = \sum_{p^k \le x} \log p $ is another kind of count of the primes.
$ A(x) = x + \sum_{n = 1}^{\infty} \frac{x^{-2n}}{2n} - \log 2\pi $ is another kind of approximation to $P(x)$.
$ E(x) = \sum_{z : \zeta(z) = 0} \frac{x^z}{z} $ is another error-correcting term that depends on the (non-trivial) zeroes of the Riemann zeta function.

So we can again interpret it as an error-correcting formula for counting the primes.

Now since $ \psi(x) $ is a step function that jumps at the prime powers, its derivative $ \psi'(x) $ has spikes at the prime powers and is zero everywhere else. So consider

$$ \psi'(x) = 1 - \frac{1}{x(x^2 - 1)} - \sum_z x^{z-1} $$

It’s well-known that the zeroes of the Riemann zeta function are symmetric about the real axis, so the (non-trivial) zeroes come in conjugate pairs $ z, \bar{z} $. But $ x^{z-1} + x^{\bar{z} - 1} $ is just a wave whose amplitude depends on the real part of $z$ and whose frequency depends on the imaginary part (i.e., if $ z = a + bi $, then $ x^{z-1} + x^{\bar{z}-1} = 2x^{a-1} \cos (b \log x) $), which means $ \psi'(x) $ can be decomposed into a sum of zeta-zero waves. Note that because of the $2x^{a-1}$ term in front, the amplitude of these waves depends only on the real part $a$ of the conjugate zeroes.
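
Here is a rough Python sketch of that decomposition. It hardcodes the imaginary parts of the first ten non-trivial zeroes (rounded to six decimals) and, as an assumption, takes every real part to be 1/2, i.e., it assumes the Riemann Hypothesis for these zeroes. The function names are mine, and with so few zeroes the output is only a crude approximation to $ \psi'(x) $.

```python
import math

# Imaginary parts of the first ten non-trivial zeta zeroes (rounded).
ZERO_HEIGHTS = [
    14.134725, 21.022040, 25.010858, 30.424876, 32.935062,
    37.586178, 40.918719, 43.327073, 48.005151, 49.773832,
]


def zero_wave(x, a, b):
    """Closed form of x^(z-1) + x^(conj(z)-1) for z = a + bi."""
    return 2 * x ** (a - 1) * math.cos(b * math.log(x))


def zero_pair_sum(x, a, b):
    """The same quantity computed directly with complex arithmetic."""
    z = complex(a, b)
    return x ** (z - 1) + x ** (z.conjugate() - 1)


def psi_prime_approx(x, heights=ZERO_HEIGHTS, real_part=0.5):
    """Truncated version of 1 - 1/(x(x^2 - 1)) - sum of zeta-zero waves."""
    smooth = 1 - 1 / (x * (x * x - 1))
    waves = sum(zero_wave(x, real_part, b) for b in heights)
    return smooth - waves


if __name__ == "__main__":
    for x in (4.0, 5.0, 6.0, 7.0):
        print(x, round(psi_prime_approx(x), 3))
```

Adding more zeroes to `ZERO_HEIGHTS` sharpens the spikes at the prime powers.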

For example, here are plots of $ \psi'(x) $ using 10, 50, and 200 pairs of zeroes:

So when the Riemann Hypothesis says that all the non-trivial zeroes have real part 1/2, it’s hypothesizing that the non-trivial zeta-zero waves have equal amplitude, i.e., they make equal contributions to counting the primes.

In Fourier-poetic terms, when the Flying Spaghetti Monster composed the music of the primes, he built the notes out of the zeroes of the Riemann zeta function. If the Riemann Hypothesis holds, he made all the non-trivial notes equally loud.

Topological Combinatorics and the Evasiveness Conjecture

The Kahn, Saks, and Sturtevant approach to the Evasiveness Conjecture (see the original paper here) is an epic application of pure mathematics to computer science. I’ll give an overview of the approach here, and probably try to add some more information on the problem in other posts.

tl;dr The KSS approach provides an algebraic-topological attack to a combinatorial hypothesis, and reduces a graph complexity problem to a problem of contractibility and (not) finding fixed points.

First, the Evasiveness Conjecture states that any (non-trivial) monotone graph property is evasive. Suppose you’re trying to figure out whether an undirected n-vertex graph satisfies a certain property (e.g., whether the graph contains a triangle or is connected), and this property is monotone (meaning that if you add more edges to a graph that satisfies the property, then it still satisfies the property). If all you’re allowed to do is ask questions of the form “Is edge (i, j) in the graph?”, then the property is evasive if, in the worst case, you need to query every single edge before you can determine whether the graph satisfies the property or not. For example, if you want to figure out whether a graph G contains a clique of size 5, the conjecture says that in the worst case you need to know whether each of the n(n-1)/2 possible edges is in the graph or not before you can answer for certain.
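
For very small graphs, evasiveness can be checked by brute force. The sketch below is my own illustration (not part of the KSS approach): it computes the worst-case number of edge queries an optimal strategy needs by recursing over every partial assignment of edges, and compares that to the total number of edges. The names `decision_tree_depth` and `is_connected` are hypothetical, and the search is only feasible for a handful of vertices.

```python
from functools import lru_cache
from itertools import combinations


def is_connected(n, present):
    """Return True if the graph on vertices 0..n-1 with these edges is connected."""
    adj = {v: set() for v in range(n)}
    for u, v in present:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n


def decision_tree_depth(n, has_property):
    """Return (worst-case queries of the best strategy, number of possible edges)."""
    edges = list(combinations(range(n), 2))
    m = len(edges)

    def answer_is_forced(state):
        # state[i] is 1 (edge present), 0 (absent), or None (not yet queried).
        unknown = [i for i, s in enumerate(state) if s is None]
        results = set()
        for mask in range(1 << len(unknown)):
            full = list(state)
            for j, i in enumerate(unknown):
                full[i] = (mask >> j) & 1
            present = [edges[i] for i in range(m) if full[i]]
            results.add(has_property(n, present))
            if len(results) == 2:
                return False
        return True

    @lru_cache(maxsize=None)
    def depth(state):
        if answer_is_forced(state):
            return 0
        best = m
        for i, s in enumerate(state):
            if s is None:
                # The adversary picks the answer that hurts us most.
                worst = max(
                    depth(state[:i] + (b,) + state[i + 1:]) for b in (0, 1)
                )
                best = min(best, 1 + worst)
        return best

    return depth((None,) * m), m


if __name__ == "__main__":
    queries, total = decision_tree_depth(4, is_connected)
    print(queries, total)  # the property is evasive when the two numbers match
```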

Next, given any monotone graph property on n-vertex graphs, we can associate it with a simplicial complex S (basically, an n-dimensional structure formed by gluing together a bunch of hypertriangles), by taking the complex to be the set of all n-vertex graphs that don’t satisfy the property.

Kahn, Saks, and Sturtevant then prove that if a monotone graph property is not evasive, then its associated simplicial complex is contractible, and thus (by the Lefschetz Fixed-Point theorem) any auto-simplicial map on the complex (a function from the complex to itself that preserves faces) has a fixed point.

Thus, we can prove that a monotone graph property is evasive by finding a simplicial map that has no fixed point (which we can do by showing that no orbit of the map is a face of the complex). This approach has been used to prove things like the evasiveness of graph properties when the number of vertices is prime or a prime power, and the evasiveness of all bipartite graph properties.

Item-to-Item Collaborative Filtering with Amazon's Recommendation System

Introduction

In making its product recommendations, Amazon makes heavy use of an item-to-item collaborative filtering approach. This essentially means that for each item X, Amazon builds a neighborhood of related items S(X); whenever you buy/look at an item, Amazon then recommends you items from that item’s neighborhood. That’s why when you sign in to Amazon and look at the front page, your recommendations are mostly of the form “You viewed… Customers who viewed this also viewed…”.

Other approaches.

The item-to-item approach can be contrasted to:

A user-to-user collaborative filtering approach. This finds users similar to you (e.g., it could find users who bought a lot of items in common with you), and suggests items that they’ve bought but you haven’t.
A global, latent factorization approach. Rather than looking at individual items in isolation (in the item-to-item approach, if you and I both buy a book X, Amazon will make essentially the same recommendations based on X, regardless of what we’ve bought in the past), a global approach would look at all the items you’ve bought, and try to detect properties that characterize what you like. For example, if you buy a lot of science fiction books and also a lot of romance books, a global-approach algorithm might try to recommend you books with both science fiction and romance elements.

Pros/cons of the item-to-item approach:

Pros over the user-to-user approach: Amazon (and most applications) has many more users than items, so it’s computationally simpler to find similar items than it is to find similar users. Finding similar users is also a difficult algorithmic task, since individual users often have a very wide range of tastes, but individual items usually belong to relatively few genres.
Pros over the factorization approach: Simpler to implement. Faster to update recommendations: as soon as you buy a new book, Amazon can make a new recommendation in the item-to-item approach, whereas a factorization approach would have to wait until the factorization has been recomputed. The item-to-item approach can also be more easily leveraged in several areas, not only in the recommendations made to you, but also in the “similar items/other customers also bought” section when you look at a particular item.
Cons of the item-to-item approach: You don’t get very much diversity or surprise in item-to-item recommendations, so recommendations tend to be kind of “obvious” and boring.

How to find similar items

Since the item-to-item approach makes crucial use of similar items, here’s a high-level view of how to find them. First, associate each item with the set of users who have bought/looked at it. The similarity between any two items could then be a normalized measure of the number of users they have in common (e.g., the Jaccard index) or the cosine similarity between the two items (imagine each item as a vector, with a 1 in the ith element if user i has bought it, and 0 otherwise).
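
As a minimal sketch of the idea (certainly not Amazon’s actual implementation), here is how the two similarity measures and an item neighborhood could look in Python. The `purchases` data, the item and user names, and the `neighborhood` helper are all made up for illustration.

```python
import math

# Hypothetical toy data: each item maps to the set of users who bought it.
purchases = {
    "book_a": {"u1", "u2", "u3"},
    "book_b": {"u2", "u3", "u4"},
    "book_c": {"u5"},
}


def jaccard(a, b):
    """Shared users divided by the total number of distinct users."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity of the 0/1 user vectors, computed from the sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def neighborhood(item, purchases, k=2, sim=jaccard):
    """Return up to k other items most similar to `item`, skipping zero scores."""
    scores = [
        (sim(purchases[item], users), other)
        for other, users in purchases.items()
        if other != item
    ]
    scores.sort(reverse=True)
    return [other for score, other in scores[:k] if score > 0]


if __name__ == "__main__":
    print(jaccard(purchases["book_a"], purchases["book_b"]))
    print(cosine(purchases["book_a"], purchases["book_b"]))
    print(neighborhood("book_a", purchases))
```

For binary vectors the dot product is just the number of shared users, which is why the cosine can be computed directly from set sizes.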

Edwin Chen

Surge AI CEO: data labeling and RLHF, designed for the next generation of AI.

Ex: AI, data science at Google, Facebook, Twitter, Dropbox, MSR. Pure math and linguistics at MIT.
