Given a set of datapoints, we often want to know how many clusters the datapoints form. The gap statistic and the prediction strength are two practical algorithms for choosing the number of clusters.

Gap Statistic

The gap statistic algorithm works as follows:

For each i from 1 up to some maximum number of clusters,

Run a k-means algorithm on the original dataset to find i clusters, and sum the distance of all points from their cluster mean. Call this sum the dispersion.
Generate a set of reference datasets (of the same size as the original). One simple way of generating a reference dataset is to sample uniformly from the original dataset’s bounding rectangle; a more sophisticated approach is take into account the original dataset’s shape by sampling, say, from a rectangle formed from the original dataset’s principal components.
Calculate the dispersion of each of these reference datasets, and take their mean.
Define the ith gap by: log(mean dispersion of reference datasets) - log(dispersion of original dataset).

Once we’ve calculated all the gaps (we can add confidence intervals as well; see the original paper for the formula), we can select the number of clusters to be the one that gives the maximum gap. (Sidenote: I view the gap statistic as a very statistical-minded algorithm, since it compares the original dataset against a set of reference “control” datasets.)

For example, here I’ve generated three Gaussian clusters:

And running the gap statistic algorithm, we see that it correctly detects the number of clusters to be three:

For a sample R implementation of the gap statistic, see the Github repository here.

Prediction Strength

Another cluster-counting algorithm is the prediction strength algorithm. In contrast to the gap statistic (which, as mentioned above, I find very statistically), I see prediction strength as taking a more machine learning viewpoint, since it’s formulated as a supervised learning problem validated against a test set.

To calculate prediction strength, for each i from 1 up to some maximum number of clusters:

Divide the dataset into two groups, a training set and a test set.
Run a k-means algorithm on each set to find i clusters.
For each test cluster, count the proportion of pairs of points in that cluster that would remain in the same cluster, if each were assigned to its closest training cluster mean.
The minimum over these proportions is the prediction strength for i clusters.

Once we’ve calculated the prediction strength for each number of clusters, we select the number of clusters to be the maximum i such that the prediction strength for i is greater than some threshold. (The paper suggests 0.8 - 0.9 as a good threshold, and I’ve seen 0.8 work well in practice.)

Here’s the prediction strength algorithm run on the same example above:

Again, check out a sample R implementation of the prediction strength here.

In practice, I tend to prefer using the gap statistic algorithm, since it’s a little easier to code and it doesn’t require selecting an arbitrary threshold like the prediction strength does. I’ve also found that it gives slightly better results (though the original prediction strength paper has the opposite finding).

Appendix

I ended up giving a brief description of two very common clustering algorithms, k-means and Gaussian mixture models in the comments, so I figured I might as well bring them up here.

k-means algorithm

Suppose we have a set of datapoints that we want to cluster. We want to learn two things:

A description of the clusters themselves (so that if new points come in, we can assign them to a cluster).
Which clusters our current points fall into.

We start by initializing k cluster centers (e.g., by randomly choosing k points among our datapoints). Then we repeatedly

Step A: Assign each datapoint to the nearest cluster center.
Step B: Update all the cluster centers: for each cluster i, take the mean over all points currently in the cluster, and update cluster center i to be this mean.
(Repeat steps A and B above until the cluster assignments stop changing.)

And that’s pretty much it for k-means.

k-means from an EM point of View

To ease the transition into Gaussian mixture models, let’s also describe the k-means algorithm using EM language.

Note that if we knew for certain either 1) the exact cluster centers or 2) the cluster each point belonged to, we could trivially solve k-means, since

If we knew the exact cluster centers, all we’d have to do is assign each point to its nearest cluster center, and we’d be done.
If we knew which cluster each point belonged to, we could pick the cluster center by simply taking the mean over all points in that cluster.

The problem is that we know neither of these, and so we alternate between making educated guesses of each one:

In A step above, we pretend that we know the cluster centers, and based off this pretense, we guess which cluster each point belongs to. (This is also known as the E step in the EM algorithm.)
In the B step above, we do the reverse: we pretend that we know which cluster each point belongs to, and then try to guess the cluster centers. (This is also known as the M step in EM.)

Our guesses keep getting better and better, and eventually we’ll converge.

Gaussian Mixture Models

k-means has a hard notion of clustering: point X either belongs to cluster C or it doesn’t. But sometimes we want a soft notion instead: point X belongs to cluster C with probability p (according to a Gaussian kernel). This is where Gaussian mixture modeling comes in.

To run a GMM, we start by initializing $k$ Gaussians (say, by randomly choosing $k$ points to be the centers of the Gaussians and by setting the variance of each Gaussians to be the overall variance), and then we repeatedly:

E Step: Pretend we know the parameters of each Gaussian cluster, and assign each datapoint to Gaussian cluster i with appropriate probability.
M Step: Pretend we know the probabilities that each point belongs to a given cluster. Using these probabilities, update the means and variances of each Gaussian cluster: the new mean for cluster i is the weighted mean over all points (where the weight of each point X is the probability that X belongs to cluster i), and similarly for the new variance.

This is exactly like k-means in the EM formulation, except we replace the binary clustering formula with Gaussian kernels.

(Way back when, I went through all the Netflix prize papers. I’m now (very slowly) trying to clean up my notes and put them online. Eventually, I hope to have a more integrated tutorial, but here’s a rough draft for now.)

This is a summary of Bell and Koren’s 2007 Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights paper.

tl;dr This paper’s main innovation is deriving neighborhood weights by solving a least squares problem, instead of using a standard similarity function to compute weights.

This paper improves upon the standard neighborhood approach to collaborative filtering in three areas: better data normalization, better neighbor weights (this is the key section), and better use of user data. I’ll first review the standard neighborhood approach, and follow with a description of these enhancements.

Background: Standard Neighborhood Approach to Collaborative Filtering

Recall that there are two types of neighborhood approaches:

User-based approaches: to predict user i’s rating of item j, take the users most similar to user i, and perform a weighted average of their ratings of item j.
Item-based approaches: to predict user i’s rating of item j, perform a weighted average of user i’s ratings of items similar to item j.

For example, to predict how you would rate the first Harry Potter movie, the user-based approach looks at how your friends rated the first Harry Potter movie, while the item-based approach looks at how you rated movies like Lord of the Rings and Twilight.

Better Data Normalization

Suppose I ask my friend Chris whether I should watch the latest Twilight movie. He tells me he would rate it 4.0/5 stars. Great, that’s a high rating, so that means I should watch it – or does it? It turns out that Chris is a super cheerful guy who’s never met a movie he didn’t like, and his average rating for a movie is actually 4.5/5 stars. So Twilight is actually less than average for him, and hence 4.0/5 stars from Chris isn’t actually that hearty a recommendation.

As another example, suppose you look at doctor ratings on Yelp. They’re abnormally high: the average is far from 3/5 stars. Why is this? Maybe it’s harder for people to change doctors than it is to go to a new restaurant, so people might not want to rate a doctor poorly when they know they’ll have to see the doctor again. Thus, an average rating of 5 stars on a McDonalds restaurant is much more impressive than an average of 5 stars on Dr. Joe.

The lesson is that when using existing ratings, we should normalize out these types of effects, so that ratings are as comparable as possible.

Another way of thinking about this is that we are simply building a regression model. That is, for each user u, we have a model $r_{ui} = (\sum \theta_u x_{ui}) + SpecificRating$, where the $x_{ui}$ are common explanatory variables and we want to estimate $\theta_u$; and similarly for each item i. Once we’ve estimated the $\theta_u$, we can use the fancier neighborhood models on the specific ratings.

For example, suppose we want to predict Bob’s rating of Titanic. We’ve built a regression model with two explanatory variables, whether the movie was Oscar-nominated (1 if so, -1 if not) and whether the movie contains Kate Winslet (1 if so, -1 if not), and we’ve determined that Bob’s weights on these two variables are -2 (Bob tends to hate Oscar movies) and +1.5 (Bob likes Kate Winslet). Similarly, his friend John has weights +1 and -0.5 for these two variables (John likes Oscars, but dislikes Kate Winslet). So if we know that John rated Titanic a 4, then we have 4 = 1(1) + -0.5(1) + (John’s specific rating), so John’s specific rating of Titanic is 3.5. If we use John’s rating alone to estimate Bob’s, we might guess that Bob would rate Titanic -2(1) + 1.5(1) + (John’s specific rating) = 3.0.

To estimate the $\theta_u$, we actually perform this estimation in sequence: each explanatory variable is used to model the residual from the previous explanatory variable. Also, instead of using the maximum-likelihood unbiased estimator $\hat{\theta_u} = \frac{\sum r_{ui} x_{ui}}{x _ {ui} ^ 2}$, we shrink the weights to prevent overfitting. From a Bayesian point of view, the shrinkage arises from a hierarchical model where the true $\theta_u \sim N(\mu, \sigma^2)$, and $\hat{\theta_u} | \theta_u \sim N(\theta_u, \sigma_u^2)$, leading to $E(\theta_u | \hat{\theta_u}) = \frac{\sigma^2 \hat{\theta_u} + \sigma_u^2 \mu}{\sigma^2 + \sigma_u^2}$.

In practice, the explanatory variables Bell and Koren found to work well included the overall mean of all ratings, each movie’s specific mean, each user’s specific mean, time since movie release, time since user join, and number of ratings for each movie.

Better Neighbor Weights

Let’s consider some deficiencies of the neighborhood approach:

Suppose I want to use the first LOTR movie to predict ratings of the first Harry Potter movie. To do this, I need to say how much weight the first LOTR movie should have in this prediction. But how do I choose this weight? Standard neighborhood approaches essentially pick arbitrary similarity functions (e.g., Pearson correlation, cosine distance) as the weight, possibly testing several similarity functions to see which gives the best performance, but is there a more principled approach to choosing weights?
The standard neighborhood approach ignores the fact that neighbors aren’t independent. For example, suppose all three LOTR movies are neighbors of the first HP movie. Since the three LOTR movies are so similar to each other, the standard approach is overcounting their information. Here’s an analogy: suppose I ask five of my friends where I should eat tonight. Three of them live together (boyfriend, girlfriend, and roommate), and they all recently took a trip together to Japan and are sick of Japanese food, so they vehemently recommend against sushi. Thus, my friends’ recommendations have a stronger bias than would appear if I asked five friends who didn’t know each other at all.

We’ll see how using an optimization method to derive weights (as opposed to deriving weights via a similarity function) overcomes these two limitations.

Recall our problem: we want to predict $r_{ui}$, user u’s rating of item i, and what we have is a set $N(i; u)$ of K neighbors of item i that user u has also rated. (These K neighbors are selected via a similarity function, as is standard.) So what we want to do is find weights $w_{ij}$ such that $r_{ui} = \sum_{j \in N(i; u) w_{ij} r_{uj}}$. A natural approach, then, is simply to choose our weights to minimize $\min_w \sum_{v \neq u} \left( r_{vi} - \sum_{j \in N(i; u)} w_{ij} r_{vj}\right)^2$.

Notice how this optimization solves our two problems above: it’s not only a more principled approach (we choose our weights by minimizing squared error), but by deriving weights simultaneously, we overcome interaction effects.

Differentiating our cost function, we find that the optimal weights satisfy the equation $Aw = b$, where A is a $K \times K$ matrix defined by $A_{jk} = \sum_{v \neq u} r_{vj} r_{vk}$ and $b$ is a vector defined by $b_j = \sum_{v \neq u} r_{vj} r_{vi}$.

However, not all users have rated every movie, so some of the ratings may be missing from the above formulas. So we should instead use an estimate of A and b, such as $\bar{A}_{jk} = \frac{\sum_{v \in U(j,k)} r_{vj} r_{vk}}{|U(j, k)|}$, where $U(j, k)$ is the set of users who rated both j and k, and similarly for b. To avoid overfitting, we should further modify by shrinking to a common mean: $\hat{A}_{jk} = \frac{|U(J,K)|\bar{A}_{jk} + \beta A_{\mu}}{|U(j,k)| + \beta}$, where $\beta$ is a shrinkage parameter and $A_{\mu}$ is the mean over all $\bar{A}$, and similarly for b.

Note that another benefit of our optimization-derived weights is that the weights of neighbors are no longer constrained to sum to 1. Thus, if an item simply has no strong neighbors, the neighbors’ prediction will have only a small effect.

Also, when engineering these methods in practice, we should precompute all item-item similarities and all entries in the matrix $A$.

Better Use of User Data

Neighborhood models typically follow the item-based approach for two reasons:

There are typically many more users than items, and new users come in much more frequently than new items, so it is easier to compute all pairs of item-item similarities.
Users have diverse tastes, so they aren’t as similar to each other. For example, Alice and Eve may both like horror movies, but disagree on comedies.

But there are various reasons we might want to use a user-based approach in addition to an item-based approach (say, a user hasn’t rated many items yet, but we can find similar users based on other types of data, such as browsing history; or, we want to predict user u’s rating on item i, but user u hasn’t rated any items similar to i), so let’s see if we can get around these limitations.

To get around the first limitation, we can project users into a lower-dimensional space (say, by using a singular value decomposition), where we can use a space-partitioning data structure (e.g., a kd-tree) or a nearest-neighbor algorithm (e.g., locality sensitive hashing) to find neighboring users.

To get around the second limitation – that a user u may be predictive of user v for some items, but less so for others – we incorporate item-item similarity into our weighting method. That is, when using the user-neighborhood model to predict user u’s rating on item i, we give higher weight to items similar to i, by choosing the weights to minimize $\min_w \sum_{j \neq i} s_{ij} \left( r_{uj} - \sum_{v \in N(u, i)} w_{uv} r_{vj} \right)^2,$ where the $s_{ij}$ are item-item similarities.

Appendix: Shrinkage

Parameter shrinkage is used a couple times in the paper, so let’s explain what it means.

Suppose that we want to estimate the probability of a coin. If we flip it once and see heads, then the maximum-likelihood estimate of heads is 1. But (as is typical for maximum-likelihood estimates), this is severe overfitting, and what we should do instead is shrink this maximum-likelihood estimate to a prior estimate of the probability of heads, say 1/2. (Note that shrinkage doesn’t necessarily mean decreasing the number, just moving it towards a prior estimate).

How should we perform this shrinkage? If our maximum-likelihood estimate of our parameter $\theta$ is $x$ and our prior mean is $\mu$, a natural estimation of $\theta$ is to use a weighted mean $\alpha x + (1 - \alpha)\mu$, where $\alpha$ is some measure of the degree of belief in our maximum likelihood estimate.

This weighted average approach has several interpretations:

We can also view it as a shrinkage of our maximum likelihood estimate to our prior mean: $\alpha x + (1 - \alpha)\mu = x + (1 - \alpha) (\mu - x)$
We can also view it as a Bayesian posterior: if we use a prior $\theta \sim N(\mu, \tau)$ (where $\tau$ is the precision of our Gaussian, not the variance) and a conditional distribution $x | \theta \sim N(\theta, \tau_x)$, then the posterior mean of $\theta$ is $\theta = \frac{\tau_x}{\tau_x + \tau}x + \frac{\tau}{\tau_x + \tau}\mu,$ which is equivalent to the form above.

Recent Posts