- on Mon 14 March 2011

(Way back when, I went through all the Netflix prize papers. I’m now (very slowly) trying to clean up my notes and put them online. Eventually, I hope to have a more integrated tutorial, but here’s a rough draft for now.)

This is a summary of Bell and Koren’s 2007 Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights paper.

**tl;dr** This paper’s main innovation is deriving neighborhood weights by solving a least squares problem, instead of using a standard similarity function to compute weights.

This paper improves upon the standard neighborhood approach to collaborative filtering in three areas: better data normalization, better neighbor weights (this is the key section), and better use of user data. I’ll first review the standard neighborhood approach, and follow with a description of these enhancements.

## Background: Standard Neighborhood Approach to Collaborative Filtering

Recall that there are two types of neighborhood approaches:

- User-based approaches: to predict user i’s rating of item j, take the users most similar to user i, and perform a weighted average of their ratings of item j.
- Item-based approaches: to predict user i’s rating of item j, perform a weighted average of user i’s ratings of items similar to item j.

For example, to predict how you would rate the first Harry Potter movie, the user-based approach looks at how your friends rated the first Harry Potter movie, while the item-based approach looks at how you rated movies like Lord of the Rings and Twilight.

## Better Data Normalization

Suppose I ask my friend Chris whether I should watch the latest Twilight movie. He tells me he would rate it 4.0/5 stars. Great, that’s a high rating, so that means I should watch it – or does it? It turns out that Chris is a super cheerful guy who’s never met a movie he didn’t like, and his average rating for a movie is actually 4.5/5 stars. So Twilight is actually less than average for him, and hence 4.0/5 stars from Chris isn’t actually that hearty a recommendation.

As another example, suppose you look at doctor ratings on Yelp. They’re abnormally high: the average is far from 3/5 stars. Why is this? Maybe it’s harder for people to change doctors than it is to go to a new restaurant, so people might not want to rate a doctor poorly when they know they’ll have to see the doctor again. Thus, an average rating of 5 stars on a McDonalds restaurant is much more impressive than an average of 5 stars on Dr. Joe.

The lesson is that when using existing ratings, we should normalize out these types of effects, so that ratings are as comparable as possible.

Another way of thinking about this is that we are simply building a regression model. That is, for each user u, we have a model $r_{ui} = (\sum \theta_u x_{ui}) + SpecificRating$, where the $x_{ui}$ are common explanatory variables and we want to estimate $\theta_u$; and similarly for each item i. Once we’ve estimated the $\theta_u$, we can use the fancier neighborhood models on the specific ratings.

For example, suppose we want to predict Bob’s rating of Titanic. We’ve built a regression model with two explanatory variables, whether the movie was Oscar-nominated (1 if so, -1 if not) and whether the movie contains Kate Winslet (1 if so, -1 if not), and we’ve determined that Bob’s weights on these two variables are -2 (Bob tends to hate Oscar movies) and +1.5 (Bob likes Kate Winslet). Similarly, his friend John has weights +1 and -0.5 for these two variables (John likes Oscars, but dislikes Kate Winslet). So if we know that John rated Titanic a 4, then we have 4 = 1(1) + -0.5(1) + (John’s specific rating), so John’s specific rating of Titanic is 3.5. If we use John’s rating alone to estimate Bob’s, we might guess that Bob would rate Titanic -2(1) + 1.5(1) + (John’s specific rating) = 3.0.

To estimate the $\theta_u$, we actually perform this estimation in sequence: each explanatory variable is used to model the *residual* from the previous explanatory variable. Also, instead of using the maximum-likelihood unbiased estimator $\hat{\theta_u} = \frac{\sum r_{ui} x_{ui}}{x _ {ui} ^ 2}$, we shrink the weights to prevent overfitting. From a Bayesian point of view, the shrinkage arises from a hierarchical model where the true $\theta_u \sim N(\mu, \sigma^2)$, and $\hat{\theta_u} | \theta_u \sim N(\theta_u, \sigma_u^2)$, leading to $E(\theta_u | \hat{\theta_u}) = \frac{\sigma^2 \hat{\theta_u} + \sigma_u^2 \mu}{\sigma^2 + \sigma_u^2}$.

In practice, the explanatory variables Bell and Koren found to work well included the overall mean of all ratings, each movie’s specific mean, each user’s specific mean, time since movie release, time since user join, and number of ratings for each movie.

## Better Neighbor Weights

Let’s consider some deficiencies of the neighborhood approach:

- Suppose I want to use the first LOTR movie to predict ratings of the first Harry Potter movie. To do this, I need to say how much weight the first LOTR movie should have in this prediction. But how do I choose this weight? Standard neighborhood approaches essentially pick arbitrary similarity functions (e.g., Pearson correlation, cosine distance) as the weight, possibly testing several similarity functions to see which gives the best performance, but is there a more principled approach to choosing weights?
- The standard neighborhood approach ignores the fact that neighbors aren’t independent. For example, suppose all three LOTR movies are neighbors of the first HP movie. Since the three LOTR movies are so similar to each other, the standard approach is overcounting their information. Here’s an analogy: suppose I ask five of my friends where I should eat tonight. Three of them live together (boyfriend, girlfriend, and roommate), and they all recently took a trip together to Japan and are sick of Japanese food, so they vehemently recommend against sushi. Thus, my friends’ recommendations have a stronger bias than would appear if I asked five friends who didn’t know each other at all.

We’ll see how using an optimization method to derive weights (as opposed to deriving weights via a similarity function) overcomes these two limitations.

Recall our problem: we want to predict $r_{ui}$, user u’s rating of item i, and what we have is a set $N(i; u)$ of K neighbors of item i that user u has also rated. (These K neighbors are selected via a similarity function, as is standard.) So what we want to do is find weights $w_{ij}$ such that $r_{ui} = \sum_{j \in N(i; u) w_{ij} r_{uj}}$. A natural approach, then, is simply to choose our weights to minimize $\min_w \sum_{v \neq u} \left( r_{vi} - \sum_{j \in N(i; u)} w_{ij} r_{vj}\right)^2$.

Notice how this optimization solves our two problems above: it’s not only a more principled approach (we choose our weights by minimizing squared error), but by deriving weights simultaneously, we overcome interaction effects.

Differentiating our cost function, we find that the optimal weights satisfy the equation $Aw = b$, where A is a $K \times K$ matrix defined by $A_{jk} = \sum_{v \neq u} r_{vj} r_{vk}$ and $b$ is a vector defined by $b_j = \sum_{v \neq u} r_{vj} r_{vi}$.

However, not all users have rated every movie, so some of the ratings may be missing from the above formulas. So we should instead use an estimate of A and b, such as $\bar{A}_{jk} = \frac{\sum_{v \in U(j,k)} r_{vj} r_{vk}}{|U(j, k)|}$, where $U(j, k)$ is the set of users who rated both j and k, and similarly for b. To avoid overfitting, we should further modify by shrinking to a common mean: $\hat{A}_{jk} = \frac{|U(J,K)|\bar{A}_{jk} + \beta A_{\mu}}{|U(j,k)| + \beta}$, where $\beta$ is a shrinkage parameter and $A_{\mu}$ is the mean over all $\bar{A}$, and similarly for b.

Note that another benefit of our optimization-derived weights is that the weights of neighbors are no longer constrained to sum to 1. Thus, if an item simply has no strong neighbors, the neighbors’ prediction will have only a small effect.

Also, when engineering these methods in practice, we should precompute all item-item similarities and all entries in the matrix $A$.

## Better Use of User Data

Neighborhood models typically follow the item-based approach for two reasons:

- There are typically many more users than items, and new users come in much more frequently than new items, so it is easier to compute all pairs of item-item similarities.
- Users have diverse tastes, so they aren’t as similar to each other. For example, Alice and Eve may both like horror movies, but disagree on comedies.

But there are various reasons we might want to use a user-based approach *in addition to* an item-based approach (say, a user hasn’t rated many items yet, but we can find similar users based on other types of data, such as browsing history; or, we want to predict user u’s rating on item i, but user u hasn’t rated any items similar to i), so let’s see if we can get around these limitations.

To get around the first limitation, we can project users into a lower-dimensional space (say, by using a singular value decomposition), where we can use a space-partitioning data structure (e.g., a kd-tree) or a nearest-neighbor algorithm (e.g., locality sensitive hashing) to find neighboring users.

To get around the second limitation – that a user u may be predictive of user v for some items, but less so for others – we incorporate item-item similarity into our weighting method. That is, when using the user-neighborhood model to predict user u’s rating on item i, we give higher weight to items similar to i, by choosing the weights to minimize $\min_w \sum_{j \neq i} s_{ij} \left( r_{uj} - \sum_{v \in N(u, i)} w_{uv} r_{vj} \right)^2,$ where the $s_{ij}$ are item-item similarities.

## Appendix: Shrinkage

Parameter shrinkage is used a couple times in the paper, so let’s explain what it means.

Suppose that we want to estimate the probability of a coin. If we flip it once and see heads, then the maximum-likelihood estimate of heads is 1. But (as is typical for maximum-likelihood estimates), this is severe overfitting, and what we should do instead is shrink this maximum-likelihood estimate to a prior estimate of the probability of heads, say 1/2. (Note that shrinkage doesn’t necessarily mean decreasing the number, just moving it towards a prior estimate).

How should we perform this shrinkage? If our maximum-likelihood estimate of our parameter $\theta$ is $x$ and our prior mean is $\mu$, a natural estimation of $\theta$ is to use a weighted mean $\alpha x + (1 - \alpha)\mu$, where $\alpha$ is some measure of the degree of belief in our maximum likelihood estimate.

This weighted average approach has several interpretations:

- We can also view it as a shrinkage of our maximum likelihood estimate to our prior mean: $\alpha x + (1 - \alpha)\mu = x + (1 - \alpha) (\mu - x)$
- We can also view it as a Bayesian posterior: if we use a prior $\theta \sim N(\mu, \tau)$ (where $\tau$ is the precision of our Gaussian, not the variance) and a conditional distribution $x | \theta \sim N(\theta, \tau_x)$, then the posterior mean of $\theta$ is $\theta = \frac{\tau_x}{\tau_x + \tau}x + \frac{\tau}{\tau_x + \tau}\mu,$ which is equivalent to the form above.