<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Edwin Chen&#8217;s Blog</title>
  <link href="http://blog.echen.me/atom.xml" rel="self"/>
  <link href="http://blog.echen.me/"/>
  <updated>2013-03-31T16:41:11-07:00</updated>
  <id>http://blog.echen.me/</id>
  <author>
    <name>Edwin Chen</name>
    
  </author>

  
  <entry>
    <title>Improving Twitter search with real-time human computation</title>
    <link href="http://blog.echen.me/2013/01/08/improving-twitter-search-with-real-time-human-computation/"/>
    <updated>2013-01-08T13:08:00-08:00</updated>
    <id>http://blog.echen.me/2013/01/08/improving-twitter-search-with-real-time-human-computation</id>
    <content type="html">&lt;p&gt;&lt;em&gt;(This is a post from the &lt;a href=&quot;http://engineering.twitter.com/2013/01/improving-twitter-search-with-real-time.html&quot;&gt;Twitter Engineering Blog&lt;/a&gt; that I wrote with &lt;a href=&quot;https://twitter.com/alpa&quot;&gt;Alpa Jain&lt;/a&gt;.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One of the magical things about Twitter is that it opens a window to the world in &lt;strong&gt;real-time&lt;/strong&gt;. An event happens, and just seconds later, it&amp;#8217;s shared for people across the planet to see.&lt;/p&gt;

&lt;p&gt;Consider, for example, what happened when Flight 1549 crashed in the Hudson.&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p&gt;&lt;a href=&quot;http://twitpic.com/135xa&quot;&gt;http://twitpic.com/135xa&lt;/a&gt; - There&amp;#8217;s a plane in the Hudson. I&amp;#8217;m on the ferry going to pick up the people. Crazy.&lt;/p&gt;&amp;mdash; Janis Krums (@jkrums) &lt;a href=&quot;https://twitter.com/jkrums/status/1121915133&quot; data-datetime=&quot;2009-01-15T20:36:04+00:00&quot;&gt;January 15, 2009&lt;/a&gt;&lt;/blockquote&gt;


&lt;script async src=&quot;https://platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;




&lt;br /&gt;


&lt;p&gt;When Osama bin Laden was killed.&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p&gt;Helicopter hovering above Abbottabad at 1AM (is a rare event).&lt;/p&gt;&amp;mdash; Sohaib Athar (@ReallyVirtual) &lt;a href=&quot;https://twitter.com/ReallyVirtual/status/64780730286358528&quot; data-datetime=&quot;2011-05-01T19:58:24+00:00&quot;&gt;May 1, 2011&lt;/a&gt;&lt;/blockquote&gt;




&lt;br /&gt;


&lt;p&gt;Or when Mitt Romney mentioned binders during the presidential debates.&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p&gt;Boy, I&amp;#8217;m full of women! &lt;a href=&quot;https://twitter.com/search/%23debates&quot;&gt;#debates&lt;/a&gt;&lt;/p&gt;&amp;mdash; Romney&amp;#8217;s Binder (@RomneysBinder) &lt;a href=&quot;https://twitter.com/RomneysBinder/status/258383626918588417&quot; data-datetime=&quot;2012-10-17T01:47:11+00:00&quot;&gt;October 17, 2012&lt;/a&gt;&lt;/blockquote&gt;




&lt;br /&gt;


&lt;p&gt;When each of these events happened, people instantly came to Twitter &amp;#8211; and, in particular, Twitter search &amp;#8211; to discover what was happening.&lt;/p&gt;

&lt;p&gt;From a search and advertising perspective, however, these sudden events pose several challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The queries people perform have never before been seen, so it&amp;#8217;s impossible to know beforehand what they mean. How would you know that #bindersfullofwomen refers to politics, and not office accessories, or that people searching for &amp;#8220;Horses and Bayonets&amp;#8221; are interested in the debates?&lt;/li&gt;
&lt;li&gt;Since these spikes in search queries are so &lt;a href=&quot;http://arxiv.org/abs/1205.6855&quot;&gt;short-lived&lt;/a&gt;, there’s only a short window of opportunity to learn what they mean.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;So an event happens, people instantly come to Twitter to search for the event, and we need to teach our systems what these queries mean as quickly as we can, because in just a few hours those searches will be gone.&lt;/p&gt;

&lt;p&gt;How do we do this? We&amp;#8217;ll describe a novel real-time human computation engine we built that allows us to find search queries as soon as they&amp;#8217;re trending, send these queries to real humans to be judged, and finally incorporate these human annotations into our backend models.&lt;/p&gt;

&lt;h2&gt;Overview&lt;/h2&gt;

&lt;p&gt;Before we delve into the details, here&amp;#8217;s an overview of how the system works.&lt;/p&gt;

&lt;p&gt;(1) First, we monitor for which search queries are currently popular.&lt;/p&gt;


&lt;p&gt;Behind the scenes: we run a Storm topology that tracks statistics on search queries.&lt;/p&gt;


&lt;p&gt;For example: the query &amp;#8220;Big Bird&amp;#8221; may be averaging zero searches a day, but at 6pm on October 3, we suddenly see a spike in searches from the US.&lt;/p&gt;




&lt;p&gt;(2) Next, as soon as we discover a new popular search query, we send it to our human evaluation systems, where judges are asked a variety of questions about the query.&lt;/p&gt;


&lt;p&gt;Behind the scenes: when the Storm topology detects that a query has reached sufficient popularity, it connects to a Thrift API that dispatches the query to Amazon&amp;#8217;s Mechanical Turk service, and then polls Mechanical Turk for a response.&lt;/p&gt;


&lt;p&gt;For example: as soon as we notice &amp;#8220;Big Bird&amp;#8221; spiking, we may ask judges on Mechanical Turk to categorize the query, or provide other information (e.g., whether there are likely to be interesting pictures of the query, or whether the query is about a person or an event) that helps us serve relevant tweets and ads.&lt;/p&gt;


&lt;p&gt;(3) Finally, after a response from a judge is received, we push the information to our backend systems, so that the next time a user searches for the query, our machine learning models will make use of the additional information. For example, suppose our human judges tell us that &amp;#8220;Big Bird&amp;#8221; is related to politics; the next time someone performs this search, we know to surface ads by @barackobama or @mittromney, not ads about Dora the Explorer.&lt;/p&gt;

&lt;p&gt;Let’s now explore the first two sections above in more detail.&lt;/p&gt;

&lt;h2&gt;Monitoring for popular queries&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/nathanmarz/storm&quot;&gt;Storm&lt;/a&gt; is a distributed system for real-time computation. In contrast to &lt;em&gt;batch&lt;/em&gt; systems like Hadoop, which often introduce delays of hours or more, Storm allows us to run online data processing algorithms to discover search spikes as soon as they happen.&lt;/p&gt;

&lt;p&gt;In brief, running a job on Storm involves creating a Storm topology that describes the processing steps that must occur, and deploying this topology to a Storm cluster. A topology itself consists of three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Tuple streams&lt;/strong&gt; of data. In our case, these may be tuples of (search query, timestamp).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spouts&lt;/strong&gt; that produce these tuple streams. In our case, we attach spouts to our search logs, which get written to every time a search occurs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bolts&lt;/strong&gt; that process tuple streams. In our case, we use bolts for operations like updating total query counts, filtering out non-English queries, and checking whether an ad is currently being served up for the query.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Here’s a step-by-step walkthrough of how our popular query topology works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Whenever you perform a search on Twitter, the search request gets logged to a &lt;a href=&quot;http://kafka.apache.org/&quot;&gt;Kafka queue&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The Storm topology attaches a spout to this Kafka queue, and the spout emits a tuple containing the query and other metadata (e.g., the time the query was issued and its location) to a bolt for processing.&lt;/li&gt;
&lt;li&gt;This bolt updates the count of the number of times we&amp;#8217;ve seen this query, checks whether the query is &amp;#8220;currently popular&amp;#8221; (using various statistics like time-decayed counts, the geographic distribution of the query, and the last time this query was sent for annotations), and dispatches it to our human computation pipeline if so.&lt;/li&gt;
&lt;/ol&gt;
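&lt;p&gt;To make the &amp;#8220;time-decayed counts&amp;#8221; in step 3 concrete, here is a minimal Scala sketch of one way such a statistic could work. It is not the production bolt: the class names &lt;code&gt;DecayedCounter&lt;/code&gt; and &lt;code&gt;PopularityTracker&lt;/code&gt;, the half-life, and the threshold are all hypothetical, and the real check also uses signals (such as geographic distribution) that are left out here.&lt;/p&gt;

```scala
// Illustrative sketch only: names and constants are assumptions, not Twitter's code.
import scala.collection.mutable

/** Exponentially time-decayed counter: an observation's weight halves every `halfLifeSecs`. */
class DecayedCounter(halfLifeSecs: Double) {
  private var value = 0.0
  private var lastUpdate = 0.0

  /** Decay the stored value to time `now` (in seconds), then add one observation. */
  def increment(now: Double): Double = {
    value = valueAt(now) + 1.0
    lastUpdate = now
    value
  }

  /** The decayed count as of time `now`, without recording a new observation. */
  def valueAt(now: Double): Double = {
    val elapsed = math.max(0.0, now - lastUpdate)
    value * math.pow(0.5, elapsed / halfLifeSecs)
  }
}

/** Keeps one counter per query and flags queries whose decayed count reaches a threshold. */
class PopularityTracker(halfLifeSecs: Double, threshold: Double) {
  private val counters = mutable.Map.empty[String, DecayedCounter]

  /** Record one search; returns true if the query now looks "currently popular". */
  def observe(query: String, now: Double): Boolean = {
    val counter = counters.getOrElseUpdate(query, new DecayedCounter(halfLifeSecs))
    counter.increment(now) >= threshold
  }
}
```

&lt;p&gt;Because old observations fade away, a query that was searched heavily last week but not today falls back below the threshold on its own, while a sudden burst crosses it within a few searches.&lt;/p&gt;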


&lt;p&gt;One interesting feature of our popularity algorithm is that we often rejudge queries that have been annotated before, since the intent of a search can change. For example, perhaps people normally search for &amp;#8220;Clint Eastwood&amp;#8221; because they&amp;#8217;re interested in his movies, but during the Republican National Convention users may have wanted to see tweets that were more political in nature.&lt;/p&gt;

&lt;h2&gt;Human evaluation of popular search queries&lt;/h2&gt;

&lt;p&gt;At Twitter, we use &lt;a href=&quot;http://blog.echen.me/2012/04/25/making-the-most-of-mechanical-turk-tips-and-best-practices/&quot;&gt;human computation&lt;/a&gt; for a variety of tasks. (See also &lt;a href=&quot;https://github.com/twitter/clockworkraven&quot;&gt;Clockwork Raven&lt;/a&gt;, an open-source project we built that makes launching tasks easier.) For example, we often run experiments to measure ad relevance and search quality, we use it to gather data to train and evaluate our machine learning models, and in this section we&amp;#8217;ll describe how we use it to boost our understanding of popular search queries.&lt;/p&gt;

&lt;p&gt;So suppose that our Storm topology has detected that the query &amp;#8220;Big Bird&amp;#8221; is suddenly spiking. Since the query may remain popular for only a few hours, we send it off to live humans, who can help us quickly understand what it means; this dispatch is performed via a Thrift service that allows us to design our tasks in a &lt;a href=&quot;http://engineering.twitter.com/2012/08/crowdsourced-data-analysis-with.html&quot;&gt;web frontend&lt;/a&gt;, and later programmatically submit them to Mechanical Turk using any of the different languages we use across Twitter.&lt;/p&gt;

&lt;p&gt;On Mechanical Turk, judges are asked several questions about the query that help us serve better ads. Without going into the exact questions, here are flavors of a few possibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What category does the query belong to? For example, &amp;#8220;Stanford&amp;#8221; may typically be an education-related query, but perhaps there&amp;#8217;s a football game between Stanford and Berkeley at the moment, in which case the current search intent would be sports.&lt;/li&gt;
&lt;li&gt;Does the query refer to a person? If so, who, and what is their Twitter handle if they have one? For example, the query &amp;#8220;Happy Birthday Harry&amp;#8221; may be trending, but it&amp;#8217;s hard to know beforehand which of the numerous celebrities named Harry it&amp;#8217;s referring to. Is it &lt;a href=&quot;https://twitter.com/onedirection&quot;&gt;One Direction&lt;/a&gt;&amp;#8217;s &lt;a href=&quot;https://twitter.com/Harry_Styles&quot;&gt;Harry Styles&lt;/a&gt;, in which case the searcher is likely to be interested in teen pop? Harry Potter, in which case the searcher is likely to be interested in fantasy novels? Or someone else entirely?&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;Turkers in the machine&lt;/h3&gt;

&lt;p&gt;Since humans are core to this system, let&amp;#8217;s describe how our workforce was designed to give us fast, reliable results.&lt;/p&gt;

&lt;p&gt;For completing all our tasks, we use a small &lt;em&gt;custom&lt;/em&gt; pool of Mechanical Turk judges to ensure high quality. Other typical possibilities in the crowdsourcing world are to use a static set of in-house judges, to use the standard worker filters that Amazon provides, or to go through an outside company like &lt;a href=&quot;http://crowdflower.com/&quot;&gt;Crowdflower&lt;/a&gt;. We&amp;#8217;ve experimented with these other solutions, and while they have their own benefits, we found that a custom pool fit our needs best for a few reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In-house judges can provide high-quality work as well, but they usually work standard hours (for example, 9 to 5 if they work onsite, or a relatively fixed and limited set of hours if they work from home), it can be difficult to communicate with them and schedule them for work, and it&amp;#8217;s hard to scale the hiring of more judges.&lt;/li&gt;
&lt;li&gt;Using Crowdflower or Amazon&amp;#8217;s standard filters makes it easy to scale the workforce, but their trust algorithms aren&amp;#8217;t perfect, so an endless problem is that spammy workers get through and many of the judgments will be very poor quality. Two methods of combatting low quality are to seed gold standard examples for which you know the true response throughout your task, or to use statistical analysis to determine which workers are the good ones, but these can be time-consuming and expensive to create, and we often run tasks of a free-response researchy nature for which these solutions don&amp;#8217;t work. Another problem is that using these filters gives you a &lt;em&gt;fluid&lt;/em&gt;, constantly changing set of workers, which makes them hard to train.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;In contrast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our custom pool of judges work virtually all day. For many of them, this is a full-time job, and they&amp;#8217;re geographically distributed, so our tasks complete quickly at all hours; we can easily ask for thousands of judgments before lunch, and have them finished by the time we get back, which makes iterating on our experiments much easier.&lt;/li&gt;
&lt;li&gt;We have several forums, mailing lists, and even live chatrooms set up, all of which makes it easy for judges to ask us questions and to respond to feedback. Our judges will even give &lt;em&gt;us&lt;/em&gt; suggestions on how to improve our tasks; for example, when we run categorization tasks, they&amp;#8217;ll often report helpful categories that we should add.&lt;/li&gt;
&lt;li&gt;Since we only launch tasks on demand, and Amazon provides a ready source of workers if we ever need more, our judges are never idly twiddling their thumbs waiting for tasks or completing busywork, and our jobs are rarely backlogged.&lt;/li&gt;
&lt;li&gt;Because our judges are culled from the best of Mechanical Turk, they&amp;#8217;re experts at the kinds of tasks we send, and can often provide higher quality at a faster rate than what even in-house judges provide. For example, they&amp;#8217;ll often use the forums and chatrooms to collaborate amongst themselves to give us the best judgments, and they&amp;#8217;re already familiar with the Firefox and Chrome scripts that help them be the most efficient at their work.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;All the benefits described above are especially valuable in this real-time search annotation case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Having highly trusted workers means we don&amp;#8217;t need to wait for multiple annotations on a single search query to confirm validity, so we can send responses to our backend as soon as a single judge responds. This entire pipeline is designed for &lt;em&gt;real-time&lt;/em&gt;, after all, so the lower the latency on the human evaluation part, the better.&lt;/li&gt;
&lt;li&gt;The static nature of our custom pool means that the judges are already familiar with our questions, and don&amp;#8217;t need to be trained again.&lt;/li&gt;
&lt;li&gt;Because our workers aren&amp;#8217;t limited to a fixed schedule or location, they can work anywhere, anytime &amp;#8211; which is a requirement for this system, since global event spikes on Twitter are not beholden to a 9-to-5.&lt;/li&gt;
&lt;li&gt;And with the multiple easy avenues of communication we have set up, it&amp;#8217;s easy for us to answer questions that might arise when we add new questions or modify existing ones.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;Singing telegram summary&lt;/h2&gt;

&lt;p&gt;Let&amp;#8217;s end with an example of the kind of top quality our workers provide, a crowdsourced singing summary we used to celebrate the project&amp;#8217;s launch.&lt;/p&gt;

&lt;iframe width=&quot;640&quot; height=&quot;510&quot; src=&quot;http://www.youtube.com/embed/EIK8iVnU5EU&quot; frameborder=&quot;0&quot; allowfullscreen&gt;&lt;/iframe&gt;



&lt;p&gt;This video was created entirely by our workers, from the crowdsourced lyrics, to the crowdsourced graphics, and even the piano playing and singing. Special thanks in particular to our amazing Turker, workasaurusrex, the musician and silky smooth crooner who brought the masterpiece together.&lt;/p&gt;

&lt;h2&gt;Thanks&lt;/h2&gt;

&lt;p&gt;Thanks to everyone on the Revenue and Storm teams, as well as our Turkers, for helping us launch this project.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Edge Prediction in a Social Graph: My Solution to Facebook&#8217;s User Recommendation Contest on Kaggle</title>
    <link href="http://blog.echen.me/2012/07/31/edge-prediction-in-a-social-graph-my-solution-to-facebooks-user-recommendation-contest-on-kaggle/"/>
    <updated>2012-07-31T10:17:00-07:00</updated>
    <id>http://blog.echen.me/2012/07/31/edge-prediction-in-a-social-graph-my-solution-to-facebooks-user-recommendation-contest-on-kaggle</id>
    <content type="html">&lt;p&gt;A couple weeks ago, Facebook launched a &lt;a href=&quot;http://www.kaggle.com/c/FacebookRecruiting/&quot;&gt;link prediction contest&lt;/a&gt; on Kaggle, with the goal of recommending missing edges in a social graph. &lt;a href=&quot;http://blog.echen.me/2011/09/07/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/&quot;&gt;I love investigating social networks&lt;/a&gt;, so I dug around a little, and since I did well enough to score one of the coveted prizes, I&amp;#8217;ll share my approach here.&lt;/p&gt;

&lt;p&gt;(For some background, the contest provided a training dataset of edges and a test set of nodes, and contestants were asked to predict missing outbound edges for the test nodes, using mean average precision as the evaluation metric.)&lt;/p&gt;
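&lt;p&gt;For readers unfamiliar with the metric, here is a minimal Scala sketch of mean average precision. It assumes the common &amp;#8220;average precision at a cutoff $k$&amp;#8221; definition, normalized by the smaller of $k$ and the number of true edges; the contest&amp;#8217;s exact variant and cutoff may differ, and the object and method names are made up for illustration.&lt;/p&gt;

```scala
// Illustrative sketch only: assumes average precision at cutoff k,
// normalized by min(number of true edges, k). Names are hypothetical.
object MeanAveragePrecision {
  /** Average precision at cutoff k for one ranked list of predicted nodes. */
  def averagePrecision(predicted: Seq[Int], actual: Set[Int], k: Int): Double = {
    if (actual.isEmpty) 0.0
    else {
      // Walk the top-k predictions, adding precision-at-rank for every hit.
      val (_, precisionSum) =
        predicted.distinct.take(k).zipWithIndex.foldLeft((0, 0.0)) {
          case ((hits, sum), (node, index)) =>
            if (actual.contains(node)) (hits + 1, sum + (hits + 1).toDouble / (index + 1))
            else (hits, sum)
        }
      precisionSum / math.min(actual.size, k)
    }
  }

  /** Mean of the per-node average precisions; the two sequences are aligned by position. */
  def meanAveragePrecision(predictions: Seq[Seq[Int]], actuals: Seq[Set[Int]], k: Int): Double =
    if (predictions.isEmpty) 0.0
    else predictions.zip(actuals).map { case (p, a) => averagePrecision(p, a, k) }.sum / predictions.size
}
```

&lt;p&gt;The metric rewards putting correct edges near the top of each list: a hit at rank 1 contributes more than the same hit at rank 3.&lt;/p&gt;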

&lt;h1&gt;Exploration&lt;/h1&gt;

&lt;p&gt;What does the network look like? I wanted to play around with the data a bit first just to get a rough feel, so I made an &lt;a href=&quot;http://link-prediction.herokuapp.com/network&quot;&gt;app&lt;/a&gt; to interact with the network around each node.&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s a sample:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://link-prediction.herokuapp.com/network&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_untrimmed.png&quot; alt=&quot;1 Untrimmed Network&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Go ahead, click on the picture to &lt;a href=&quot;http://link-prediction.herokuapp.com/network&quot;&gt;play with the app yourself&lt;/a&gt;. It&amp;#8217;s pretty fun.)&lt;/p&gt;

&lt;p&gt;The node in black is a selected node from the training set, and we perform a breadth-first walk of the graph out to a maximum distance of 3 to uncover the local network. Nodes are sized according to their distance from the center, and colored according to a chosen metric (a personalized PageRank in this case; more on this later).&lt;/p&gt;

&lt;p&gt;We can see that the central node is friends with three other users (in red), two of whom have fairly large, disjoint networks.&lt;/p&gt;

&lt;p&gt;There are quite a few dangling nodes (nodes at distance 3 with only one connection to the rest of the local network), though, so let&amp;#8217;s remove these to reveal the core structure:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://link-prediction.herokuapp.com/network&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_network.png&quot; alt=&quot;1 Network&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here&amp;#8217;s an embedded version you can manipulate inline:&lt;/p&gt;

&lt;iframe width=&quot;600px&quot; height=&quot;500px&quot; src=&quot;http://link-prediction.herokuapp.com/network?for_embed=true&quot;&gt;&lt;/iframe&gt;


&lt;p&gt;Since the default view doesn&amp;#8217;t encode the distinction between following and follower relationships, we can mouse over each node to see who it follows and who it&amp;#8217;s followed by. Here, for example, is the following/follower network of one of the central node&amp;#8217;s friends:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_friend1.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_friend1.png&quot; alt=&quot;1 - Friend1&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The moused-over node is highlighted in black, its friends (users who both follow the node and are followed back in turn) are colored in purple, its followees in teal, and its followers in orange. We can also see that the node shares a friend with the central user (&lt;a href=&quot;http://en.wikipedia.org/wiki/Triadic_closure&quot;&gt;triadic closure&lt;/a&gt;, &lt;em&gt;holla!&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s another network, this time of the friend at the bottom:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_friend2.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_friend2.png&quot; alt=&quot;1 - Friend2&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interestingly, while the first friend had several only-followers (in orange), the second friend has none. (This suggests, perhaps, a node-level feature that measures how follow-hungry a user is&amp;#8230;)&lt;/p&gt;

&lt;p&gt;And here&amp;#8217;s one more node, a little further out (maybe a celebrity, given it has nothing but followers?):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_celebrity.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_celebrity.png&quot; alt=&quot;1 - Celebrity&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The Quiet One&lt;/h2&gt;

&lt;p&gt;Let&amp;#8217;s take a look at another graph, one whose local network is a little smaller:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/4_network.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/4_network.png&quot; alt=&quot;4 Network&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;A Social Butterfly&lt;/h2&gt;

&lt;p&gt;And one more, whose local network is a little larger:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/2_network.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/2_network.png&quot; alt=&quot;2 Network&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/2_friend.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/2_friend.png&quot; alt=&quot;2 Network - Friend&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Again, I encourage everyone to play around with the app &lt;a href=&quot;http://link-prediction.herokuapp.com/network&quot;&gt;here&lt;/a&gt;, and I&amp;#8217;ll come back to the question of coloring each node later.&lt;/p&gt;

&lt;h1&gt;Distributions&lt;/h1&gt;

&lt;p&gt;Next, let&amp;#8217;s take a more quantitative look at the graph.&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s the distribution of the number of followers of each node in the training set (cut off at 50 followers for a better fit &amp;#8211; the maximum number of followers is 552), as well as the number of users each node is following (again, cut off at 50 &amp;#8211; the maximum here is 1566):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/training_full_followers.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/training_full_followers.png&quot; alt=&quot;Training Followers&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/training_full_followees.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/training_full_followees.png&quot; alt=&quot;Training Followees&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nothing terribly surprising, but that alone is good to verify. (For people tempted to mutter about power laws, I&amp;#8217;ll hold you off with the bitter coldness of &lt;a href=&quot;http://cscs.umich.edu/~crshalizi/weblog/491.html&quot;&gt;baby Gauss&amp;#8217;s tears&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;Similarly, here are the same two graphs, but limited to the nodes in the test set alone:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/test_followers.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/test_followers.png&quot; alt=&quot;Test Followers&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/test_followees.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/test_followees.png&quot; alt=&quot;Test Followees&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that there are relatively more test set users with 0 followees than in the full training set, and relatively fewer test set users with 0 followers. This information could be used to better simulate a validation set for model selection, though I didn&amp;#8217;t end up doing this myself.&lt;/p&gt;

&lt;h1&gt;Preliminary Probes&lt;/h1&gt;

&lt;p&gt;Finally, let&amp;#8217;s move on to the models themselves.&lt;/p&gt;

&lt;p&gt;In order to quickly get up and running on a couple prediction algorithms, I started with some unsupervised approaches. For example, after building a new validation set* to test performance offline, I tried:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recommending users who follow you (but you don&amp;#8217;t follow in return)&lt;/li&gt;
&lt;li&gt;Recommending users similar to you (when representing users as sets of their followers, and using cosine similarity and Jaccard similarity as the similarity metrics)&lt;/li&gt;
&lt;li&gt;Recommending users based on a personalized PageRank score&lt;/li&gt;
&lt;li&gt;Recommending users that the people you follow also follow&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;And so on, combining the votes of these algorithms in a fairly ad-hoc way (e.g., by taking the majority vote or by ordering by the number of followers).&lt;/p&gt;

&lt;p&gt;This actually worked quite well, but I&amp;#8217;d been planning from the beginning to move on to a machine-learned, model-based approach, so I did that next.&lt;/p&gt;

&lt;p&gt;*My validation set was formed by deleting random edges from the full training set. A slightly better approach, as mentioned above, might have been to more accurately simulate the distribution of the official test set, but I didn&amp;#8217;t end up trying this out myself.&lt;/p&gt;
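&lt;p&gt;As an illustration of the similarity-based recommender in the list above, here is a minimal Scala sketch that treats each user as the set of their followers and ranks candidates by Jaccard or cosine similarity. The object name, method names, and data layout are assumptions made for this example, not the code used in the contest.&lt;/p&gt;

```scala
// Illustrative sketch only: users are represented as sets of follower ids.
object FollowerSimilarity {
  /** Jaccard similarity: intersection size over union size (0 if both sets are empty). */
  def jaccard(a: Set[Int], b: Set[Int]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
  }

  /** Cosine similarity of the sets viewed as 0/1 vectors: overlap / sqrt(sizeA * sizeB). */
  def cosine(a: Set[Int], b: Set[Int]): Double =
    if (a.isEmpty || b.isEmpty) 0.0
    else (a intersect b).size / math.sqrt(a.size.toDouble * b.size)

  /** Rank other users by similarity to `user`, skipping anyone already followed. */
  def recommend(
      user: Int,
      followers: Map[Int, Set[Int]],
      alreadyFollowed: Set[Int],
      similarity: (Set[Int], Set[Int]) => Double,
      k: Int): Seq[Int] = {
    val mine = followers.getOrElse(user, Set.empty[Int])
    followers.toSeq
      .filter { case (other, _) => other != user }
      .filterNot { case (other, _) => alreadyFollowed.contains(other) }
      .map { case (other, theirs) => (other, similarity(mine, theirs)) }
      .filter { case (_, score) => score > 0.0 }
      // Highest score first; ties are broken by the smaller user id.
      .sortWith { case ((o1, s1), (o2, s2)) => if (s1 != s2) s1 > s2 else o2 > o1 }
      .take(k)
      .map(_._1)
  }
}
```

&lt;p&gt;Passing the metric in as a function makes it easy to swap Jaccard for cosine and compare the two on a validation set.&lt;/p&gt;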

&lt;h1&gt;Candidate Selection&lt;/h1&gt;

&lt;p&gt;In order to run a machine learning algorithm to recommend edges (which would take two nodes, a source and a candidate destination, and generate a score measuring the likelihood that the source would follow the destination), it&amp;#8217;s necessary to prune the set of candidates to run the algorithm on.&lt;/p&gt;

&lt;p&gt;I used two approaches for this filtering step, both based on random walks on the graph.&lt;/p&gt;

&lt;h2&gt;Personalized PageRank&lt;/h2&gt;

&lt;p&gt;The first approach was to calculate a personalized PageRank around each source node.&lt;/p&gt;

&lt;p&gt;Briefly, a personalized PageRank is like standard PageRank, except that when randomly teleporting to a new node, the surfer always teleports back to the given source node being personalized (rather than to a node chosen uniformly at random, as in the classic PageRank algorithm).&lt;/p&gt;

&lt;p&gt;That is, the random surfer in the personalized PageRank model works as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;He starts at the source node $X$ that we want to calculate a personalized PageRank around.&lt;/li&gt;
&lt;li&gt;At step $i$: with probability $p$, the surfer moves to a neighboring node chosen uniformly at random; with probability $1-p$, the surfer instead teleports back to the original source node $X$.&lt;/li&gt;
&lt;li&gt;The limiting probability that the surfer is at node $N$ is then the personalized PageRank score of node $N$ around $X$.&lt;/li&gt;
&lt;/ul&gt;
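&lt;p&gt;As a rough sketch, the walk above can also be written as a recurrence. Let $\pi_i(N)$ be the probability that the surfer is at node $N$ after step $i$, and let $\mathrm{nbrs}(M)$ be the set of neighbors of node $M$; both symbols are notation introduced here for illustration, and the recurrence assumes every node has at least one neighbor. Starting from $\pi_0(X) = 1$ and $\pi_0(N) = 0$ for every other node:&lt;/p&gt;

&lt;p&gt;$$\pi_{i+1}(N) = (1 - p)\,\mathbf{1}[N = X] + p \sum_{M \,:\, N \in \mathrm{nbrs}(M)} \frac{\pi_i(M)}{|\mathrm{nbrs}(M)|}$$&lt;/p&gt;

&lt;p&gt;The first term is the teleport back to $X$, the second is the uniform step from each node $M$ to one of its neighbors, and the personalized PageRank score of $N$ around $X$ is the limit of $\pi_i(N)$ as $i$ grows.&lt;/p&gt;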


&lt;p&gt;Here&amp;#8217;s some Scala code that computes approximate personalized PageRank scores and takes the highest-scoring nodes as the candidates to feed into the machine learning model:&lt;/p&gt;

&lt;figure class='code'&gt;&lt;figcaption&gt;&lt;span&gt;Personalized PageRank&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class='code'&gt;&lt;pre&gt;&lt;code class='scala'&gt;&lt;span class='line'&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Calculate a personalized PageRank around the given user, and return &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * a list of the nodes with the highest personalized PageRank scores.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * @return A list of (node, probability of landing at this node after&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *         running a personalized PageRank for K iterations) pairs.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pageRank&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;c1&quot;&gt;// This map holds the probability of landing at each node, up to the &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;c1&quot;&gt;// current iteration.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]()&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// We start at this user.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pageRankProbs&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pageRankHelper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;NumPagerankIterations&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;pageRankProbs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toList&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;               &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sortBy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;               &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                  &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getFollowings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;contains&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;take&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;MaxNodesToKeep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Simulates a personalized PageRank, one iteration per recursive call.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Parameters:&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * start - the start node to calculate the personalized PageRank around&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * probs - a map from nodes to the probability of being at that node at &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *         the start of the current iteration&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * numIterations - the number of iterations remaining&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * alpha - with probability alpha, we follow a neighbor; with probability&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *         1 - alpha, we teleport back to the start node&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * @return A map of node -&amp;gt; probability of landing at that node after the&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *         specified number of iterations.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pageRankHelper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numIterations&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                   &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numIterations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;c1&quot;&gt;// Holds the updated set of probabilities, after this iteration.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probsPropagated&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]()&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;c1&quot;&gt;// With probability 1 - alpha, we teleport back to the start node.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;n&quot;&gt;probsPropagated&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;c1&quot;&gt;// Propagate the previous probabilities&#8230;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;n&quot;&gt;probs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;foreach&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;prob&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;forwards&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getFollowings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;backwards&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getFollowers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;// With probability alpha, we move to a neighbor (following or follower)&#8230;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;// And each node distributes its current probability equally to &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;// its neighbors.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probToPropagate&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;prob&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;forwards&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;backwards&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;forwards&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toList&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;backwards&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toList&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;foreach&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;neighbor&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;probsPropagated&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;contains&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;neighbor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;          &lt;span class=&quot;n&quot;&gt;probsPropagated&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;neighbor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;n&quot;&gt;probsPropagated&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;neighbor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probToPropagate&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;n&quot;&gt;pageRankHelper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;start&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;probsPropagated&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numIterations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
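
&lt;p&gt;To make a single iteration concrete, here&amp;#8217;s a minimal, self-contained sketch that runs one propagation step on a toy graph. The graph itself, the &lt;code&gt;PageRankSketch&lt;/code&gt; object, the &lt;code&gt;step&lt;/code&gt; function, and these stand-in versions of &lt;code&gt;getFollowings&lt;/code&gt; and &lt;code&gt;getFollowers&lt;/code&gt; are all assumptions made up for illustration, not part of the code above.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;import scala.collection.mutable

object PageRankSketch {
  // Hypothetical toy graph: each user maps to the set of users they follow.
  val followings: Map[Int, Set[Int]] = Map(
    1 -&amp;gt; Set(2, 3),
    2 -&amp;gt; Set(3),
    3 -&amp;gt; Set(4),
    4 -&amp;gt; Set.empty[Int]
  )

  // Stand-ins (assumed) for the graph lookups used above.
  def getFollowings(user: Int): Set[Int] =
    followings.getOrElse(user, Set.empty[Int])

  def getFollowers(user: Int): Set[Int] =
    followings.collect {
      case (follower, followed) if followed.contains(user) =&amp;gt; follower
    }.toSet

  // One iteration: keep 1 - alpha at the start node, then let every node
  // split alpha times its probability equally among its neighbors.
  def step(start: Int, probs: Map[Int, Double], alpha: Double = 0.5): Map[Int, Double] = {
    val next = mutable.Map[Int, Double]().withDefaultValue(0.0)
    next(start) = 1 - alpha
    probs.foreach { case (node, prob) =&amp;gt;
      val neighbors = getFollowings(node).toList ++ getFollowers(node).toList
      neighbors.foreach { neighbor =&amp;gt;
        next(neighbor) += alpha * prob / neighbors.size
      }
    }
    next.toMap
  }

  def main(args: Array[String]): Unit = {
    // All of the probability starts on user 1.
    val afterOne = step(1, Map(1 -&amp;gt; 1.0))
    // User 1 keeps 0.5; users 2 and 3 each receive 0.5 * 1.0 / 2 = 0.25.
    afterOne.toList.sortBy(_._1).foreach(println)
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Calling &lt;code&gt;step&lt;/code&gt; repeatedly, feeding each result back in as &lt;code&gt;probs&lt;/code&gt;, corresponds to what &lt;code&gt;pageRankHelper&lt;/code&gt; does recursively.&lt;/p&gt;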


&lt;h2&gt;Propagation Score&lt;/h2&gt;

&lt;p&gt;Another approach I used, based on &lt;a href=&quot;http://www.kaggle.com/c/FacebookRecruiting/forums/t/2082/0-711-is-the-new-0&quot;&gt;a proposal by another contestant on the Kaggle forums&lt;/a&gt;, works as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start at a specified user node and give it some score.&lt;/li&gt;
&lt;li&gt;In the first iteration, this user propagates its score equally to its neighbors.&lt;/li&gt;
&lt;li&gt;In the second iteration, each user that received a score S keeps a copy of half of it (S/2) and then propagates the full S equally to its neighbors.&lt;/li&gt;
&lt;li&gt;In subsequent iterations, the process is repeated, except that neighbors reached via a backwards link don&amp;#8217;t duplicate and keep half of their score. (The idea is that we want the score to reach followees and not followers.)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Here&amp;#8217;s some Scala code to calculate these propagation scores:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Propagation Score&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Calculate propagation scores around the current user.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * In the first propagation round, we&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * - Give the starting node N an initial score S.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * - Propagate the score equally to each of N&amp;#39;s neighbors (followers &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *   and followings).&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * - Each first-level neighbor then duplicates and keeps half of its score&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *   and then propagates the original again to its neighbors.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * In further rounds, neighbors then repeat the process, except that neighbors &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * traveled to via a backwards/follower link don&amp;#39;t keep half of their score.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * @return a sorted list of (node, propagation score) pairs.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;propagate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]()&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;c1&quot;&gt;// We propagate the score equally to all neighbors.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scoreToPropagate&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1.0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getFollowings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getFollowers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getFollowings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toList&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;++&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getFollowers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toList&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;foreach&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;c1&quot;&gt;// Propagate the score&#8230;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;n&quot;&gt;continuePropagation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scoreToPropagate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;c1&quot;&gt;// &#8230;and make sure it keeps half of it for itself.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getOrElse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scoreToPropagate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toList&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sortBy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;               &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nodeAndScore&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                 &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nodeAndScore&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_1&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                 &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getFollowings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;contains&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;take&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;MaxNodesToKeep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * In further rounds, neighbors repeat the process above, except that neighbors&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * traveled to via a backwards/follower link don&amp;#39;t keep half of their score.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;continuePropagation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                        &lt;span class=&quot;n&quot;&gt;currIteration&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Unit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;currIteration&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;NumIterations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scoreToPropagate&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getFollowings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getFollowers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;n&quot;&gt;getFollowings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;foreach&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;// Propagate the score&#8230;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;n&quot;&gt;continuePropagation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scoreToPropagate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;currIteration&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;// &#8230;and make sure it keeps half of it for itself.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getOrElse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scoreToPropagate&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;n&quot;&gt;getFollowers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;foreach&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;// Propagate the score&#8230;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;n&quot;&gt;continuePropagation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;scoreToPropagate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;currIteration&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;// &#8230;but backward links (except for the starting node&amp;#39;s immediate&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;// neighbors) don&amp;#39;t keep any score for themselves.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;I played around with tweaking some parameters in both approaches (e.g., weighting followers and followees differently), but the natural defaults (as used in the code above) ended up performing the best.&lt;/p&gt;

&lt;h1&gt;Features&lt;/h1&gt;

&lt;p&gt;After pruning the set of candidate destination nodes to a more manageable size, I fed pairs of (source, destination) nodes into a machine learning model. From each pair, I extracted around 30 features in total.&lt;/p&gt;

&lt;p&gt;As mentioned above, one feature that worked quite well on its own was whether the destination node already follows the source.&lt;/p&gt;

&lt;p&gt;I also used a wide set of similarity-based features. One example is the Jaccard similarity between the source and destination, computed three ways: with both represented as sets of their followers, with both represented as sets of their followees, or with one represented as a set of followers and the other as a set of followees.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Similarity Metrics&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;19&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;20&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;21&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;22&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;23&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;24&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;25&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;26&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;27&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;28&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;29&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;30&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;31&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;32&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;33&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;34&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;35&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;36&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;37&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;38&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;39&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;40&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;41&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;42&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;43&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;44&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;45&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;46&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;47&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;48&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;49&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;abstract&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SimilarityMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set1&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;set2&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;object&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;JaccardSimilarity&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SimilarityMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;   * Returns the Jaccard similarity between two sets, 0 if both are empty.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;   */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set1&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;set2&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;union&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;union&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;union&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;set2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toDouble&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;union&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;object&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;CosineSimilarity&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SimilarityMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;   * Returns the cosine similarity between two sets, 0 if either is empty.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;   */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set1&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;set2&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;set2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;set2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toDouble&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toDouble&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;set2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// ************&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// * FEATURES *&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// ************&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Returns the similarity between user1 and user2 when both are represented as&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * sets of followers.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;similarityByFollowers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user1&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user2&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                         &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;implicit&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;similarity&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;SimilarityMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;similarity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getFollowersWithout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                   &lt;span class=&quot;n&quot;&gt;getFollowersWithout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// etc.&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;Along the same lines, I also computed a similarity score between the destination node and the source node&amp;#8217;s followees, and several variations thereof.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Extended Similarity Scores&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Iterate over each of user1&amp;#39;s followings, compute their similarity with&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * user2 when both are represented as sets of followers, and return the &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * sum of these similarities.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followerBasedSimilarityToFollowing&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user1&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user2&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;implicit&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;similarity&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;SimilarityMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;n&quot;&gt;getFollowingsWithout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toSeq&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// toSeq: if this is a Set, map would merge equal scores before the sum&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;similarityByFollowers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;similarity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;Other features included the number of followers and followees of each node, the ratio of these, the personalized PageRank and propagation scores themselves, the number of followers in common, and triangle/closure-type features (e.g., whether the source node is friends with a node X who in turn is a friend of the destination node).&lt;/p&gt;
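&lt;p&gt;To make the count- and closure-type features concrete, here is a minimal Python sketch. The function names (&lt;code&gt;build_graph&lt;/code&gt;, &lt;code&gt;pair_features&lt;/code&gt;), the feature names, and the +1 smoothing in the ratio are assumptions made for this example; it is not the competition code.&lt;/p&gt;

```python
from collections import defaultdict


def build_graph(edges):
    """Return (followees, followers) dicts mapping each node to a set of nodes."""
    followees = defaultdict(set)  # followees[u] = nodes that u follows
    followers = defaultdict(set)  # followers[u] = nodes that follow u
    for src, dst in edges:
        followees[src].add(dst)
        followers[dst].add(src)
    return followees, followers


def pair_features(source, dest, followees, followers):
    """Hypothetical feature dict for a candidate edge source -> dest."""
    # A "friend" is a node connected in both directions.
    friends_of_source = followees[source].intersection(followers[source])
    friends_of_dest = followees[dest].intersection(followers[dest])
    return {
        "source_followers": len(followers[source]),
        "source_followees": len(followees[source]),
        # The +1 avoids division by zero for nodes that follow nobody.
        "source_ratio": len(followers[source]) / (len(followees[source]) + 1),
        "common_followers": len(followers[source].intersection(followers[dest])),
        "dest_follows_source": int(source in followees[dest]),
        # Triangle closure: some friend X of the source is also a friend of dest.
        "friend_of_friend": int(len(friends_of_source.intersection(friends_of_dest)) > 0),
    }
```

&lt;p&gt;Each candidate (source, destination) pair then becomes one row of features for the classifier.&lt;/p&gt;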

&lt;p&gt;If I had had more time, I would probably have tried weighted and more regularized versions of some of these features as well (e.g., downweighting nodes with large numbers of followers when computing cosine similarity scores based on followees, or shrinking the scores of nodes we have little information about).&lt;/p&gt;

&lt;h1&gt;Feature Understanding&lt;/h1&gt;

&lt;p&gt;But what are these features actually &lt;em&gt;doing&lt;/em&gt;? Let&amp;#8217;s use the same app I built before to take a look.&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s the local network of node 317 (different from the node above), where each node is colored by its personalized PageRank (higher scores are in darker red):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_propagation.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_propagation.png&quot; alt=&quot;317 - Personalized PageRank&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we look at the following vs. follower relationships of the central node (recall that purple is friends, teal is followings, orange is followers):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_following_followers.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_following_followers.png&quot; alt=&quot;317 - Following vs. Followers&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&amp;#8230;we can see that, as expected (because edges that represented both following and follower were double-weighted in my PageRank calculation), the darkest red nodes are those that are friends with the central node, while those in a following-only or follower-only relationship have a lower score.&lt;/p&gt;
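&lt;p&gt;The double-weighting effect can be reproduced on a toy graph. The sketch below is a plain power-iteration version of personalized PageRank, written for illustration only: the restart probability &lt;code&gt;alpha&lt;/code&gt;, the iteration count, and the handling of dangling nodes are assumptions, not the settings used in the competition.&lt;/p&gt;

```python
from collections import defaultdict


def personalized_pagerank(edges, source, alpha=0.5, iterations=20):
    """Power iteration for a random walk that restarts at `source`.

    Every directed edge contributes weight 1 in both directions, so a
    mutual (friend) edge ends up with weight 2.
    """
    weights = defaultdict(lambda: defaultdict(float))
    for u, v in edges:
        weights[u][v] += 1.0
        weights[v][u] += 1.0

    scores = {source: 1.0}
    for _ in range(iterations):
        nxt = defaultdict(float)
        nxt[source] += alpha  # teleport back home with probability alpha
        for node, mass in scores.items():
            total = sum(weights[node].values())
            if total == 0:
                # Isolated node: send its remaining mass back home.
                nxt[source] += (1 - alpha) * mass
                continue
            for neighbor, w in weights[node].items():
                nxt[neighbor] += (1 - alpha) * mass * w / total
        scores = dict(nxt)
    return scores
```

&lt;p&gt;With edges &lt;code&gt;[(0, 1), (1, 0), (0, 2), (3, 0)]&lt;/code&gt; and source 0, node 1 (a friend) ends up with twice the score of node 2 (following-only) or node 3 (follower-only).&lt;/p&gt;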

&lt;p&gt;How does the propagation score compare to personalized PageRank? Here, I colored each node according to the log ratio of its propagation score and personalized PageRank:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_log_ratio.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_log_ratio.png&quot; alt=&quot;317 - Log Ratio&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparing this coloring with the local follow/follower network:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_propagation_local.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_propagation_local.png&quot; alt=&quot;317 - Local Network of Node&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&amp;#8230;we can see that followed nodes (in teal) receive a higher propagation weight than friend nodes (in purple), while follower nodes (in orange) receive almost no propagation score at all.&lt;/p&gt;

&lt;p&gt;Going back to node 1, let&amp;#8217;s look at a different metric. Here, each node is colored according to its Jaccard similarity with the source, when nodes are represented by the set of their followers:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_sim_by_followers.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_sim_by_followers.png&quot; alt=&quot;1 - Similarity by Followers&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that, while the PageRank and propagation metrics tended to favor nodes &lt;em&gt;close&lt;/em&gt; to the central node, the Jaccard similarity feature helps us explore nodes that are further out.&lt;/p&gt;

&lt;p&gt;However, if we look at the high-scoring nodes more closely, we see that they often have only a single connection to the rest of the network:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_single_connection.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_single_connection.png&quot; alt=&quot;1 - Single Connection&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In other words, their high Jaccard similarity is due to the fact that they don&amp;#8217;t have many connections to begin with. This suggests that some regularization or shrinking is in order.&lt;/p&gt;

&lt;p&gt;So here&amp;#8217;s a regularized version of Jaccard similarity, where we downweight nodes with few connections:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1-regularized.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/1-regularized.png&quot; alt=&quot;1 - Regularized&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the outlier nodes are much more muted this time around.&lt;/p&gt;
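&lt;p&gt;One simple way to get this kind of shrinkage is to add a constant to the denominator of the Jaccard score. The exact formula is not spelled out above, so treat the form below, and the &lt;code&gt;shrinkage&lt;/code&gt; value of 10, as an assumed illustration rather than the version behind the plots.&lt;/p&gt;

```python
def jaccard(a, b):
    """Plain Jaccard similarity between two sets."""
    union = len(a.union(b))
    return len(a.intersection(b)) / union if union else 0.0


def shrunk_jaccard(a, b, shrinkage=10.0):
    """Jaccard with a constant added to the denominator (an assumed form).

    Pairs with few connections are pulled toward 0, while pairs with many
    connections are barely affected.
    """
    return len(a.intersection(b)) / (len(a.union(b)) + shrinkage)
```

&lt;p&gt;Two nodes that share their single follower score 1.0 under plain Jaccard but only 1/11 after shrinking, whereas two nodes sharing the same 100 followers still score about 0.91.&lt;/p&gt;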

&lt;p&gt;For a starker difference, compare the following two graphs of the Jaccard similarity metric around node 317 (the first graph is an unregularized version, the second is regularized):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_unregularized.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_unregularized.png&quot; alt=&quot;317 - Unregularized&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_regularized.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_regularized.png&quot; alt=&quot;317 - Regularized&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice, in particular, how the popular node in the top left and the popular nodes at the bottom have a much higher score when we regularize.&lt;/p&gt;

&lt;p&gt;And again, there are other networks and features I haven&amp;#8217;t mentioned here, so play around and discover them on the &lt;a href=&quot;http://link-prediction.herokuapp.com/&quot;&gt;app&lt;/a&gt; itself.&lt;/p&gt;

&lt;h1&gt;Models&lt;/h1&gt;

&lt;p&gt;For the machine learning algorithms on top of my features, I experimented with two types of models: logistic regression (using both L1 and L2 regularization) and random forests. (If I had had more time, I would probably have done some more parameter tuning and maybe tried gradient boosted trees as well.)&lt;/p&gt;

&lt;p&gt;So what is a random forest? I wrote a &lt;a href=&quot;http://www.quora.com/Random-Forests/How-do-random-forests-work-in-laymans-terms/answer/Edwin-Chen-1&quot;&gt;layman&amp;#8217;s explanation&lt;/a&gt; of it a while back, but since nobody ever clicks on these links, let&amp;#8217;s copy it over:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Suppose you&amp;#8217;re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you&amp;#8217;ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you&amp;#8217;ll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like &amp;#8220;Is X a romantic movie?&amp;#8221;, &amp;#8220;Does Johnny Depp star in X?&amp;#8221;, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.&lt;/p&gt;&lt;p&gt;    Thus, Willow is a decision tree for your movie preferences.&lt;/p&gt;&lt;p&gt;    But Willow is only human, so she doesn&amp;#8217;t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you&amp;#8217;d like to ask a bunch of your friends, and watch movie X if most of them say they think you&amp;#8217;ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you&amp;#8217;ll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).&lt;/p&gt;&lt;p&gt;    Now you don&amp;#8217;t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you&amp;#8217;re not absolutely sure of your preferences yourself &amp;#8211; you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn&amp;#8217;t use the fact that you liked Titanic in making their recommendations. 
Or maybe you told her you loved Cinderella, but actually you &lt;em&gt;really really&lt;/em&gt; loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don&amp;#8217;t change your love/hate decisions, you just say you love/hate some movies a little more or less (you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you tell Woody that you liked Black Swan so much you watched it twice, you disliked Avatar, and don&amp;#8217;t mention Harry Potter at all.&lt;/p&gt;&lt;p&gt;    By using this ensemble, you hope that while each of your friends gives somewhat idiosyncratic recommendations (Willow thinks you like vampire movies more than you do, Woody thinks you like Pixar movies, and Cartman thinks you just hate everything), the errors get canceled out in the majority. Thus, your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.&lt;/p&gt;&lt;p&gt;    There&amp;#8217;s still one problem with your data, however. While you loved both Titanic and Inception, it wasn&amp;#8217;t because you like movies that star Leonardo DiCaprio. Maybe you liked both movies for other reasons. Thus, you don&amp;#8217;t want your friends to all base their recommendations on whether Leo is in a movie or not. So when each friend asks IMDB a question, only a random subset of the possible questions is allowed (i.e., when you&amp;#8217;re building a decision tree, at each node you use some randomness in selecting the attribute to split on, say by randomly selecting an attribute or by selecting an attribute from a random subset). This means your friends aren&amp;#8217;t allowed to ask whether Leonardo DiCaprio is in the movie whenever they want. 
So whereas previously you injected randomness at the data level, by perturbing your movie preferences slightly, now you&amp;#8217;re injecting randomness at the model level, by making your friends ask different questions at different times.&lt;/p&gt;&lt;p&gt;    And so your friends now form a random forest.&lt;/p&gt;&lt;/blockquote&gt;
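&lt;p&gt;The two sources of randomness in the story (bootstrapped data and a random subset of questions) fit in a few lines of Python. This is a deliberately tiny sketch under simplifying assumptions: each friend is a one-question tree (a decision stump) rather than a full tree, and the names &lt;code&gt;train_stump&lt;/code&gt; and &lt;code&gt;train_forest&lt;/code&gt; are made up for the example.&lt;/p&gt;

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, rng, n_candidate_features=1):
    """Fit a one-question tree, looking only at a random subset of features."""
    features = rng.sample(range(len(rows[0])), n_candidate_features)
    default = majority(labels)
    best = None  # (errors, feature, threshold, label_if_low, label_if_high)
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            low = [y for row, y in zip(rows, labels) if threshold >= row[f]]
            high = [y for row, y in zip(rows, labels) if row[f] > threshold]
            low_label = majority(low)
            high_label = majority(high) if high else default
            errors = sum(y != low_label for y in low) + sum(y != high_label for y in high)
            if best is None or best[0] > errors:
                best = (errors, f, threshold, low_label, high_label)
    _, f, threshold, low_label, high_label = best
    return lambda row: high_label if row[f] > threshold else low_label


def train_forest(rows, labels, n_trees=25, seed=0):
    """Bagged stumps: each one sees a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: sample row indices with replacement (randomness in the data).
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Random feature subset (randomness in the model).
        trees.append(train_stump(sample_rows, sample_labels, rng))
    return lambda row: majority([tree(row) for tree in trees])
```

&lt;p&gt;Calling &lt;code&gt;train_forest(rows, labels)&lt;/code&gt; returns a function that takes one feature row and returns the majority vote of the 25 stumps.&lt;/p&gt;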


&lt;p&gt;Moving on, I essentially trained &lt;a href=&quot;http://scikit-learn.org/stable/&quot;&gt;scikit-learn&lt;/a&gt;&amp;#8217;s classifiers on an equal split of true and false edges (sampled from the output of my pruning step, in order to match the distribution I&amp;#8217;d get when applying my algorithm to the official test set), and compared performance on the validation set I made, with a small amount of parameter tuning:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Random Forest&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;python&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c&quot;&gt;########################################&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c&quot;&gt;# STEP 1: Read in the training examples.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c&quot;&gt;########################################&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;n&quot;&gt;truths&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# A truth is 1 (for a known true edge) or 0 (for a false edge).&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;n&quot;&gt;training_examples&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# Each training example is an array of features.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TRAINING_SET_WITH_FEATURES_FILENAME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;line&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;,&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;truth&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;training_example_features&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:]&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;truths&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;truth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;training_examples&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;training_example_features&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c&quot;&gt;#############################&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c&quot;&gt;# STEP 2: Train a classifier.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c&quot;&gt;#############################&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;n&quot;&gt;rf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;RandomForestClassifier&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n_estimators&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;500&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;oob_score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;n&quot;&gt;rf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;training_examples&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;truths&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;So let&amp;#8217;s look at the variable importance scores as determined by one of my random forest models, which (unsurprisingly) consistently outperformed logistic regression.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/rf-importance-scores.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/rf-importance-scores.png&quot; alt=&quot;Random Forest Importance Scores&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The random forest classifier here is one of my earlier models (using a slightly smaller subset of my full suite of features), where the targeting step consisted of taking the top 25 nodes with the highest propagation scores.&lt;/p&gt;

&lt;p&gt;We can see that the most important variables are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personalized PageRank scores. (I put in both normalized and unnormalized versions, where the normalized versions consisted of taking all the candidates for a particular source node, and scaling them so that the maximum personalized PageRank score was 1.)&lt;/li&gt;
&lt;li&gt;Whether the destination node already follows the source.&lt;/li&gt;
&lt;li&gt;How similar the source node is to the people the destination node is following, when each node is represented as a set of followers. (Note that this is more or less measuring how likely the destination is to follow the source, which we already saw is a good predictor of whether the source is likely to follow the destination.) Plus several variations on this theme (e.g., how similar the destination node is to the source node&amp;#8217;s followers, when each node is represented as a set of followees).&lt;/li&gt;
&lt;/ul&gt;
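&lt;p&gt;The normalization in the first bullet is simple enough to show directly. A minimal sketch, assuming the candidates for one source node are held in a dict from candidate node to personalized PageRank score (the function name is hypothetical):&lt;/p&gt;

```python
def normalize_by_max(candidate_scores):
    """Scale one source node's candidate scores so the largest becomes 1."""
    top = max(candidate_scores.values(), default=0.0)
    if top == 0:
        # No candidates, or all scores are zero: nothing to rescale.
        return dict(candidate_scores)
    return {node: score / top for node, score in candidate_scores.items()}
```

&lt;p&gt;For example, scores of 0.5 and 0.25 become 1.0 and 0.5, which makes candidates comparable across source nodes whose raw scores live on different scales.&lt;/p&gt;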


&lt;h1&gt;Model Comparison&lt;/h1&gt;

&lt;p&gt;How do all of these models compare to each other? Is the random forest model universally better than the logistic regression model, or are there some sets of users for which the logistic regression model actually performs better?&lt;/p&gt;

&lt;p&gt;To enable these kinds of comparisons, I made &lt;a href=&quot;http://link-prediction.herokuapp.com/comparison&quot;&gt;a small module&lt;/a&gt; that allows you to select two models and then visualize their sliced performance.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://link-prediction.herokuapp.com/comparison&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/kaggle-fb/pagerank_vs_is_followed_by_v2.png&quot; alt=&quot;PageRank vs. Is Followed By&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Go ahead, &lt;a href=&quot;http://link-prediction.herokuapp.com/comparison&quot;&gt;play around&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;Above, I grouped all test nodes into buckets based on (the logarithm of) their number of followers, and compared the mean average precision of two algorithms: one that recommends nodes to follow using personalized PageRank alone, and one that recommends nodes that are following the source user but are not followed back in return.&lt;/p&gt;

&lt;p&gt;We see that except for the case of 0 followers (where the &amp;#8220;is followed by&amp;#8221; algorithm can do nothing), the personalized PageRank algorithm gets increasingly better in comparison: at first, the two algorithms have roughly equal performance, but as the source node gets more followers, the personalized PageRank algorithm dominates.&lt;/p&gt;
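&lt;p&gt;For readers who want to reproduce this kind of sliced comparison offline, here is a rough Python sketch. It assumes average precision truncated at &lt;code&gt;k = 10&lt;/code&gt;, natural-log buckets, and a separate bucket for users with no followers; those choices and the function names are illustrative assumptions, not the code behind the app.&lt;/p&gt;

```python
import math
from collections import defaultdict


def average_precision(recommended, actual, k=10):
    """Average precision at k for one user's ranked recommendations."""
    actual = set(actual)
    if not actual:
        return 0.0
    hits, total, seen = 0, 0.0, set()
    for rank, node in enumerate(recommended[:k], start=1):
        # Count each correct node once, at the rank where it first appears.
        if node in actual and node not in seen:
            hits += 1
            total += hits / rank
        seen.add(node)
    return total / min(len(actual), k)


def map_by_follower_bucket(users):
    """users: iterable of (num_followers, recommended, actual) tuples.

    Returns {bucket: mean average precision}, where bucket -1 holds users
    with no followers and bucket b holds users with int(log(followers)) == b.
    """
    buckets = defaultdict(list)
    for num_followers, recommended, actual in users:
        bucket = -1 if num_followers == 0 else int(math.log(num_followers))
        buckets[bucket].append(average_precision(recommended, actual))
    return {b: sum(aps) / len(aps) for b, aps in sorted(buckets.items())}
```

&lt;p&gt;Running this once per algorithm and plotting the two resulting dicts side by side gives a chart of the same shape as the one above.&lt;/p&gt;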

&lt;p&gt;And here&amp;#8217;s an embedded version you can interact with directly:&lt;/p&gt;

&lt;iframe width=&quot;600px&quot; height=&quot;500px&quot; src=&quot;http://link-prediction.herokuapp.com/comparison?for_embed=true&quot;&gt;&lt;/iframe&gt;


&lt;p&gt;Admittedly, building a slicer like this is probably overkill for a Kaggle competition, where the set of variables is fairly limited. But imagine having something similar for a real world model, where new algorithms are tried out every week and we can slice the performance by almost any dimension we can imagine (by geography, to make sure we don&amp;#8217;t improve Australia at the expense of the UK; by user interests, to see where we could improve the performance of topic inference; by number of user logins, to make sure we don&amp;#8217;t sacrifice the performance on new users for the gain of the core).&lt;/p&gt;

&lt;h1&gt;Mathematicians do it with Matrices&lt;/h1&gt;

&lt;p&gt;Let&amp;#8217;s switch directions slightly and think about how we could rewrite our computations in a different, matrix-oriented style. (I didn&amp;#8217;t do this in the competition &amp;#8211; this is more a preview of another post I&amp;#8217;m writing.)&lt;/p&gt;

&lt;h2&gt;Personalized PageRank in Scalding&lt;/h2&gt;

&lt;p&gt;Personalized PageRank, for example, is an obvious fit for a matrix rewrite. Here&amp;#8217;s how it would look in &lt;a href=&quot;http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/&quot;&gt;Scalding&lt;/a&gt;&amp;#8217;s new Matrix library:&lt;/p&gt;

&lt;p&gt;(For those who don&amp;#8217;t know, Scalding is a Hadoop framework that Twitter released at the beginning of the year; see &lt;a href=&quot;http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/&quot;&gt;my post on building a big data recommendation engine in Scalding&lt;/a&gt; for an introduction.)&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Personalized PageRank, Matrix Style&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;19&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;20&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;21&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;22&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;23&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;24&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;25&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;26&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;27&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;28&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;29&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;30&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;31&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;32&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;33&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;34&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;35&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;36&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;37&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;38&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;39&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;40&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;41&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;42&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;43&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;44&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;45&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;46&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;47&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;48&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;49&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;50&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;51&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;52&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;53&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;54&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;55&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;56&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;57&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;58&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;59&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;60&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;61&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;62&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;63&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;64&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;65&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;66&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;67&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;68&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// ***********************************************&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// STEP 1. Load the adjacency graph into a matrix.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// ***********************************************&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;following&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Tsv&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;GraphFilename&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;weight&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// Binary matrix where cell (u1, u2) means that u1 follows u2.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followingMatrix&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;following&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;,&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;,&lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;weight&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// Binary matrix where cell (u1, u2) means that u1 is followed by u2.  &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followerMatrix&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followingMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transpose&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// Note: we could also form this adjacency matrix differently, by placing&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// different weights on the following vs. follower edges.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;undirectedAdjacencyMatrix&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;followingMatrix&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followerMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rowL1Normalize&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// Create a diagonal users matrix (to be used in the &amp;quot;teleportation back&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// home&amp;quot; step).&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usersMatrix&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;following&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unique&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;           &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;weight&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user1&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;           &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;](&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;weight&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// ***************************************************&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// STEP 2. Compute the personalized PageRank scores.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// See http://nlp.stanford.edu/projects/pagerank.shtml&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// for more information on personalized PageRank.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// ***************************************************&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// Compute personalized PageRank by running for three iterations,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// and output the top candidates.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pprScores&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;personalizedPageRank&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;usersMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;undirectedAdjacencyMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usersMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.5&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;n&quot;&gt;pprScores&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;topRowElems&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numCandidates&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Tsv&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;OutputFilename&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Runs numIterations personalized PageRank iterations. The ith row of the&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * result contains the personalized PageRank probabilities around node i.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Note the interpretation: &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *   - with probability 1 - alpha, we go back to where we started.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *   - with probability alpha, we go to a neighbor.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Parameters:&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *   &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *   startMatrix - a (usually diagonal) matrix, where the ith row specifies&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *                 where the ith node teleports back to.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *   adjacencyMatrix - the (row-normalized) adjacency matrix of the graph.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *   prevMatrix - a matrix whose ith row contains the personalized PageRank&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *                probabilities around the ith node.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *   alpha - the probability of moving to a neighbor (as opposed to&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *           teleporting back to the start).&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *   numIterations - the number of personalized PageRank iterations to run. &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;personalizedPageRank&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;startMatrix&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Matrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                         &lt;span class=&quot;n&quot;&gt;adjacencyMatrix&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Matrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                         &lt;span class=&quot;n&quot;&gt;prevMatrix&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Matrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                         &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                         &lt;span class=&quot;n&quot;&gt;numIterations&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Matrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numIterations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;n&quot;&gt;prevMatrix&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;updatedMatrix&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;startMatrix&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                          &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prevMatrix&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adjacencyMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;n&quot;&gt;personalizedPageRank&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;startMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;adjacencyMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;updatedMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numIterations&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;Not only is this matrix formulation a more natural way of expressing the algorithm, but since Scalding (by way of Cascading) supports both local and distributed modes, this code runs just as easily on a Hadoop cluster of thousands of machines (assuming our social network is orders of magnitude larger than the one in the contest) as it does on a sample of data on a laptop. Big data, big matrix style, BOOM.&lt;/p&gt;
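&lt;p&gt;To see the recurrence on something small enough to check by hand, here is a minimal sketch in plain Scala, with nested maps standing in for Scalding&amp;#8217;s Matrix type. The helper names (scale, add, multiply) and the toy three-node graph are assumptions made for illustration; they are not part of the contest code or of the Scalding API.&lt;/p&gt;

```scala
object PersonalizedPageRankSketch {
  // A sparse matrix stored as row -> (column -> value). This is only a
  // stand-in for Scalding's Matrix type, not its actual API.
  type SparseMatrix = Map[Int, Map[Int, Double]]

  private val emptyRow = Map.empty[Int, Double]

  // Multiply every cell by a constant.
  def scale(m: SparseMatrix, c: Double): SparseMatrix =
    m.map { case (i, row) => i -> row.map { case (j, v) => j -> v * c } }

  // Cell-by-cell sum of two matrices.
  def add(a: SparseMatrix, b: SparseMatrix): SparseMatrix =
    (a.keySet ++ b.keySet).toSeq.map { i =>
      val rowA = a.getOrElse(i, emptyRow)
      val rowB = b.getOrElse(i, emptyRow)
      val cols = (rowA.keySet ++ rowB.keySet).toSeq
      i -> cols.map(j => j -> (rowA.getOrElse(j, 0.0) + rowB.getOrElse(j, 0.0))).toMap
    }.toMap

  // Matrix product: cell (i, j) is the sum over k of a(i, k) * b(k, j).
  def multiply(a: SparseMatrix, b: SparseMatrix): SparseMatrix =
    a.map { case (i, row) =>
      val products = row.toSeq.flatMap { case (k, v) =>
        b.getOrElse(k, emptyRow).toSeq.map { case (j, w) => (j, v * w) }
      }
      i -> products.groupBy(_._1).map { case (j, ps) => j -> ps.map(_._2).sum }
    }

  // Same recurrence as the Scalding version above, on plain maps:
  // updated = start * (1 - alpha) + (prev * adjacency) * alpha.
  @annotation.tailrec
  def personalizedPageRank(start: SparseMatrix,
                           adjacency: SparseMatrix,
                           prev: SparseMatrix,
                           alpha: Double,
                           numIterations: Int): SparseMatrix =
    if (numIterations > 0) {
      val updated = add(scale(start, 1 - alpha), scale(multiply(prev, adjacency), alpha))
      personalizedPageRank(start, adjacency, updated, alpha, numIterations - 1)
    } else prev

  // Toy undirected path graph 1 - 2 - 3, already row-L1-normalized.
  val adjacency: SparseMatrix = Map(
    1 -> Map(2 -> 1.0),
    2 -> Map(1 -> 0.5, 3 -> 0.5),
    3 -> Map(2 -> 1.0))

  // Diagonal start matrix: each node teleports back to itself.
  val start: SparseMatrix =
    Map(1 -> Map(1 -> 1.0), 2 -> Map(2 -> 1.0), 3 -> Map(3 -> 1.0))

  def main(args: Array[String]): Unit =
    personalizedPageRank(start, adjacency, start, 0.5, 3).toSeq.sortBy(_._1).foreach {
      case (user, scores) => println(s"$user: ${scores.toSeq.sortBy(_._1)}")
    }
}
```

&lt;p&gt;Because the adjacency matrix is row-normalized and the start matrix is diagonal, every row stays a probability distribution after each iteration, which makes for a handy sanity check.&lt;/p&gt;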

&lt;h2&gt;Cosine Similarity as L2-Normalized Multiplication&lt;/h2&gt;

&lt;p&gt;Here&amp;#8217;s another example. Calculating cosine similarity between all users is a natural fit for a matrix formulation since, after all, the cosine similarity between two vectors is just their L2-normalized dot product:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Cosine Similarity, Matrix Style&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// A matrix where the cell (i, j) is 1 iff user i is followed by user j.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followerMatrix&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&#8230;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// A matrix where cell (i, j) holds the cosine similarity between&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// user i and user j, when both are represented as sets of their followers.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followerBasedSimilarityMatrix&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;followerMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rowL2Normalize&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followerMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rowL2Normalize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transpose&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
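&lt;p&gt;For readers who want to check the identity on a toy example, here is a small sketch in plain Scala. The Vec type, the helper names, and the follower ids are all made up for illustration, and maps stand in for the rows of followerMatrix; this is not the Scalding API.&lt;/p&gt;

```scala
object CosineSimilaritySketch {
  // A sparse vector: follower id -> weight (1.0 means "is a follower").
  type Vec = Map[Int, Double]

  // Dot product over the keys the two vectors share.
  def dot(a: Vec, b: Vec): Double =
    a.toSeq.map { case (k, v) => v * b.getOrElse(k, 0.0) }.sum

  // Scale a vector to unit L2 length. Assumes the vector is non-zero.
  def l2Normalize(v: Vec): Vec = {
    val norm = math.sqrt(dot(v, v))
    v.map { case (k, x) => k -> x / norm }
  }

  // Cosine similarity is the dot product of the L2-normalized vectors.
  def cosine(a: Vec, b: Vec): Double = dot(l2Normalize(a), l2Normalize(b))

  // All-pairs version: normalize every row once, then take dot products.
  // This mirrors rowL2Normalize * rowL2Normalize.transpose.
  def similarityMatrix(rows: Map[Int, Vec]): Map[(Int, Int), Double] = {
    val normalized = rows.map { case (u, v) => u -> l2Normalize(v) }
    normalized.toSeq.flatMap { case (u1, v1) =>
      normalized.toSeq.map { case (u2, v2) => (u1, u2) -> dot(v1, v2) }
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // User 1 is followed by {10, 11}; user 2 is followed by {11, 12}.
    val rows: Map[Int, Vec] = Map(
      1 -> Map(10 -> 1.0, 11 -> 1.0),
      2 -> Map(11 -> 1.0, 12 -> 1.0))
    // One shared follower out of two each, so the similarity is about 0.5.
    println(similarityMatrix(rows)((1, 2)))
  }
}
```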


&lt;h2&gt;A Similarity Extension&lt;/h2&gt;

&lt;p&gt;But let&amp;#8217;s go one step further.&lt;/p&gt;

&lt;p&gt;To change examples for ease of exposition: suppose you&amp;#8217;ve bought a bunch of books on Amazon, and Amazon wants to recommend a new book you&amp;#8217;ll like. Since Amazon knows similarities between all pairs of books, one natural way to generate this recommendation is to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take every book B.&lt;/li&gt;
&lt;li&gt;Calculate the similarity between B and each book you bought.&lt;/li&gt;
&lt;li&gt;Sum up all these similarities to get your recommendation score for B.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;In other words, the recommendation score for book B on user U is:&lt;/p&gt;

&lt;p&gt;DidUserBuy(U, Book 1) * SimilarityBetween(Book B, Book 1) + DidUserBuy(U, Book 2) * SimilarityBetween(Book B, Book 2) + &amp;#8230; + DidUserBuy(U, Book n) * SimilarityBetween(Book B, Book n)&lt;/p&gt;

&lt;p&gt;This, too, is a dot product! So it can also be rewritten as a matrix multiplication:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// A matrix where cell (i, j) holds the similarity between books i and j.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bookSimilarityMatrix&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&#8230;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// A matrix where cell (i, j) is 1 if user i has bought book j, &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// and 0 otherwise.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;userPurchaseMatrix&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&#8230;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// A matrix where cell (i, j) holds the recommendation score of&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// book j to user i.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;recommendationMatrix&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;userPurchaseMatrix&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bookSimilarityMatrix&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
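&lt;p&gt;As a worked example with made-up numbers, the sketch below computes one row of that product in plain Scala. The three books, their similarity values, and the recommendationScores helper are hypothetical; maps stand in for the matrix rows.&lt;/p&gt;

```scala
object RecommendationSketch {
  // One sparse matrix row: book -> value.
  type Row = Map[String, Double]

  // Hypothetical, symmetric similarities between three books.
  val bookSimilarity: Map[String, Row] = Map(
    "A" -> Map("B" -> 0.5, "C" -> 0.25),
    "B" -> Map("A" -> 0.5, "C" -> 0.125),
    "C" -> Map("A" -> 0.25, "B" -> 0.125))

  // One row of userPurchaseMatrix * bookSimilarityMatrix: for each book,
  // sum DidUserBuy(bought) * Similarity(bought, book) over the bought books.
  def recommendationScores(purchases: Row, similarity: Map[String, Row]): Row =
    purchases.toSeq
      .flatMap { case (bought, didBuy) =>
        similarity.getOrElse(bought, Map.empty[String, Double]).toSeq
          .map { case (book, sim) => (book, didBuy * sim) }
      }
      .groupBy(_._1)
      .map { case (book, terms) => book -> terms.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // The user bought books A and B.
    val purchases: Row = Map("A" -> 1.0, "B" -> 1.0)
    // Book C scores 0.25 + 0.125 = 0.375.
    println(recommendationScores(purchases, bookSimilarity).toSeq.sortBy(_._1))
  }
}
```

&lt;p&gt;Note that books the user already owns also receive a score here; a real recommender would presumably filter those out afterwards.&lt;/p&gt;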


&lt;p&gt;Of course, there&amp;#8217;s a natural analogy between this score and the feature I described earlier, where I compute a similarity score between a destination node and a source node&amp;#8217;s followees (when all nodes are represented as sets of followers):&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;19&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;20&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;21&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;22&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;23&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;24&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;25&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;26&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Iterate over each of user1&amp;#39;s followings, compute their similarity&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * with user2 when both are represented as sets of followers, and return&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * the sum of these similarities.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followerBasedSimilarityToFollowings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user1&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user2&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;implicit&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;similarity&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;SimilarityMetric&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;getFollowingsWithout&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;similarityByFollowers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;similarity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;                      &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * The matrix version of the above function.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Why are these the same? Note that the above function simply computes:&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *   DoesUserFollow(User A, User 1) * Similarity(User 1, User B) + &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *     DoesUserFollow(User A, User 2) * Similarity(User 2, User B) + &#8230; + &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; *     DoesUserFollow(User A, User n) * Similarity(User n, User B)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followingMatrix&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&#8230;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followerBasedSimilarityMatrix&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;followerMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rowL2Normalize&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followerMatrix&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rowL2Normalize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transpose&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followerBasedSimilarityToFollowingsMatrix&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;followingMatrix&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;followerBasedSimilarityMatrix&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;For people comfortable thinking in vectors, writing computations as matrix manipulations often makes experimenting with different algorithms much more fluid. Imagine, for example, that you want to switch from L1 normalization to L2 normalization, or that you want to express your objects as binary sets rather than weighted vectors. Both of these become simple one-line changes when you have vectors and matrices as first-class objects, but are much more tedious (&lt;em&gt;especially in the MapReduce land where this matrix library was designed to be applied!&lt;/em&gt;) when you don&amp;#8217;t.&lt;/p&gt;
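&lt;p&gt;Here is a rough sketch of what such a one-line change can look like, in plain Scala rather than the matrix library itself. The names (l1Norm, l2Norm, normalize, similarity, binarize) are hypothetical and chosen only for this illustration.&lt;/p&gt;

```scala
object NormalizationSwapSketch {
  // A sparse weighted vector: dimension -> weight.
  type Vec = Map[Int, Double]

  // Two interchangeable norms.
  val l1Norm: Vec => Double = v => v.values.toSeq.map(x => math.abs(x)).sum
  val l2Norm: Vec => Double = v => math.sqrt(v.values.toSeq.map(x => x * x).sum)

  // Divide a vector by the chosen norm. Assumes the vector is non-zero.
  def normalize(norm: Vec => Double)(v: Vec): Vec = {
    val n = norm(v)
    v.map { case (k, x) => k -> x / n }
  }

  // Dot product of the two vectors after normalizing with the chosen norm.
  def similarity(norm: Vec => Double)(a: Vec, b: Vec): Double = {
    val na = normalize(norm)(a)
    val nb = normalize(norm)(b)
    na.toSeq.map { case (k, x) => x * nb.getOrElse(k, 0.0) }.sum
  }

  // Treat a weighted vector as a binary set by dropping the weights.
  def binarize(v: Vec): Vec = v.map { case (k, _) => k -> 1.0 }

  def main(args: Array[String]): Unit = {
    val a: Vec = Map(1 -> 3.0, 2 -> 1.0)
    val b: Vec = Map(2 -> 2.0, 3 -> 2.0)
    // Switching norms, or switching to binary sets, changes one call each.
    println(similarity(l1Norm)(a, b))
    println(similarity(l2Norm)(a, b))
    println(similarity(l2Norm)(binarize(a), binarize(b)))
  }
}
```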

&lt;h1&gt;Finish Line&lt;/h1&gt;

&lt;p&gt;By now, I think I&amp;#8217;ve spent more time writing this post than on the contest itself, so let&amp;#8217;s wrap up.&lt;/p&gt;

&lt;p&gt;I often get asked what kinds of tools I like to use, so for this competition my kit consisted of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scala, for code that needed to be fast (e.g., extracting features) or that I was going to run repeatedly (e.g., scoring my validation set).&lt;/li&gt;
&lt;li&gt;Python, for my machine learning models, because &lt;a href=&quot;http://scikit-learn.org/stable/&quot;&gt;scikit-learn&lt;/a&gt; is awesome.&lt;/li&gt;
&lt;li&gt;Ruby, for quick one-off scripts.&lt;/li&gt;
&lt;li&gt;R, for some data analysis and simple plotting.&lt;/li&gt;
&lt;li&gt;CoffeeScript and d3, for the interactive visualizations.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Finally, I put up a &lt;a href=&quot;https://github.com/echen/link-prediction&quot;&gt;GitHub repository&lt;/a&gt; containing some code, and here are a couple of other posts I&amp;#8217;ve written that people who like this entry might also enjoy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://blog.echen.me/2011/09/07/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/&quot;&gt;Information transmission in a social network&lt;/a&gt;, a case study in how information propagates through a social graph.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/&quot;&gt;Movie recommendations in Scalding&lt;/a&gt;, Twitter&amp;#8217;s Scala-based Hadoop framework built on top of Cascading.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/&quot;&gt;A summary of the algorithms behind the Netflix Prize&lt;/a&gt;, another crowdsourced recommendation contest for predicting movie ratings.&lt;/li&gt;
&lt;/ul&gt;

</content>
  </entry>
  
  <entry>
    <title>Soda vs. Pop with Twitter</title>
    <link href="http://blog.echen.me/2012/07/06/soda-vs-pop-with-twitter/"/>
    <updated>2012-07-06T10:51:00-07:00</updated>
    <id>http://blog.echen.me/2012/07/06/soda-vs-pop-with-twitter</id>
    <content type="html">&lt;p&gt;One of the great things about Twitter is that it&amp;#8217;s a global conversation anyone can join anytime. Eavesdropping on the world, what what!&lt;/p&gt;

&lt;p&gt;Of course, it gets even better when you can &lt;em&gt;mine&lt;/em&gt; all this chatter to study the way humans live and interact.&lt;/p&gt;

&lt;p&gt;For example, &lt;a href=&quot;http://blog.echen.me/2011/04/18/twifferences-between-californians-and-new-yorkers/&quot;&gt;how do people in New York City differ from those in Silicon Valley?&lt;/a&gt; We tend to think they&amp;#8217;re more financially driven and restless with the world &amp;#8211; is this true, and if so, &lt;a href=&quot;http://blog.echen.me/2011/04/18/twifferences-between-californians-and-new-yorkers/&quot;&gt;how much more&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;Or how does language change as you travel to different regions? Recall the classic soda vs. pop vs. coke question: some people use the word &amp;#8220;soda&amp;#8221; to describe their soft drinks, others use &amp;#8220;pop&amp;#8221;, and still others use &amp;#8220;coke&amp;#8221;. Who says what where?&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s take a look.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/sodapop/united-states.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/sodapop/united-states.png&quot; alt=&quot;United States&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To make this map, I sampled geo-tagged tweets containing the words &amp;#8220;soda&amp;#8221;, &amp;#8220;pop&amp;#8221;, or &amp;#8220;coke&amp;#8221;, applied some state-of-the-art NLP technology to ensure the tweets were soft drink related (e.g., the tweets had to contain &amp;#8220;drink soda&amp;#8221; or &amp;#8220;drink a pop&amp;#8221;), and tried to filter out coke tweets that were specifically about the Coke brand (e.g., Coke Zero).&lt;/p&gt;

&lt;p&gt;It&amp;#8217;s a little cluttered, though, so let&amp;#8217;s clean it up by aggregating nearby tweets.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/sodapop/united_states_binned.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/sodapop/united_states_binned.png&quot; alt=&quot;United States Binned&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, I bucketed all tweets within a 0.333 latitude/longitude radius, calculated the term distribution within each bucket, and colored each bucket with the word furthest from its overall mean. I also sized each point according to the (log-transformed) number of tweets in the bucket.&lt;/p&gt;
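
&lt;p&gt;A rough sketch of this kind of bucketing in Python might look like the following. It is an illustration under stated assumptions, not the code behind the maps: it snaps each tweet to a 0.333-degree grid cell (an approximation of radius-based grouping), it reads &amp;#8220;furthest from its overall mean&amp;#8221; as the term whose share in the bucket most exceeds its share across all tweets, and the function names and sample tweets are made up for the example.&lt;/p&gt;

```python
import math
from collections import Counter, defaultdict

CELL = 0.333  # bucket size in degrees of latitude/longitude


def bucket_key(lat, lon, cell=CELL):
    """Snap a coordinate to the grid cell that contains it."""
    return (math.floor(lat / cell), math.floor(lon / cell))


def color_buckets(tweets):
    """tweets: iterable of (lat, lon, term) tuples.

    Returns {bucket: (term, log_size)}, where term is the word whose share
    in the bucket most exceeds its overall share across all tweets.
    """
    tweets = list(tweets)
    overall = Counter(term for _, _, term in tweets)
    total = sum(overall.values())

    buckets = defaultdict(Counter)
    for lat, lon, term in tweets:
        buckets[bucket_key(lat, lon)][term] += 1

    result = {}
    for key, counts in buckets.items():
        n = sum(counts.values())
        # Deviation of each term's local share from its overall share.
        best = max(overall, key=lambda t: counts[t] / n - overall[t] / total)
        # Log-transformed tweet count, usable as a point size.
        result[key] = (best, math.log(1 + n))
    return result


# Hypothetical sample: three tweets near New York, two near Chicago, one near Atlanta.
sample = [
    (40.7, -74.0, "soda"), (40.8, -74.1, "soda"), (40.75, -73.95, "pop"),
    (41.9, -87.6, "pop"), (41.85, -87.65, "pop"),
    (33.7, -84.4, "coke"),
]
colors = color_buckets(sample)
```

&lt;p&gt;A fixed grid is the simplest way to aggregate nearby points; a true radius-based grouping would need a distance computation between tweets and bucket centers.&lt;/p&gt;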

&lt;p&gt;We can see that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The South is pretty Coke-heavy.&lt;/li&gt;
&lt;li&gt;Soda belongs to the Northeast and far West.&lt;/li&gt;
&lt;li&gt;Pop gets the Midwest, except for some interesting spots of blue around Wisconsin and the Illinois-Missouri border.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;For comparison, here&amp;#8217;s another map based on a survey at &lt;a href=&quot;http://www.popvssoda.com/&quot;&gt;popvssoda.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/sodapop/popvssoda.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/sodapop/popvssoda.png&quot; alt=&quot;Pop vs. Soda Map&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see similar patterns, though interestingly, our map has less Coke in the Southeast and less pop in the Northwest.&lt;/p&gt;

&lt;p&gt;Finally, here&amp;#8217;s a world map of the terms, bucketed again. Notice that &amp;#8220;pop&amp;#8221; seems to be prevalent only in parts of the United States and Canada.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/sodapop/world-map.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/sodapop/world-map.png&quot; alt=&quot;World&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As some astute readers noted, though, the seeming dominance of coke is probably due to the difficulty of distinguishing the generic use of coke for soft drinks in general from its particular use for the Coca-Cola brand.&lt;/p&gt;

&lt;p&gt;So let&amp;#8217;s instead look at a world map of a couple other soft drink terms (&amp;#8220;fizzy drink&amp;#8221;, &amp;#8220;mineral&amp;#8221;, and &amp;#8220;tonic&amp;#8221;):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://dl.dropbox.com/u/10506/blog/sodapop/fizzy-mineral-tonic.png&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/sodapop/fizzy-mineral-tonic.png&quot; alt=&quot;Fizzy Drink vs. Mineral vs. Tonic&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&amp;#8220;Fizzy drink&amp;#8221; shows up for the UK, New Zealand, and Maine.&lt;/li&gt;
&lt;li&gt;&amp;#8220;Tonic&amp;#8221; appears in Massachusetts.&lt;/li&gt;
&lt;li&gt;While South Africa gets &amp;#8220;fizzy drink&amp;#8221;, Nigeria gets &amp;#8220;mineral&amp;#8221;.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;I&amp;#8217;ve been getting a lot of questions lately about interesting things you can do with the Twitter API, so this was just one small project I&amp;#8217;ve worked on to illustrate. &lt;a href=&quot;http://www.cc.gatech.edu/~jeisenst/papers/emnlp2010.pdf&quot;&gt;This paper&lt;/a&gt; contains another awesome application of Twitter data to geographic language variation, and just for fun, here are a few other cute mini-projects:&lt;/p&gt;

&lt;p&gt;What do people eat during the Super Bowl? (wings and beer, apparently)&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/echen/status/166343879547822080&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/sodapop/superbowl-snacks.png&quot; alt=&quot;Superbowl Snacks&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What do people want for Christmas, compared to what they actually get?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/echen/status/153683967315419136&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/blog/sodapop/xmas.png&quot; alt=&quot;Christmas&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What do guys and girls &lt;em&gt;really&lt;/em&gt; say?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/echen/status/261667822793551873/photo/1&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/twitter/shitguysandgirlssay.png&quot; alt=&quot;Shit Guys and Girls Say&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When were people losing and gaining power during Hurricane Sandy? (&lt;a href=&quot;http://blog.echen.me/hurricane-sandy-outages/&quot;&gt;click&lt;/a&gt; the image to interact)&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://blog.echen.me/hurricane-sandy-outages/&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/twitter/sandy-outages.png&quot; alt=&quot;Sandy Power Outages&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How does &lt;em&gt;geographically&lt;/em&gt; specific information spread? (&lt;a href=&quot;http://hurricanesandy.herokuapp.com/&quot;&gt;click&lt;/a&gt; the image to see a dynamic visualization of when and where tweets related to surviving Hurricane Sandy were shared)&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://hurricanesandy.herokuapp.com/&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/twitter/sandy-spread.png&quot; alt=&quot;Hurricane Sandy Retweets&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Can we use Twitter to measure presidential votes? (yes!)&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/echen/status/265894918382305284/photo/1&quot;&gt;&lt;img src=&quot;https://dl.dropbox.com/u/10506/twitter/electoral-map.png&quot; alt=&quot;Electoral Map&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Making the Most of Mechanical Turk: Tips and Best Practices</title>
    <link href="http://blog.echen.me/2012/04/25/making-the-most-of-mechanical-turk-tips-and-best-practices/"/>
    <updated>2012-04-25T13:50:00-07:00</updated>
    <id>http://blog.echen.me/2012/04/25/making-the-most-of-mechanical-turk-tips-and-best-practices</id>
    <content type="html">&lt;p&gt;(Update: we recently open-sourced &lt;a href=&quot;https://github.com/twitter/clockworkraven&quot;&gt;Clockwork Raven&lt;/a&gt;, one of the human evaluation tools on top of Mechanical Turk that we built at Twitter.)&lt;/p&gt;

&lt;p&gt;Big data&amp;#8217;s all the rage, but sometimes a couple thousand &lt;em&gt;human&lt;/em&gt;-generated labels can be pretty effective as well. And since I&amp;#8217;ve been using Amazon&amp;#8217;s Mechanical Turk system a lot recently, I figured I&amp;#8217;d share some of the things I&amp;#8217;ve learned.&lt;/p&gt;

&lt;h1&gt;What is MTurk?&lt;/h1&gt;

&lt;p&gt;&lt;a href=&quot;https://www.mturk.com/mturk/welcome&quot;&gt;Mechanical Turk&lt;/a&gt; is a crowdsourcing system developed by Amazon that connects you to a relatively cheap source of human labor on the fly.&lt;/p&gt;

&lt;p&gt;For example, suppose you have 10,000 websites that you want to classify as spam or not. To get these classifications, you (the &lt;em&gt;Requester&lt;/em&gt;):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a CSV file containing the links and any other information.&lt;/li&gt;
&lt;li&gt;Log onto MTurk and create a &lt;em&gt;HIT&lt;/em&gt; (Human Intelligence Task) describing the job (possibly by using Amazon&amp;#8217;s WYSIWYG editor or writing your own HTML, which can refer to columns in your CSV). [There&amp;#8217;s also an MTurk API, if you don&amp;#8217;t want to use the terrible UI.]&lt;/li&gt;
&lt;li&gt;Within hours of starting the task, your judgments will be completed by &lt;em&gt;Turkers&lt;/em&gt; around the world for pennies each.&lt;/li&gt;
&lt;/ol&gt;


&lt;h1&gt;More Example Tasks&lt;/h1&gt;

&lt;p&gt;So what can you use MTurk for? Here are three of my favorite uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://boingboing.net/2011/02/18/straight-line-traced.html&quot;&gt;A Sequence of Lines Consecutively Traced by Five Hundred Individuals&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/mturk/lines.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/mturk/lines.png&quot; alt=&quot;Lines Mutation&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://www.thesheepmarket.com/&quot;&gt;The Sheep Market&lt;/a&gt;: asking Turkers to draw sheep&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/mturk/sheep.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/mturk/sheep.png&quot; alt=&quot;Sheep&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://groups.csail.mit.edu/uid/deneme/?p=329&quot;&gt;Blurry Text Transcription&lt;/a&gt; (Seriously! How is this possible?!)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/mturk/blurry.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/mturk/blurry.png&quot; alt=&quot;Blurry Text&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here are some more practical tasks, from HITs running right now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Categorize the sentiment of a tweet towards Panera Bread&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/mturk/panera.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/mturk/panera.png&quot; alt=&quot;Panera&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy text from a business card&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/mturk/business-card.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/mturk/business-card.png&quot; alt=&quot;Business Card&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Judge entity relatedness&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/mturk/entity.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/mturk/entity.png&quot; alt=&quot;Angelina Jolie&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Increasing the quality of your judgments&lt;/h1&gt;

&lt;p&gt;So what will the quality of your judgments look like?&lt;/p&gt;

&lt;p&gt;If you don&amp;#8217;t do anything special, then your output will contain a lot of garbage. I&amp;#8217;ve thrown out entire tasks because of scammers who spend less than 5 seconds on each judgment (Amazon records the time each worker spends) and submit random clicks as output (e.g., labeling Nike as a food category).&lt;/p&gt;
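
&lt;p&gt;Since the time spent on each judgment is recorded, one simple check is to flag workers whose typical time per judgment is implausibly short. The sketch below is a minimal illustration: the (worker, seconds) input format, the function name, and the choice of the median are assumptions for the example, not anything Amazon provides.&lt;/p&gt;

```python
from collections import defaultdict
from statistics import median


def flag_fast_workers(judgments, min_seconds=5.0):
    """judgments: iterable of (worker_id, seconds_spent) pairs.

    Returns the set of workers whose median time per judgment is below
    min_seconds.
    """
    times = defaultdict(list)
    for worker_id, seconds in judgments:
        times[worker_id].append(seconds)
    return {w for w, ts in times.items() if min_seconds > median(ts)}


# Hypothetical data: worker A rushes most judgments, worker B does not.
suspects = flag_fast_workers(
    [("A", 2), ("A", 3), ("A", 40), ("B", 30), ("B", 4)]
)
```

&lt;p&gt;The median is used so that a single slow or fast judgment does not hide or trigger a flag; a flagged worker is only a candidate for a closer look, not proof of scamming.&lt;/p&gt;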

&lt;p&gt;Luckily, Amazon provides a few worker filters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can require that only Turkers with (say) &lt;strong&gt;at least a 99% approval rate on at least 10,000 past judgments&lt;/strong&gt; are allowed to work on your task. (If you see bad judgments from a worker, you can reject them and get your money back.)&lt;/li&gt;
&lt;li&gt;About a year ago, Amazon launched a &lt;strong&gt;&amp;#8220;categorization masters&amp;#8221; and &amp;#8220;photo masters&amp;#8221;&lt;/strong&gt; program, which allows only masters to work on your HITs. According to a chat with a member of the MTurk team, Amazon assigns these master badges by creating special tasks (anonymously, and for which Amazon already knows the answer) and measuring the quality of each worker&amp;#8217;s response to these tasks.&lt;/li&gt;
&lt;li&gt;You can also create a custom filter and &lt;strong&gt;handpick who gets allowed to work for you&lt;/strong&gt;, or set up a &lt;strong&gt;qualification test&lt;/strong&gt; that workers are required to take before working on your tasks.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;I&amp;#8217;ve used different combinations of the first two filters, and gotten excellent results &amp;#8211; compared to in-house judges I&amp;#8217;ve worked with in person and paid \$20-30 an hour, the judgments on Mechanical Turk have been just as good and sometimes even better. (I often ask my judges to explain their judgments, which makes it easy to detect high quality workers.) For example, here are some typical responses I&amp;#8217;ve received when asking judges to determine which of two products a given Twitter user might be more interested in:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;The user is a female obsessed with Twilight Movies and Rob Pattinson. She tweets and follows both subjects. Movie tickets would be interesting to her.&lt;/p&gt;&lt;/blockquote&gt;




&lt;blockquote&gt;&lt;p&gt;He doesn&amp;#8217;t seem to play video games, and he doesn&amp;#8217;t seem technical enough to care about running Windows on a Mac. Neither of these products are a good fit for him.&lt;/p&gt;&lt;/blockquote&gt;


&lt;p&gt;In fact, I&amp;#8217;ll frequently also get emails from Turkers giving me suggestions on how to improve my tasks or asking how they can do them better. (Amazon allows workers to email you. The only way for the requester to initiate a conversation, though, is by paying the worker a small bonus for excellent work, and including a message with the bonus.) Here are excerpts from some emails I&amp;#8217;ve received:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;I just wanted to check in to be sure that once I figured things out that I was doing your hits the way you intended them to be done. I want to be sure that you are getting the data that you need from the work. Please do not hesitate to let me know if there is anything that I can do to improve the way I am working your HITs. This is my full time job while I stay at home with my kids, so I like to check with the requesters to be sure that I am putting out the work that they are looking for. Any suggestion is welcome.&lt;/p&gt;&lt;/blockquote&gt;




&lt;blockquote&gt;&lt;p&gt;Frankly, lingerie, makeup, and feminine hygiene are the only male-exclusionary topics I can think of, and it feels knee-jerk sexist to mark any sports-related site for men. That said, should I hew more closely to gender stereotypes or be politically correct? (from a HIT where I was gathering gender classification data)&lt;/p&gt;&lt;/blockquote&gt;




&lt;blockquote&gt;&lt;p&gt;I do think a few more categories are needed but keeping the number down overall is good - 50 or 60 to choose from can be overwhelming and not worth the time. I may have mentioned I never used the Photography one (and I did a lot of those) so that is a good candidate for elimination.&lt;/p&gt;&lt;/blockquote&gt;


&lt;p&gt;That said, despite the approval rate filters and masters badges, I do occasionally get a couple of scammers in the mix (or even just judges whose work isn&amp;#8217;t as good). So one suggestion is to run an initial task with these filters applied, find the workers with the best quality, and from then on use a custom pool containing these Turkers alone.&lt;/p&gt;

&lt;h1&gt;How much to pay&lt;/h1&gt;

&lt;p&gt;So how much should you pay your workers?&lt;/p&gt;

&lt;p&gt;New Turkers and Turkers who don&amp;#8217;t meet the strict filters can be paid less, but most of my high-quality workers expect to make about \$8-14 an hour. (You can only specify how much you pay per judgment, but Amazon will tell you how long each item ends up taking on average.) For example, here&amp;#8217;s what several Turkers said when I asked them directly how much they make:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Most of the work I do is either writing or editing.  When editing work is available, I make \$15-20 per hour.  I&amp;#8217;m a slower writer than an editor, so I average \$10-12 per hour with writing.  I also judge sentiments of messages and average about \$8 per hour with that type of work. I would like to average a minimum of no less than \$8 per hour.&lt;/p&gt;&lt;p&gt;A big factor in deciding to do a task or not comes from the time investment involved. The two big time sinks are either googling/searching/having to go to another site, or having to write something as part of your reply. If I remember correctly, a) your tasks did require looking at another page but either the link was right there OR, better yet, you had that page embedded in the HIT itself so clicking out of the window wasn&amp;#8217;t necessary (turkers get very excited about this), and b) the quality of the pay rate was such that it easily outweighed the time it took to leave an explanatory comment. &lt;/p&gt;&lt;p&gt;For me at least, those things can&amp;#8217;t be underestimated. Sure, your tasks may be a little time-consuming, but I figure a good task is one I can make 10 to 12 cents a minute on. Your task might take longer but I&amp;#8217;m definitely still coming out ahead.&lt;/p&gt;&lt;p&gt;From my own experience, I work hardest and best for a requester that pays well and doesn&amp;#8217;t reject (or at least seems to have a reason for a rejection when it happens). If a requester is going to accept the majority of my work, I as a worker feel that obligates me to provide them with the best quality possible. Similarly, although I&amp;#8217;m conscientious with all tasks, I&amp;#8217;m especially so with a high-paying one: it would be easy to take advantage of a high-pay, low-reject requester - which would ultimately lead them to either lower the pay or change the acceptance criteria. I don&amp;#8217;t want that!! 
That&amp;#8217;s the kind of requester I want around. I&amp;#8217;m grateful for high pay and fair policies and that kind of requester gets an above-and-beyond effort from me.&lt;/p&gt;&lt;/blockquote&gt;




&lt;blockquote&gt;&lt;p&gt;For the pay, I have worked on master&amp;#8217;s hits that have ranged from \$6-\$16 per hour. Averaging them out works out to around \$9, which isn&amp;#8217;t a bad wage. I have two requesters that I work for that don&amp;#8217;t use the master qualification but instead have closed qualifications that they&amp;#8217;ve assigned to their best workers. Those tasks pay between \$12 and \$15 per hour, so no matter what I&amp;#8217;m working on I will stop what I&amp;#8217;m doing to work on them. The best paying hits are always done very quickly, so most of the time if you check out mturk and look at the tasks available you won&amp;#8217;t get a very good idea of average pay because the terrible paying hits will sit on the board until they expire.&lt;/p&gt;&lt;/blockquote&gt;


&lt;p&gt;Obviously, this is self-reported, so there&amp;#8217;s a strong possibility that the Turkers are artificially inflating their numbers. But this does match what I&amp;#8217;ve been told by a manager on the MTurk team, as well as what Turkers self-report on &lt;a href=&quot;http://turkernation.com/&quot;&gt;TurkerNation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A good suggestion regarding pay is to start at the lower end of the scale, around \$6-8 per hour, and increase that until you get both the quality and speed you want.&lt;/p&gt;
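
&lt;p&gt;Because pay is set per judgment rather than per hour, it helps to convert between the two using the average time per item. Here is a small sketch of that arithmetic; the function names and the 30-second figure are made up for the example.&lt;/p&gt;

```python
def per_judgment_pay(hourly_rate, avg_seconds_per_judgment):
    """Pay per judgment (in dollars) that works out to the target hourly rate."""
    return hourly_rate * avg_seconds_per_judgment / 3600.0


def effective_hourly_rate(pay_per_judgment, avg_seconds_per_judgment):
    """Hourly rate (in dollars) implied by a per-judgment price."""
    return pay_per_judgment * 3600.0 / avg_seconds_per_judgment


# If an item takes 30 seconds on average, a 6-dollar hourly target
# means paying about 5 cents per judgment.
low_end = per_judgment_pay(6.0, 30)

# Going the other way: 10 cents per 30-second judgment is about 12 dollars an hour.
implied = effective_hourly_rate(0.10, 30)
```

&lt;p&gt;Running a small trial first gives you the average time per item, which is the number this conversion depends on.&lt;/p&gt;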

&lt;h1&gt;Other design tips&lt;/h1&gt;

&lt;p&gt;Interestingly, according to what Turkers (see the excerpts above) and my Amazon contact say, as well as other research I&amp;#8217;ve seen (e.g., &lt;a href=&quot;http://groups.csail.mit.edu/uid/deneme/?p=680&quot;&gt;this paper&lt;/a&gt;), pay is &lt;em&gt;not&lt;/em&gt; at the absolute forefront of Turkers&amp;#8217; minds when they decide what to work on. Instead, they focus more on requesters they&amp;#8217;ve already established a good relationship with, HITs with many items (so they can quickly settle into a rhythm), HITs they know they&amp;#8217;ll be paid for (so they&amp;#8217;re not worried about rejections), and HITs that they generally enjoy doing more.&lt;/p&gt;

&lt;p&gt;So here are a couple suggestions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If your task is hard and there&amp;#8217;s no clearly correct answer, even good Turkers might be worried that you&amp;#8217;ll reject their judgments (and so they might skip over your HIT). So make it clear in your instructions that you won&amp;#8217;t reject any judgments, or that you won&amp;#8217;t reject any judgments with an honest effort.&lt;/li&gt;
&lt;li&gt;Make your instructions collapsible, or link to them on a separate page. Scrolling is kind of annoying on Mechanical Turk (I know &amp;#8211; I&amp;#8217;ve tried working on HITs myself), so you should minimize the amount workers have to scroll. Ideally, everything fits on a single screen. Plus, the less workers have to scroll, the faster your HITs will get done. For example, here are excerpts from emails I received from two different Turkers when I first started out:&lt;/li&gt;
&lt;/ul&gt;


&lt;blockquote&gt;&lt;p&gt;I have a suggestion that would really make things go a little quicker. Is there anyway you could script the twitter link to automatically open in a new tab? It amazes me how much it can slow you down to have to right click and open it manually in another tab, and when you forget, you have to take a few more steps to get back to where you were.&lt;/p&gt;&lt;/blockquote&gt;




&lt;blockquote&gt;&lt;p&gt;It would be amazing if the Twitter account could be on the same page instead of having to click to get to another screen - the work would go *exponentially* faster! Overall, I&amp;#8217;m enjoying them - and I&amp;#8217;m not the only one. Despite your stringent requirements these are disappearing pretty quickly.&lt;/p&gt;&lt;/blockquote&gt;


&lt;ul&gt;
&lt;li&gt;Introduce yourself on &lt;a href=&quot;http://turkernation.com/&quot;&gt;TurkerNation&lt;/a&gt;, a forum where Turkers and Requesters go to talk about Mechanical Turk. This helps establish your reputation as a good requester who listens to feedback, which will make good Turkers want to work for you. (More on this below.)&lt;/li&gt;
&lt;li&gt;Approve judgments quickly: Turkers want &lt;a href=&quot;http://en.wikipedia.org/wiki/Hyperbolic_discounting&quot;&gt;money now instead of money later&lt;/a&gt;. For example, one worker told me:&lt;/li&gt;
&lt;/ul&gt;


&lt;blockquote&gt;&lt;p&gt;Quick approval is important, too. Watching that money pile up is a serious motivator; I&amp;#8217;ll sometimes choose a lower-paying task that approves in close to real time over a higher-paying one that won&amp;#8217;t pay out for several days.&lt;/p&gt;&lt;/blockquote&gt;


&lt;p&gt;When using my trusted set of workers, I let Amazon auto-approve all judgments within a couple hours.&lt;/p&gt;

&lt;h1&gt;Reputation&lt;/h1&gt;

&lt;p&gt;Reputation is pretty important. Turkers love requesters who take the time to respond to emails and incorporate suggestions. Excerpts from emails I&amp;#8217;ve received:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;I LOVE it when requesters care enough to ask the opinion of us lowly turkers and am more than willing to take a few minutes to help them with anything. I look forward to seeing what you cook up!&lt;/p&gt;&lt;/blockquote&gt;




&lt;blockquote&gt;&lt;p&gt;Thanks for taking the time to try to make your hits better in both pay and design. It&amp;#8217;s great to see a requester that actually cares, when most don&amp;#8217;t. If you have any other questions for me, feel free to ask. I hope to work for you again soon.&lt;/p&gt;&lt;/blockquote&gt;




&lt;blockquote&gt;&lt;p&gt;From my own experience, I work hardest and best for a requester that pays well and doesn&amp;#8217;t reject (or at least seems to have a reason for a rejection when it happens). If a requester is going to accept the majority of my work, I as a worker feel that obligates me to provide them with the best quality possible. Similarly, although I&amp;#8217;m conscientious with all tasks, I&amp;#8217;m especially so with a high-paying one: it would be easy to take advantage of a high-pay, low-reject requester - which would ultimately lead them to either lower the pay or change the acceptance criteria. I don&amp;#8217;t want that!! That&amp;#8217;s the kind of requester I want around. I&amp;#8217;m grateful for high pay and fair policies and that kind of requester gets an above-and-beyond effort from me.&lt;/p&gt;&lt;/blockquote&gt;


&lt;p&gt;I&amp;#8217;ve gotten great suggestions from a lot of Turkers (sometimes, when launching a new type of experiment, I&amp;#8217;ll do a quick trial run in order to get some fast feedback before spending more time on the HIT design), and I suspect it&amp;#8217;s partly because I&amp;#8217;ve taken the time to connect with my workers.&lt;/p&gt;

&lt;p&gt;So, as suggested above, one way of quickly garnering some goodwill when you&amp;#8217;re first getting started is to make a post introducing yourself on TurkerNation. (There&amp;#8217;s a &lt;a href=&quot;http://turkernation.com/forumdisplay.php?23-Requester-Introductions&quot;&gt;sub-forum&lt;/a&gt; devoted to this exact purpose, in fact.)&lt;/p&gt;

&lt;p&gt;This is useful because workers will often start new threads recommending particular requesters and encouraging other Turkers to work for them. In the amusing thread praising me, for example, one worker mentioned that she&amp;#8217;d been hesitant to work on my HITs until she saw the post confirming I was a good requester.&lt;/p&gt;

&lt;p&gt;Also, many Turkers mention that they always refer to &lt;a href=&quot;http://turkopticon.differenceengines.com/&quot;&gt;Turkopticon&lt;/a&gt;, a Firefox extension that displays ratings of requesters by other Turkers, before accepting work from a requester they haven&amp;#8217;t worked for before.&lt;/p&gt;

&lt;p&gt;This is what Turkopticon looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/mturk/jimyoung.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/mturk/jimyoung.png&quot; alt=&quot;Jim Young&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/mturk/productrnr.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/mturk/productrnr.png&quot; alt=&quot;ProductRNR&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some comments about Turkopticon on TurkerNation:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;I think that it is well worth taking the time to check reputation of requesters via TurkOpticon and/or in this forum. Checking first substantially minimizes your risk of rejection, of being blocked, and of being paid sub-human wages.&lt;/p&gt;&lt;/blockquote&gt;




&lt;blockquote&gt;&lt;p&gt;Blindly doing hits for requesters that were never heard of before got me with a pretty bad approval rate when I first started turking. After that, I rigorously inspect every requester that doesn&amp;#8217;t have any ratings on Turkopticon. Actually, because of that little add-on I&amp;#8217;ve been able to maintain a steady 98-99% approval rate ever since I began using it.&lt;/p&gt;&lt;/blockquote&gt;


&lt;h1&gt;Waiting Time&lt;/h1&gt;

&lt;p&gt;So how long does it take to get judgments? I&amp;#8217;ve restricted the available worker pool pretty strongly to ensure high quality, and it&amp;#8217;s still only taken a few hours to get a thousand judgments.&lt;/p&gt;

&lt;p&gt;That&amp;#8217;s pretty awesome. I&amp;#8217;ve worked a lot with human evaluation systems before, but always using a small in-house set of judges &amp;#8211; and what with constraints on when those judges were available, how much they were able to work each week, and other tasks taking higher priority, it&amp;#8217;d invariably take at least a few days before I&amp;#8217;d receive any useful data back.&lt;/p&gt;

&lt;p&gt;Getting thousands of judgments in a couple hours means I can launch an MTurk task when I leave for work in the morning and have it done before lunch, which makes experimenting with a lot of different ideas much faster and easier.&lt;/p&gt;

&lt;h1&gt;Scale&lt;/h1&gt;

&lt;p&gt;So how many judgments can you actually get before you run out of workers? I&amp;#8217;m still a small fish in the MTurk system, but I&amp;#8217;m told by my MTurk contact at Amazon that there are companies getting over a million judgments each month.&lt;/p&gt;

&lt;p&gt;I also asked my pool of workers how much they&amp;#8217;re available to work, in case I would need to scale up to more judgments later on, and here are some samples from what they said:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Typically, I work a total of 20-25 hours per week for a small select group of requesters.  I could put in at least 20 hours per week for you alone if you were to make a custom qualification for me.  If I know that I can continue to do exemplary work beyond 20 hours, I would be willing to put forth more hours of work.  I want to make sure that you are getting the quality of work that you need.&lt;/p&gt;&lt;/blockquote&gt;




&lt;blockquote&gt;&lt;p&gt;On a day when I don&amp;#8217;t have those other assignments, I&amp;#8217;d guess I&amp;#8217;m turking 5 to 7 hours a day  (including weekends). I like to look for a large batch of HITs (preferably in the thousands) so that I can settle into a groove of being able to do them fairly quickly and once I find something like that I can happily settle in for several hours at a time.&lt;/p&gt;&lt;/blockquote&gt;




&lt;blockquote&gt;&lt;p&gt;I spend more time than the average person on mturk. I log on at about 5:30 AM and am constantly checking for work throughout the day. If the work is available, I will spend until 9PM working. Granted, I do have to take some breaks throughout the day to take care of my 3 year old, but for the most part, I am doing my best to earn while the hits are posted. If I take any time off, it is on the weekend (if I reach my earning goals for the week).&lt;/p&gt;&lt;/blockquote&gt;




&lt;blockquote&gt;&lt;p&gt;Of course, how much I can work varies. My main source of income is transcription for a market research company and mturk fills in my downtime. If I have an audio file from them, that gets my attention. If not, I&amp;#8217;m on mturk. As a single mother working from home, I love the flexibility.&lt;/p&gt;&lt;/blockquote&gt;


&lt;h1&gt;End&lt;/h1&gt;

&lt;p&gt;I&amp;#8217;ll end with a couple other notes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do other companies use human evaluation systems? Google and Bing use human judgments in their search metrics, though I think they use an in-house set of judges rather than Mechanical Turk. I&amp;#8217;ve heard Aardvark and Quora used Mechanical Turk to seed answers when they first launched their sites. There&amp;#8217;s also a nice set of case studies &lt;a href=&quot;http://aws.amazon.com/solutions/case-studies/&quot;&gt;here&lt;/a&gt; (search for the &amp;#8220;On-Demand Workforce&amp;#8221; section); in particular, Knewton&amp;#8217;s use of &lt;a href=&quot;http://aws.amazon.com/solutions/case-studies/knewton/&quot;&gt;MTurk for performance and QA testing&lt;/a&gt; is pretty interesting.&lt;/li&gt;
&lt;li&gt;I&amp;#8217;ve described one way of finding good workers, namely, using the filters Amazon provides. Another way could be to build a reputation system yourself, perhaps using an EM-style algorithm to determine judge quality.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://crowdflower.com/&quot;&gt;Crowdflower&lt;/a&gt; is another crowdsourcing system. It differs from MTurk in a few ways:

&lt;ul&gt;
&lt;li&gt;Crowdflower&amp;#8217;s worker pool consists of about 20 different sources, including Mechanical Turk, as well as sources like TrialPay (people can opt to complete a Crowdflower task to receive some kind of TrialPay deal).&lt;/li&gt;
&lt;li&gt;Crowdflower offers both a self-serve platform (like MTurk) and a more enterprise-centric solution (where you work directly with a Crowdflower employee). The enterprise offering is pretty nice, since Crowdflower will take care of the lower-level details for you (like actually designing and creating the job), and they can offer suggestions for designing the HIT based on their experience.&lt;/li&gt;
&lt;li&gt;Crowdflower provides the option of adding gold standard judgments to your task (items where you provide a golden answer, which are then randomly shown to workers; these are then used to monitor judges) and they try to automatically determine judge quality and item accuracy for you (e.g., by having each item judged by three different workers).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;An excellent crowdsourcing resource is &lt;a href=&quot;http://crowdscope.org/index.php?title=Main_Page&quot;&gt;CrowdScope&lt;/a&gt;. I also like the &lt;a href=&quot;http://groups.csail.mit.edu/uid/deneme/&quot;&gt;Deneme&lt;/a&gt; blog (though it hasn&amp;#8217;t been updated in a while) for a lot of fun experiments. Panos Ipeirotis&amp;#8217; &lt;a href=&quot;http://www.behind-the-enemy-lines.com/&quot;&gt;blog&lt;/a&gt; has good information as well.&lt;/li&gt;
&lt;/ul&gt;
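
&lt;p&gt;As a rough illustration of the do-it-yourself reputation idea above, here&amp;#8217;s a minimal Ruby sketch of a hard-assignment, EM-style loop. It alternates between picking a consensus label for each item (weighting each judge by the log-odds of their current accuracy) and re-estimating each judge&amp;#8217;s accuracy as their agreement rate with that consensus. The input format, the function name, the judge names, the starting accuracy of 0.8, and the clamping bounds are all assumptions made up for illustration; a real system would more likely fit a full probabilistic model.&lt;/p&gt;

```ruby
# judgments: array of [judge, item, label] triples (a hypothetical format).
# Returns [consensus label per item, estimated accuracy per judge].
def estimate_judge_quality(judgments, iterations = 10)
  accuracy = Hash.new(0.8) # assumed starting accuracy for every judge
  consensus = {}

  iterations.times do
    # E-like step: weighted vote per item, using the log-odds of each
    # judge's current accuracy as the weight of their vote.
    consensus = judgments.group_by { |_, item, _| item }.transform_values do |votes|
      scores = Hash.new(0.0)
      votes.each do |judge, _, label|
        scores[label] += Math.log(accuracy[judge] / (1.0 - accuracy[judge]))
      end
      scores.max_by { |_, score| score }.first
    end

    # M-like step: a judge's accuracy is their agreement rate with the
    # consensus, clamped away from 0 and 1 so the log-odds stay finite.
    accuracy = judgments.group_by { |judge, _, _| judge }.transform_values do |votes|
      agreed = votes.count { |_, item, label| consensus[item] == label }
      (agreed.to_f / votes.size).clamp(0.05, 0.95)
    end
  end

  [consensus, accuracy]
end

# Toy data: ann and bob always agree; cat disagrees on items 1 and 3.
judgments = [
  [:ann, 1, :good], [:bob, 1, :good], [:cat, 1, :bad],
  [:ann, 2, :bad],  [:bob, 2, :bad],  [:cat, 2, :bad],
  [:ann, 3, :good], [:bob, 3, :good], [:cat, 3, :bad],
  [:ann, 4, :good], [:bob, 4, :good], [:cat, 4, :good]
]
consensus, accuracy = estimate_judge_quality(judgments)
# consensus: item 1 is :good, item 2 is :bad, item 3 is :good, item 4 is :good
# accuracy:  ann 0.95, bob 0.95, cat 0.5
```

&lt;p&gt;Gold standard items, where available, could replace the consensus label in the agreement count.&lt;/p&gt;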

</content>
  </entry>
  
  <entry>
    <title>Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process</title>
    <link href="http://blog.echen.me/2012/03/20/infinite-mixture-models-with-nonparametric-bayes-and-the-dirichlet-process/"/>
    <updated>2012-03-20T09:14:00-07:00</updated>
    <id>http://blog.echen.me/2012/03/20/infinite-mixture-models-with-nonparametric-bayes-and-the-dirichlet-process</id>
    <content type="html">&lt;p&gt;Imagine you&amp;#8217;re a budding chef. A data-curious one, of course, so you start by taking a set of foods (pizza, salad, spaghetti, etc.) and ask 10 friends how much of each they ate in the past day.&lt;/p&gt;

&lt;p&gt;Your goal: to find natural &lt;em&gt;groups&lt;/em&gt; of foodies, so that you can better cater to each cluster&amp;#8217;s tastes. For example, your fratboy friends might love &lt;a href=&quot;https://twitter.com/#!/edchedch/status/166343879547822080&quot;&gt;wings and beer&lt;/a&gt;, your anime friends might love soba and sushi, your hipster friends probably dig tofu, and so on.&lt;/p&gt;

&lt;p&gt;So how can you use the data you&amp;#8217;ve gathered to discover different kinds of groups?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/clustering-example.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/clustering-example.png&quot; alt=&quot;Clustering Example&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One way is to use a standard clustering algorithm like &lt;strong&gt;k-means&lt;/strong&gt; or &lt;strong&gt;Gaussian mixture modeling&lt;/strong&gt; (see &lt;a href=&quot;http://blog.echen.me/2011/03/19/counting-clusters/&quot;&gt;this previous post&lt;/a&gt; for a brief introduction). The problem is that both assume a &lt;em&gt;fixed&lt;/em&gt; number of clusters, which they must be told in advance. There are a couple of methods for selecting the number of clusters to learn (e.g., the &lt;a href=&quot;http://blog.echen.me/2011/03/19/counting-clusters/&quot;&gt;gap and prediction strength statistics&lt;/a&gt;), but the problem is more fundamental: most real-world data simply doesn&amp;#8217;t have a fixed number of clusters.&lt;/p&gt;

&lt;p&gt;That is, suppose we&amp;#8217;ve asked 10 of our friends what they ate in the past day, and we want to find groups of eating preferences. There&amp;#8217;s really an infinite number of foodie types (carnivore, vegan, snacker, Italian, healthy, fast food, heavy eaters, light eaters, and so on), but with only 10 friends, we simply don&amp;#8217;t have enough data to detect them all. (Indeed, we&amp;#8217;re limited to at most 10 clusters!) So whereas k-means starts with the incorrect assumption that there&amp;#8217;s a fixed, finite number of clusters that our points come from, &lt;em&gt;no matter how much data we feed it&lt;/em&gt;, what we&amp;#8217;d really like is a method positing an infinite number of hidden clusters that naturally emerge as we ask more friends about their food habits. (For example, with only 2 data points, we might not be able to tell the difference between vegans and vegetarians, but with 200 data points, we probably could.)&lt;/p&gt;

&lt;p&gt;Luckily for us, this is precisely the purview of &lt;strong&gt;nonparametric Bayes&lt;/strong&gt;.*&lt;/p&gt;

&lt;p&gt;*Nonparametric Bayes refers to a class of techniques that allow some parameters to change with the data. In our case, for example, instead of fixing the number of clusters to be discovered, we allow it to grow as more data comes in.&lt;/p&gt;

&lt;h1&gt;A Generative Story&lt;/h1&gt;

&lt;p&gt;Let&amp;#8217;s describe a generative model for finding clusters in any set of data. We assume an infinite set of latent groups, where each group is described by some set of parameters. For example, each group could be a Gaussian with a specified mean $\mu_i$ and standard deviation $\sigma_i$, and these group parameters themselves are assumed to come from some base distribution $G_0$. Data is then generated in the following manner:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select a cluster.&lt;/li&gt;
&lt;li&gt;Sample from that cluster to generate a new point.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;(Note the resemblance to a &lt;a href=&quot;http://en.wikipedia.org/wiki/Mixture_model&quot;&gt;finite mixture model&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;For example, suppose we ask 10 friends how many calories of pizza, salad, and rice they ate yesterday. Our groups could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Gaussian centered at (pizza = 5000, salad = 100, rice = 500) (i.e., a pizza lovers group).&lt;/li&gt;
&lt;li&gt;A Gaussian centered at (pizza = 100, salad = 3000, rice = 1000) (maybe a vegan group).&lt;/li&gt;
&lt;li&gt;A Gaussian centered at (pizza = 100, salad = 100, rice = 10000) (definitely Asian).&lt;/li&gt;
&lt;li&gt;&amp;#8230;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;When deciding what to eat when she woke up yesterday, Alice could have thought &lt;em&gt;girl, I&amp;#8217;m in the mood for pizza&lt;/em&gt; and her food consumption yesterday would have been a sample from the pizza Gaussian. Similarly, Bob could have spent the day in Chinatown, thereby sampling from the Asian Gaussian for his day&amp;#8217;s meals. And so on.&lt;/p&gt;
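
&lt;p&gt;To make the two generative steps concrete, here&amp;#8217;s a minimal Ruby sketch that truncates the story to the three example groups above. The group names, the mixing weights, the shared standard deviation, and all function names are assumptions made up for illustration. The model itself posits infinitely many groups, and how a group actually gets selected is the subject of the next section.&lt;/p&gt;

```ruby
# Group centers (calories of pizza, salad, rice) come from the example above.
# The weights are hypothetical: they say how often each group is selected.
GROUPS = {
  pizza_lovers: { center: [5000, 100, 500],  weight: 0.5 },
  salad_lovers: { center: [100, 3000, 1000], weight: 0.3 },
  rice_lovers:  { center: [100, 100, 10000], weight: 0.2 }
}

# Draw one value from a Gaussian using the Box-Muller transform.
def gaussian(mean, std_dev, rng)
  u1 = 1.0 - rng.rand # lies in (0, 1], so the logarithm is finite
  u2 = rng.rand
  mean + std_dev * Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math::PI * u2)
end

# Step 1: select a cluster, with probability proportional to its weight.
def select_group(groups, rng)
  target = rng.rand * groups.values.sum { |group| group[:weight] }
  groups.each do |name, group|
    target -= group[:weight]
    return name if target.negative?
  end
  groups.keys.last # only reached through floating-point rounding
end

# Step 2: sample from that cluster to generate a new point.
def generate_point(groups, std_dev, rng)
  name = select_group(groups, rng)
  point = groups[name][:center].map { |mean| gaussian(mean, std_dev, rng) }
  [name, point]
end

# One friend's day: which group they sampled from, and what they ate.
friend_group, calories = generate_point(GROUPS, 200.0, Random.new)
```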

&lt;p&gt;The big question, then, is: how do we assign each friend to a group?&lt;/p&gt;

&lt;h1&gt;Assigning Groups&lt;/h1&gt;

&lt;h2&gt;Chinese Restaurant Process&lt;/h2&gt;

&lt;p&gt;One way to assign friends to groups is to use a &lt;strong&gt;Chinese Restaurant Process&lt;/strong&gt;. This works as follows: Imagine a restaurant where all your friends went to eat yesterday&amp;#8230;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initially the restaurant is empty.&lt;/li&gt;
&lt;li&gt;The first person to enter (Alice) sits down at a table (selects a group). She then orders food for the table (i.e., she selects parameters for the group); everyone else who joins the table will then be limited to eating from the food she ordered.&lt;/li&gt;
&lt;li&gt;The second person to enter (Bob) sits down at a table. Which table does he sit at? With probability $\alpha / (1 + \alpha)$ he sits down at a new table (i.e., selects a new group) and orders food for the table; with probability $1 / (1 + \alpha)$ he sits with Alice and eats from the food she&amp;#8217;s already ordered (i.e., he&amp;#8217;s in the same group as Alice).&lt;/li&gt;
&lt;li&gt;&amp;#8230;&lt;/li&gt;
&lt;li&gt;The (n+1)-st person sits down at a new table with probability $\alpha / (n + \alpha)$, and at table k with probability $n_k / (n + \alpha)$, where $n_k$ is the number of people currently sitting at table k.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Note a couple things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The more people (data points) there are at a table (cluster), the more likely it is that people (new data points) will join it. In other words, our groups satisfy a &lt;strong&gt;rich get richer&lt;/strong&gt; property.&lt;/li&gt;
&lt;li&gt;There&amp;#8217;s always a nonzero probability that someone joins an entirely new table (i.e., a new group is formed).&lt;/li&gt;
&lt;li&gt;The probability of a new group depends on $\alpha$. So we can think of $\alpha$ as a &lt;strong&gt;dispersion parameter&lt;/strong&gt; that affects the dispersion of our data points. The lower $\alpha$ is, the more tightly clustered our data points; the higher it is, the more clusters we have in any finite set of points.&lt;/li&gt;
&lt;/ul&gt;
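
&lt;p&gt;As a quick worked example of the seating rule, here&amp;#8217;s a small Ruby sketch that computes where the next customer is likely to sit, given how many people are already at each table. The function name and the example counts are hypothetical.&lt;/p&gt;

```ruby
# Seating probabilities for the next customer, given the number of people
# already at each table (`table_counts`) and the dispersion parameter `alpha`.
#
# Returns [probability of joining each existing table, probability of a new table].
def seating_probabilities(table_counts, alpha)
  n = table_counts.sum
  denominator = (n + alpha).to_f
  existing = table_counts.map { |n_k| n_k / denominator }
  [existing, alpha / denominator]
end

# Four customers so far: three at table 1 and one at table 2, with alpha = 1.
existing, new_table = seating_probabilities([3, 1], 1)
# existing  => [0.6, 0.2]  (the fuller table is three times as likely)
# new_table => 0.2
```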


&lt;p&gt;(Also notice the resemblance between table selection probabilities and a Dirichlet distribution&amp;#8230;)&lt;/p&gt;

&lt;p&gt;Just to summarize, given n data points, the Chinese Restaurant Process specifies a distribution over partitions (table assignments) of these points. We can also generate parameters for each partition/table from a base distribution $G_0$ (for example, each table could represent a Gaussian whose mean and standard deviation are sampled from $G_0$), though to be clear, this is not part of the CRP itself.&lt;/p&gt;
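
&lt;p&gt;To illustrate that second, separate step, here&amp;#8217;s a minimal Ruby sketch that takes a partition (an array of table assignments) and attaches one parameter draw to each table. The function name is hypothetical, the block stands in for $G_0$, and the uniform base distribution in the usage lines is just an assumption for the example.&lt;/p&gt;

```ruby
# Attach parameters to a partition produced by a CRP.
#
# `table_assignments` is an array of table indices (one per customer), and the
# block plays the role of the base distribution G_0: each call should return a
# fresh parameter draw. Every table gets exactly one draw, shared by everyone
# seated there.
def draw_table_parameters(table_assignments)
  parameters = {}
  table_assignments.each do |table|
    parameters[table] ||= yield
  end
  parameters
end

rng = Random.new(2012)        # fixed seed, only so that reruns are repeatable
assignments = [1, 2, 1, 3, 2] # e.g., the output of a CRP simulation
# Hypothetical G_0: a cluster mean drawn uniformly from [0, 10).
means = draw_table_parameters(assignments) { rng.rand * 10 }
per_customer = assignments.map { |table| means[table] }
```

&lt;p&gt;Customers at the same table end up sharing a parameter, while the partition itself is left untouched.&lt;/p&gt;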

&lt;h3&gt;Code&lt;/h3&gt;

&lt;p&gt;Since code makes everything better, here&amp;#8217;s some Ruby to simulate a CRP:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Chinese Restaurant Process &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/dirichlet-process/blob/master/chinese_restaurant_process.rb&#8217;&gt;chinese_restaurant_process.rb&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;19&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;20&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;21&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;22&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;23&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;24&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;25&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;26&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;27&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;ruby&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# Generate table assignments for `num_customers` customers, according to&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# a Chinese Restaurant Process with dispersion parameter `alpha`.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;#&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# returns an array of integer table assignments&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;chinese_restaurant_process&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_customers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_customers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt; &lt;span class=&quot;n&quot;&gt;table_assignments&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# first customer sits at table 1&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt; &lt;span class=&quot;n&quot;&gt;next_open_table&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# index of the next empty table&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt; &lt;span class=&quot;c1&quot;&gt;# Now generate table assignments for the rest of the customers.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;upto&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_customers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;   &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;rand&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;     &lt;span class=&quot;c1&quot;&gt;# Customer sits at new table.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;     &lt;span class=&quot;n&quot;&gt;table_assignments&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;next_open_table&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;     &lt;span class=&quot;n&quot;&gt;next_open_table&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;   &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;     &lt;span class=&quot;c1&quot;&gt;# Customer sits at an existing table.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;     &lt;span class=&quot;c1&quot;&gt;# He chooses which table to sit at by giving equal weight to each&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;     &lt;span class=&quot;c1&quot;&gt;# customer already sitting at a table. &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;     &lt;span class=&quot;n&quot;&gt;which_table&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;table_assignments&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;table_assignments&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;     &lt;span class=&quot;n&quot;&gt;table_assignments&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;which_table&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;   &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt; &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt; &lt;span class=&quot;n&quot;&gt;table_assignments&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;And here&amp;#8217;s some sample output:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Chinese Restaurant Process &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/dirichlet-process/blob/master/chinese_restaurant_process.rb&#8217;&gt;chinese_restaurant_process.rb&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;ruby&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;chinese_restaurant_process&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_customers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# table assignments from run 1&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# table assignments from run 2&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# table assignments from run 3&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;chinese_restaurant_process&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_customers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;chinese_restaurant_process&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_customers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;Notice that as we increase $\alpha$, the number of distinct tables tends to increase as well.&lt;/p&gt;
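
&lt;p&gt;This trend can be quantified. The customer who arrives after $i$ others opens a new table with probability $\alpha / (\alpha + i)$, so by linearity of expectation the expected number of tables for $n$ customers is $\sum_{i=0}^{n-1} \alpha / (\alpha + i)$. Here&amp;#8217;s a small Ruby helper (the name is hypothetical) that evaluates this sum:&lt;/p&gt;

```ruby
# Expected number of occupied tables after `num_customers` customers have
# been seated by a Chinese Restaurant Process with dispersion `alpha`.
def expected_num_tables(num_customers, alpha)
  # Customer number i + 1 starts a new table with probability alpha / (alpha + i).
  (0...num_customers).sum { |i| alpha.to_f / (alpha + i) }
end

[1, 3, 5].each do |alpha|
  puts "alpha = #{alpha}: #{expected_num_tables(10, alpha).round(2)} tables on average"
end
# alpha = 1: 2.93 tables on average
# alpha = 3: 4.81 tables on average
# alpha = 5: 5.84 tables on average
```

&lt;p&gt;The sample runs above are single random draws, so their table counts scatter around these averages.&lt;/p&gt;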

&lt;h2&gt;Polya Urn Model&lt;/h2&gt;

&lt;p&gt;Another method for assigning friends to groups is to follow the &lt;strong&gt;Polya Urn Model&lt;/strong&gt;. This is basically the same model as the Chinese Restaurant Process, just with a different metaphor.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We start with an urn containing $\alpha G_0(x)$ balls of &amp;#8220;color&amp;#8221; $x$, for each possible value of $x$. ($G_0$ is our base distribution, and $G_0(x)$ is the probability of sampling $x$ from $G_0$.) Note that these are possibly fractional balls.&lt;/li&gt;
&lt;li&gt;At each time step, draw a ball from the urn, note its color, and then drop both the original ball plus a new ball of the same color back into the urn.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Note the connection between this process and the CRP: balls correspond to people (i.e., data points), colors correspond to table assignments (i.e., clusters), $\alpha$ is again a dispersion parameter (put differently, a prior), colors satisfy a rich-get-richer property (since colors with many balls are more likely to get drawn), and so on. (Again, there&amp;#8217;s also a connection between this urn model and &lt;a href=&quot;http://en.wikipedia.org/wiki/Dirichlet_distribution#P.C3.B3lya.27s_urn&quot;&gt;the urn model for the (finite) Dirichlet distribution&lt;/a&gt;&amp;#8230;)&lt;/p&gt;

&lt;p&gt;To be precise, the difference between the CRP and the Polya Urn Model is that the CRP specifies only a distribution over &lt;em&gt;partitions&lt;/em&gt; (i.e., table assignments), but doesn&amp;#8217;t assign parameters to each group, whereas the Polya Urn Model does both.&lt;/p&gt;

&lt;h3&gt;Code&lt;/h3&gt;

&lt;p&gt;Again, here&amp;#8217;s some code for simulating a Polya Urn Model:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Polya Urn Model &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/dirichlet-process/blob/master/polya_urn_model.rb&#8217;&gt;polya_urn_model.rb&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;19&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;20&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;21&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;22&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;23&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;ruby&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# Draw `num_balls` colored balls according to a Polya Urn Model&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# with a specified base color distribution and dispersion parameter&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# `alpha`.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;#&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# returns an array of ball colors&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;polya_urn_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;base_color_distribution&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_balls&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_balls&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;balls_in_urn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;upto&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_balls&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;rand&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;balls_in_urn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;# Draw a new color, put a ball of this color in the urn.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;n&quot;&gt;new_color&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;base_color_distribution&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;call&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;n&quot;&gt;balls_in_urn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_color&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;# Draw a ball from the urn, add another ball of the same color.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;n&quot;&gt;ball&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;balls_in_urn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;balls_in_urn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;n&quot;&gt;balls_in_urn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ball&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;balls_in_urn&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;And here&amp;#8217;s some sample output, using a uniform distribution over the unit interval (truncated to two decimal places in the code below) as the color distribution to sample from:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Polya Urn Model &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/dirichlet-process/blob/master/polya_urn_model.rb&#8217;&gt;polya_urn_model.rb&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;ruby&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unit_uniform&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;rand&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;polya_urn_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unit_uniform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_balls&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;alpha&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;27&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;89&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;89&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;89&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;73&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;98&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;43&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;98&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;89&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;53&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# colors in the urn from run 1&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;26&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;26&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;46&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;26&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;26&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;26&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;26&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;26&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;26&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;85&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# colors in the urn from run 2&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;96&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;87&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;96&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;87&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;96&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;96&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;87&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;96&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;96&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;96&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# colors in the urn from run 3&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
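&lt;p&gt;Recall that the urn yields both a partition and a parameter (the color) for each group. As a small illustration, the sketch below recovers CRP-style table numbers from a list of ball colors by numbering the colors in order of first appearance. The helper &lt;code&gt;colors_to_partition&lt;/code&gt; is hypothetical, and it assumes that equal colors always mean the same group; with a coarsely rounded base distribution, two independent fresh draws could collide and be merged.&lt;/p&gt;

```ruby
# Map each ball color to a group label, numbering groups in order of
# first appearance (so the first color seen becomes group 1).
def colors_to_partition(colors)
  labels = {}
  colors.map { |color| labels[color] ||= labels.size + 1 }
end

# Colors from run 1 of the sample output above.
run_1 = [0.27, 0.89, 0.89, 0.89, 0.73, 0.98, 0.43, 0.98, 0.89, 0.53]
p colors_to_partition(run_1)
# prints [1, 2, 2, 2, 3, 4, 5, 4, 2, 6]
```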


&lt;h3&gt;Code, Take 2&lt;/h3&gt;

&lt;p&gt;Here&amp;#8217;s the same code for a Polya Urn Model, but in R:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Polya Urn Model &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/dirichlet-process/blob/master/polya_urn_model.R&#8217;&gt;polya_urn_model.R&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;19&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;20&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# Return a vector of `num_balls` ball colors according to a Polya Urn Model&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# with dispersion `alpha`, sampling from a specified base color distribution.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;polya_urn_model &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;base_color_distribution&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; num_balls&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; alpha&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  balls &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;kr&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;i in seq_len&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;num_balls&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;kr&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;runif&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; alpha &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;alpha &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; length&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;balls&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;# Add a new ball color.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      new_color &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; base_color_distribution&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      balls &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;balls&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; new_color&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;# Pick out a ball from the urn, and add back a&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;# ball of the same color.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      ball &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; balls&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;sample&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;:length&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;balls&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      balls &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;balls&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; ball&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  balls
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;Here are some sample density plots of the colors in the urn, when using a unit normal as the base color distribution:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_1.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_1.png&quot; alt=&quot;Polya Urn Model, Alpha = 1&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_5.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_5.png&quot; alt=&quot;Polya Urn Model, Alpha = 5&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_25.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_25.png&quot; alt=&quot;Polya Urn Model, Alpha = 25&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_50.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_50.png&quot; alt=&quot;Polya Urn Model, Alpha = 50&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that as alpha increases (i.e., we sample more new ball colors from our base; i.e., as we place more weight on our prior), the colors in the urn tend to a unit normal (our base color distribution).&lt;/p&gt;

&lt;p&gt;And here are some sample plots of points generated by the urn, for varying values of alpha:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each color in the urn is sampled from a uniform distribution over $[0, 10] \times [0, 10]$ (i.e., a square of side 10).&lt;/li&gt;
&lt;li&gt;Each group is a Gaussian with standard deviation 0.1 and mean equal to its associated color, and these Gaussian groups generate points.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/alpha-0.1.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/alpha-0.1.png&quot; alt=&quot;Alpha 0.1&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/alpha-0.2.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/alpha-0.2.png&quot; alt=&quot;Alpha 0.2&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/alpha-0.3.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/alpha-0.3.png&quot; alt=&quot;Alpha 0.3&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/alpha-0.5.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/alpha-0.5.png&quot; alt=&quot;Alpha 0.5&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/alpha-1.0.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/alpha-1.0.png&quot; alt=&quot;Alpha 1.0&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that the points clump together in fewer clusters for low values of alpha, but become more dispersed as alpha increases.&lt;/p&gt;
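&lt;p&gt;A setup like the one behind these plots can be sketched in Ruby as follows. This is not the code that produced the figures; it is a minimal reconstruction under the stated assumptions (group centers uniform on the square, Gaussian noise with standard deviation 0.1 on each coordinate). The names &lt;code&gt;standard_normal&lt;/code&gt; and &lt;code&gt;urn_points&lt;/code&gt; are hypothetical, and normal draws use the Box-Muller transform because Ruby has no built-in Gaussian sampler.&lt;/p&gt;

```ruby
# Standard normal draw via the Box-Muller transform.
def standard_normal(rng)
  radius = Math.sqrt(-2.0 * Math.log(1.0 - rng.rand))
  radius * Math.cos(2.0 * Math::PI * rng.rand)
end

# Run a Polya urn whose "colors" are 2-D group centers drawn uniformly from
# [0, 10] x [0, 10]. Each ball then emits one point from a Gaussian with
# standard deviation `sd` around its center.
def urn_points(num_points, alpha, rng = Random.new, sd = 0.1)
  centers = []
  Array.new(num_points) do
    center =
      if alpha.to_f / (alpha + centers.size) > rng.rand
        [rng.rand * 10.0, rng.rand * 10.0]   # start a new group
      else
        centers[rng.rand(centers.size)]      # join an existing group
      end
    centers.push(center)
    { center: center, point: center.map { |c| c + sd * standard_normal(rng) } }
  end
end

[0.1, 1.0].each do |alpha|
  draws = urn_points(200, alpha, Random.new(1))
  num_groups = draws.map { |draw| draw[:center] }.uniq.size
  puts "alpha = #{alpha}: #{num_groups} groups among 200 points"
end
```

&lt;p&gt;Plotting the &lt;code&gt;:point&lt;/code&gt; values for several values of alpha should give pictures qualitatively similar to those above, though the exact clusters depend on the random seed.&lt;/p&gt;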

&lt;h2&gt;Stick-Breaking Process&lt;/h2&gt;

&lt;p&gt;Imagine running either the Chinese Restaurant Process or the Polya Urn Model indefinitely. For each group $i$, this gives a limiting proportion $w_i$ of points that fall into group $i$.&lt;/p&gt;

&lt;p&gt;So instead of running the CRP or Polya Urn model to figure out these proportions, can we simply generate them directly?&lt;/p&gt;

&lt;p&gt;This is exactly what the Stick-Breaking Process does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with a stick of length one.&lt;/li&gt;
&lt;li&gt;Generate a random variable $\beta_1 \sim Beta(1, \alpha)$. By the definition of the &lt;a href=&quot;http://en.wikipedia.org/wiki/Beta_distribution&quot;&gt;Beta distribution&lt;/a&gt;, this will be a real number between 0 and 1, with expected value $1 / (1 + \alpha)$. Break off the stick at $\beta_1$; $w_1$ is then the length of the stick on the left.&lt;/li&gt;
&lt;li&gt;Now take the stick to the right, and generate $\beta_2 \sim Beta(1, \alpha)$. Break this remaining stick a fraction $\beta_2$ of the way along it. Again, $w_2$ is the length of the piece to the left, i.e., $w_2 = (1 - \beta_1) \beta_2$.&lt;/li&gt;
&lt;li&gt;And so on.&lt;/li&gt;
&lt;/ul&gt;
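&lt;p&gt;Writing out the general step may help. Following the pattern of the first two breaks, the $k$-th weight is the fraction $\beta_k$ of whatever stick remains after the first $k - 1$ breaks:&lt;/p&gt;

&lt;p&gt;$$w_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j), \qquad \beta_j \sim Beta(1, \alpha).$$&lt;/p&gt;

&lt;p&gt;The sum of these weights telescopes, so the first $K$ weights account for everything except the piece of stick that is still unbroken:&lt;/p&gt;

&lt;p&gt;$$\sum_{k=1}^{K} w_k = 1 - \prod_{j=1}^{K} (1 - \beta_j).$$&lt;/p&gt;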


&lt;p&gt;Thus, the Stick-Breaking process is simply the CRP or Polya Urn Model from a different point of view. For example, assigning customers to table 1 according to the Chinese Restaurant Process is equivalent to assigning customers to table 1 with probability $w_1$.&lt;/p&gt;

&lt;h3&gt;Code&lt;/h3&gt;

&lt;p&gt;Here&amp;#8217;s some R code for simulating a Stick-Breaking process:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Stick-Breaking Process &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/dirichlet-process/blob/master/stick_breaking_process.R&#8217;&gt;stick_breaking_process.R&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# Return a vector of weights drawn from a stick-breaking process&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# with dispersion `alpha`.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;#&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# Recall that the kth weight is&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;#   w_k = (1 - \beta_1) * (1 - \beta_2) * ... * (1 - \beta_{k-1}) * \beta_k&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# where each \beta_i is drawn from a Beta distribution&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;#   \beta_i ~ Beta(1, \alpha)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;stick_breaking_process &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;kr&quot;&gt;function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;num_weights&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; alpha&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  betas &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; rbeta&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;num_weights&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; alpha&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  remaining_stick_lengths &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; cumprod&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; betas&lt;span class=&quot;p&quot;&gt;))[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;:num_weights&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  weights &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; remaining_stick_lengths &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; betas
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  weights
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
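&lt;p&gt;For readers following along in Ruby, here is a rough counterpart. Ruby&amp;#8217;s standard library has no Beta sampler, so this sketch inverts the CDF instead: a $Beta(1, \alpha)$ variable has CDF $1 - (1 - x)^\alpha$, so $1 - U^{1/\alpha}$ has the right distribution when $U$ is uniform on the unit interval. The helper &lt;code&gt;beta_one_alpha&lt;/code&gt; and the optional &lt;code&gt;rng&lt;/code&gt; argument are assumptions of this sketch, not part of the linked R script.&lt;/p&gt;

```ruby
# Draw one Beta(1, alpha) variate by inverting its CDF,
# F(x) = 1 - (1 - x)**alpha.
def beta_one_alpha(alpha, rng)
  u = rng.rand
  1.0 - u**(1.0 / alpha)
end

# Return the first `num_weights` stick-breaking weights with dispersion `alpha`.
def stick_breaking_process(num_weights, alpha, rng = Random.new)
  remaining_stick = 1.0
  Array.new(num_weights) do
    beta = beta_one_alpha(alpha, rng)
    weight = remaining_stick * beta   # piece broken off at this step
    remaining_stick *= 1.0 - beta     # what is left for later steps
    weight
  end
end

weights = stick_breaking_process(10, 3, Random.new(42))
puts weights.map { |w| w.round(3) }.inspect
puts "stick left over: #{(1.0 - weights.sum).round(3)}"
```

&lt;p&gt;The loop tracks the remaining stick explicitly rather than using a cumulative product, which is the same computation written step by step.&lt;/p&gt;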


&lt;p&gt;And here&amp;#8217;s some sample output:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/sbp_alpha_1.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/sbp_alpha_1.png&quot; alt=&quot;Stick-Breaking Process, alpha = 1&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/sbp_alpha_3.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/sbp_alpha_3.png&quot; alt=&quot;Stick-Breaking Process, alpha = 3&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/sbp_alpha_5.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/sbp_alpha_5.png&quot; alt=&quot;Stick-Breaking Process, alpha = 5&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that for low values of alpha, most of the stick&amp;#8217;s mass is concentrated on the first few weights (meaning our data points are concentrated on a few clusters), while the weights become more evenly dispersed as we increase alpha (meaning we posit more clusters in our data points).&lt;/p&gt;
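&lt;p&gt;This concentration can also be checked without sampling. Each $\beta_j \sim Beta(1, \alpha)$ has $E[1 - \beta_j] = \alpha / (1 + \alpha)$, and the $\beta_j$ are independent, so the expected total mass of the first $k$ weights is $1 - (\alpha / (1 + \alpha))^k$. The short Ruby sketch below evaluates that expression; the function name is hypothetical.&lt;/p&gt;

```ruby
# Expected total weight of the first k sticks under Beta(1, alpha) breaks:
# 1 - (alpha / (1 + alpha))**k.
def expected_mass_in_first(k, alpha)
  1.0 - (alpha.to_f / (1.0 + alpha))**k
end

[1, 3, 5].each do |alpha|
  puts format("alpha = %d: first 3 weights hold %.3f of the stick on average",
              alpha, expected_mass_in_first(3, alpha))
end
```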

&lt;h2&gt;Dirichlet Process&lt;/h2&gt;

&lt;p&gt;Suppose we run a Polya Urn Model several times, where we sample colors from a base distribution $G_0$. Each run produces a distribution of colors in the urn (say, 5% blue balls, 3% red balls, 2% pink balls, etc.), and the distribution will be different each time (for example, 5% blue balls in run 1, but 1% blue balls in run 2).&lt;/p&gt;

&lt;p&gt;For example, let&amp;#8217;s look again at the plots from above, where I generated samples from a Polya Urn Model with the standard unit normal as the base distribution:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_1.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_1.png&quot; alt=&quot;Polya Urn Model, Alpha = 1&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_5.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_5.png&quot; alt=&quot;Polya Urn Model, Alpha = 5&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_25.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_25.png&quot; alt=&quot;Polya Urn Model, Alpha = 25&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_50.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/polya_alpha_50.png&quot; alt=&quot;Polya Urn Model, Alpha = 50&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each run of the Polya Urn Model produces a slightly different distribution, though each is &amp;#8220;centered&amp;#8221; in some fashion around the standard Gaussian I used as base. In other words, the Polya Urn Model gives us a &lt;strong&gt;distribution over distributions&lt;/strong&gt; (we get a distribution of ball colors, and this distribution of colors changes each time) &amp;#8211; and so we finally get to the Dirichlet Process.&lt;/p&gt;

&lt;p&gt;Formally, given a base distribution $G_0$ and a dispersion parameter $\alpha$, a sample from the Dirichlet Process $DP(G_0, \alpha)$ is a distribution $G \sim DP(G_0, \alpha)$. This sample $G$ can be thought of as a distribution of colors in a single simulation of the Polya Urn Model; sampling from $G$ gives us the balls in the urn.&lt;/p&gt;

&lt;p&gt;So here&amp;#8217;s the connection between the Chinese Restaurant Process, the Polya Urn Model, the Stick-Breaking Process, and the Dirichlet Process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dirichlet Process&lt;/strong&gt;: Suppose we want samples $x_i \sim G$, where $G$ is a distribution sampled from the Dirichlet Process $G \sim DP(G_0, \alpha)$.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Polya Urn Model&lt;/strong&gt;: One way to generate these values $x_i$ would be to take a Polya Urn Model with color distribution $G_0$ and dispersion $\alpha$. ($x_i$ would be the color of the ith ball in the urn.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chinese Restaurant Process&lt;/strong&gt;: Another way to generate $x_i$ would be to first assign tables to customers according to a Chinese Restaurant Process with dispersion $\alpha$. Every customer at a given table would then be given the same value (color), sampled once per table from $G_0$. ($x_i$ would be the value given to the ith customer; $x_i$ can also be thought of as the food at the ith customer&amp;#8217;s table, or as the parameters of that table.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stick-Breaking Process&lt;/strong&gt;: Finally, we could generate weights $w_k$ according to a Stick-Breaking Process with dispersion $\alpha$. Next, we would give each weight $w_k$ a value (or color) $v_k$ sampled from $G_0$. Finally, we would assign $x_i$ to value (color) $v_k$ with probability $w_k$.&lt;/li&gt;
&lt;/ul&gt;
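
&lt;p&gt;As a concrete illustration of the stick-breaking route, here is a minimal Python sketch. It makes two assumptions that are not part of the definition: the infinite stick is truncated to &lt;code&gt;num_sticks&lt;/code&gt; pieces, and $G_0$ is the standard normal (as in the plots above). The function name is made up for this example.&lt;/p&gt;

```python
import random

def sample_dp_values(num_samples, alpha, num_sticks, rng):
    """Draw x_1..x_n from a truncated stick-breaking approximation of G ~ DP(G_0, alpha)."""
    weights, values = [], []
    remaining = 1.0
    for _ in range(num_sticks):
        beta = rng.betavariate(1.0, alpha)
        weights.append(remaining * beta)    # w_k
        values.append(rng.gauss(0.0, 1.0))  # v_k ~ G_0 (assumed standard normal)
        remaining *= 1.0 - beta
    # random.choices normalizes the weights, which absorbs the truncated tail.
    return rng.choices(values, weights=weights, k=num_samples)

rng = random.Random(42)  # arbitrary seed
xs = sample_dp_values(20, 1.0, 100, rng)
print(len(set(xs)), "distinct values among", len(xs), "samples")
```

&lt;p&gt;Repeated values in &lt;code&gt;xs&lt;/code&gt; correspond to data points that share a cluster.&lt;/p&gt;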


&lt;h1&gt;Recap&lt;/h1&gt;

&lt;p&gt;Let&amp;#8217;s summarize what we&amp;#8217;ve discussed so far.&lt;/p&gt;

&lt;p&gt;We have a bunch of data points $p_i$ that we want to cluster, and we&amp;#8217;ve described four essentially equivalent generative models that allow us to describe how each cluster and point could have arisen.&lt;/p&gt;

&lt;p&gt;In the &lt;strong&gt;Chinese Restaurant Process&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We generate table assignments $g_1, \ldots, g_n \sim CRP(\alpha)$ according to a Chinese Restaurant Process. ($g_i$ is the table assigned to datapoint $i$.)&lt;/li&gt;
&lt;li&gt;We generate table parameters $\phi_1, \ldots, \phi_n \sim G_0$ according to the base distribution $G_0$, where $\phi_k$ is the parameter for the kth distinct group.&lt;/li&gt;
&lt;li&gt;Given table assignments and table parameters, we generate each datapoint $p_i \sim F(\phi_{g_i})$ from a distribution $F$ with the specified table parameters. (For example, $F$ could be a Gaussian, and $\phi_k$ could be a parameter vector specifying the mean and standard deviation).&lt;/li&gt;
&lt;/ul&gt;
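
&lt;p&gt;The three steps above can be sketched in Python as follows. This is an illustration under assumptions of my own: $G_0$ is the standard normal, $F$ is a Gaussian with a fixed standard deviation &lt;code&gt;noise_sd&lt;/code&gt;, and the function names are hypothetical.&lt;/p&gt;

```python
import random

def crp_assignments(num_points, alpha, rng):
    """Table assignments g_1..g_n from a Chinese Restaurant Process."""
    assignments, table_sizes = [], []
    for _ in range(num_points):
        # Existing table k has weight n_k; a brand new table has weight alpha.
        options = range(len(table_sizes) + 1)
        k = rng.choices(options, weights=table_sizes + [alpha])[0]
        if k == len(table_sizes):
            table_sizes.append(0)  # open a new table
        table_sizes[k] += 1
        assignments.append(k)
    return assignments

def crp_mixture(num_points, alpha, rng, noise_sd=0.1):
    groups = crp_assignments(num_points, alpha, rng)
    # phi_k ~ G_0: one mean per occupied table (G_0 assumed standard normal).
    phis = [rng.gauss(0.0, 1.0) for _ in range(max(groups) + 1)]
    # p_i ~ F(phi_{g_i}): F assumed Gaussian with fixed standard deviation.
    points = [rng.gauss(phis[g], noise_sd) for g in groups]
    return groups, phis, points

rng = random.Random(7)  # arbitrary seed
groups, phis, points = crp_mixture(20, 1.0, rng)
print(groups)
```

&lt;p&gt;Note that the sketch only draws parameters for tables that actually got occupied, which is all we need for a finite dataset.&lt;/p&gt;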


&lt;p&gt;In the &lt;strong&gt;Polya Urn Model&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We generate colors $\phi_1, \ldots, \phi_n \sim Polya(G_0, \alpha)$ according to a Polya Urn Model. ($\phi_i$ is the color of the ith ball.)&lt;/li&gt;
&lt;li&gt;Given ball colors, we generate each datapoint $p_i \sim F(\phi_i)$.&lt;/li&gt;
&lt;/ul&gt;
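
&lt;p&gt;Here is the same idea for the urn, again as a hedged Python sketch. The names &lt;code&gt;polya_urn&lt;/code&gt; and &lt;code&gt;base_sampler&lt;/code&gt; are made up for illustration, $G_0$ is assumed to be the standard normal, and $F$ is assumed to be a Gaussian with standard deviation 0.1.&lt;/p&gt;

```python
import random

def polya_urn(num_balls, alpha, base_sampler, rng):
    """Colors phi_1..phi_n from a Polya Urn Model with base distribution G_0."""
    balls = []
    for _ in range(num_balls):
        # Fresh color with probability alpha / (alpha + n); otherwise copy a ball from the urn.
        source = rng.choices(["base", "urn"], weights=[alpha, len(balls)])[0]
        if source == "base":
            balls.append(base_sampler(rng))
        else:
            balls.append(rng.choice(balls))
    return balls

rng = random.Random(1)  # arbitrary seed
phis = polya_urn(50, 5.0, lambda r: r.gauss(0.0, 1.0), rng)
points = [rng.gauss(phi, 0.1) for phi in phis]  # p_i ~ F(phi_i)
print(len(set(phis)), "distinct colors among", len(phis), "balls")
```

&lt;p&gt;Copying a uniformly chosen ball is what gives popular colors their rich-get-richer advantage.&lt;/p&gt;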


&lt;p&gt;In the &lt;strong&gt;Stick-Breaking Process&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We generate group probabilities (stick lengths) $w_1, \ldots, w_{\infty} \sim Stick(\alpha)$ according to a Stick-Breaking process.&lt;/li&gt;
&lt;li&gt;We generate group parameters $\phi_1, \ldots, \phi_{\infty} \sim G_0$ from $G_0$, where $\phi_k$ is the parameter for the kth distinct group.&lt;/li&gt;
&lt;li&gt;We generate group assignments $g_1, \ldots, g_n \sim Multinomial(w_1, \ldots, w_{\infty})$ for each datapoint.&lt;/li&gt;
&lt;li&gt;Given group assignments and group parameters, we generate each datapoint $p_i \sim F(\phi_{g_i})$.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;In the &lt;strong&gt;Dirichlet Process&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We generate a distribution $G \sim DP(G_0, \alpha)$ from a Dirichlet Process with base distribution $G_0$ and dispersion parameter $\alpha$.&lt;/li&gt;
&lt;li&gt;We generate group-level parameters $x_i \sim G$ from $G$, where $x_i$ is the group parameter for the ith datapoint. (Note: this is not the same as $\phi_i$. $x_i$ is the parameter associated to the group that the ith datapoint belongs to, whereas $\phi_k$ is the parameter of the kth distinct group.)&lt;/li&gt;
&lt;li&gt;Given group-level parameters $x_i$, we generate each datapoint $p_i \sim F(x_i)$.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Also, remember that each model naturally allows the number of clusters to grow as more points come in.&lt;/p&gt;

&lt;h1&gt;Inference in the Dirichlet Process Mixture&lt;/h1&gt;

&lt;p&gt;So we&amp;#8217;ve described a generative model that allows us to calculate the probability of any particular set of group assignments to data points, but we haven&amp;#8217;t described how to actually learn a good set of group assignments.&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s briefly do this now. Very roughly, the &lt;strong&gt;Gibbs sampling&lt;/strong&gt; approach works as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take the set of data points, and randomly initialize group assignments.&lt;/li&gt;
&lt;li&gt;Pick a point. Fix the group assignments of all the other points, and assign the chosen point a new group (which can be either an existing cluster or a new cluster) with a CRP-ish probability (as described in the models above) that depends on the group assignments and values of all the other points.&lt;/li&gt;
&lt;li&gt;We will eventually converge on a good set of group assignments, so repeat the previous step until happy.&lt;/li&gt;
&lt;/ul&gt;
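
&lt;p&gt;To make the recipe a little more concrete, here is a toy collapsed Gibbs sweep in Python for one-dimensional data. It is a sketch under strong simplifying assumptions that are mine, not the post&amp;#8217;s: each group is a Gaussian with known variance &lt;code&gt;sigma2&lt;/code&gt;, group means have a normal prior with mean 0 and variance &lt;code&gt;tau2&lt;/code&gt;, and the group means are integrated out analytically. All names are hypothetical.&lt;/p&gt;

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gibbs_sweep(points, groups, alpha, sigma2, tau2, rng):
    """Resample every point's group once, holding all other assignments fixed."""
    for i, x in enumerate(points):
        groups[i] = None  # take point i out of its group
        stats = {}        # group label -> [count, sum] over the other points
        for g, p in zip(groups, points):
            if g is not None:
                entry = stats.setdefault(g, [0, 0.0])
                entry[0] += 1
                entry[1] += p
        labels, weights = [], []
        for g, (n, total) in stats.items():
            post_var = 1.0 / (n / sigma2 + 1.0 / tau2)  # posterior variance of the group mean
            post_mean = post_var * total / sigma2
            labels.append(g)
            # CRP weight n_k times the predictive density of x under group g.
            weights.append(n * normal_pdf(x, post_mean, post_var + sigma2))
        # New group: CRP weight alpha times the prior predictive density of x.
        labels.append(max(stats, default=-1) + 1)
        weights.append(alpha * normal_pdf(x, 0.0, tau2 + sigma2))
        groups[i] = rng.choices(labels, weights=weights)[0]
    return groups

rng = random.Random(0)  # arbitrary seed
points = [rng.gauss(-5.0, 0.5) for _ in range(30)] + [rng.gauss(5.0, 0.5) for _ in range(30)]
groups = [rng.randrange(5) for _ in points]  # random initialization
for _ in range(50):
    gibbs_sweep(points, groups, alpha=1.0, sigma2=0.25, tau2=25.0, rng=rng)
print(len(set(groups)), "groups after 50 sweeps")
```

&lt;p&gt;The &amp;#8220;CRP-ish probability&amp;#8221; shows up as the two kinds of weights: group size times how well the point fits that group, versus alpha times how well it fits a fresh group.&lt;/p&gt;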


&lt;p&gt;For more details, &lt;a href=&quot;http://www.cs.toronto.edu/~radford/ftp/mixmc.pdf&quot;&gt;this paper&lt;/a&gt; provides a good description. Philip Resnick and Eric Hardisty also have a friendlier, more general description of Gibbs sampling (plus an application to naive Bayes) &lt;a href=&quot;http://www.cs.umd.edu/~hardisty/papers/gsfu.pdf&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;Fast Food Application: Clustering the McDonald&amp;#8217;s Menu&lt;/h1&gt;

&lt;p&gt;Finally, let&amp;#8217;s show an application of the Dirichlet Process Mixture. Unfortunately, I didn&amp;#8217;t have a data set of people&amp;#8217;s food habits offhand, so instead I took &lt;a href=&quot;http://nutrition.mcdonalds.com/nutritionexchange/nutritionfacts.pdf&quot;&gt;this list&lt;/a&gt; of McDonald&amp;#8217;s foods and nutrition facts.&lt;/p&gt;

&lt;p&gt;After normalizing each item to have an equal number of calories, and representing each item as a vector of &lt;strong&gt;(total fat, cholesterol, sodium, dietary fiber, sugars, protein, vitamin A, vitamin C, calcium, iron, calories from fat, saturated fat, trans fat, carbohydrates)&lt;/strong&gt;, I ran &lt;a href=&quot;http://scikit-learn.sourceforge.net/dev/index.html&quot;&gt;scikit-learn&lt;/a&gt;&amp;#8217;s &lt;a href=&quot;http://scikit-learn.sourceforge.net/dev/modules/mixture.html&quot;&gt;Dirichlet Process Gaussian Mixture Model&lt;/a&gt; to cluster McDonald&amp;#8217;s menu based on nutritional value.&lt;/p&gt;

&lt;p&gt;First, how does the number of clusters inferred by the Dirichlet Process mixture vary as we feed in more (randomly ordered) points?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/num-clusters-vary.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/num-clusters-vary.png&quot; alt=&quot;Growth of Number of Clusters&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As expected, the Dirichlet Process model discovers more and more clusters as more and more food items arrive. (And indeed, the number of clusters appears to grow logarithmically, which can in fact be proved.)&lt;/p&gt;
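
&lt;p&gt;For intuition about the logarithmic growth, consider the Chinese Restaurant Process prior on its own (this says nothing about the fitted model above): customer $i + 1$ opens a new table with probability $\alpha / (\alpha + i)$, so the expected number of tables is just the sum of those probabilities. A small Python sketch, with a function name of my own choosing:&lt;/p&gt;

```python
import math

def expected_num_clusters(num_points, alpha):
    """Expected number of occupied tables after num_points customers in a CRP."""
    # Customer i + 1 starts a new table with probability alpha / (alpha + i).
    return sum(alpha / (alpha + i) for i in range(num_points))

for n in (10, 100, 1000, 10000):
    print(n, round(expected_num_clusters(n, 1.0), 2), round(math.log(n), 2))
```

&lt;p&gt;With $\alpha = 1$ the sum is the nth harmonic number, which differs from $\log n$ by roughly a constant, so the two printed columns stay a near-constant distance apart.&lt;/p&gt;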

&lt;p&gt;How many clusters does the mixture model infer from the entire dataset? Running the Gibbs sampler several times, we find that the number of clusters tends to be around 11:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/num_mcdonalds_clusters.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/num_mcdonalds_clusters_small.png&quot; alt=&quot;Number of clusters&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s dive into one of these clusterings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster 1 (Desserts)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Looking at a sample of foods from the first cluster, we find a lot of desserts and dessert-y drinks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caramel Mocha&lt;/li&gt;
&lt;li&gt;Frappe Caramel&lt;/li&gt;
&lt;li&gt;Iced Hazelnut Latte&lt;/li&gt;
&lt;li&gt;Iced Coffee&lt;/li&gt;
&lt;li&gt;Strawberry Triple Thick Shake&lt;/li&gt;
&lt;li&gt;Snack Size McFlurry&lt;/li&gt;
&lt;li&gt;Hot Caramel Sundae&lt;/li&gt;
&lt;li&gt;Baked Hot Apple Pie&lt;/li&gt;
&lt;li&gt;Cinnamon Melts&lt;/li&gt;
&lt;li&gt;Kiddie Cone&lt;/li&gt;
&lt;li&gt;Strawberry Sundae&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;We can also look at the nutritional profile of some foods from this cluster (after &lt;a href=&quot;http://en.wikipedia.org/wiki/Standard_score&quot;&gt;z-scaling&lt;/a&gt; each nutrition dimension to have mean 0 and standard deviation 1):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster1.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster1.png&quot; alt=&quot;Cluster 1&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We see that foods in this cluster tend to be high in trans fat and low in vitamins, protein, fiber, and sodium.&lt;/p&gt;
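
&lt;p&gt;In case the z-scaling step is unfamiliar, here is a tiny Python sketch of it for a single nutrition dimension. The sodium values are made up for illustration (they are not taken from the menu), and using the population standard deviation is an assumption on my part.&lt;/p&gt;

```python
import statistics

def z_scale(column):
    """Rescale one nutrition dimension to mean 0 and standard deviation 1."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)  # population standard deviation (an assumption)
    return [(value - mean) / sd for value in column]

sodium_mg = [520.0, 750.0, 150.0, 1100.0]  # made-up values, not real menu data
print([round(z, 2) for z in z_scale(sodium_mg)])
```

&lt;p&gt;After scaling, a positive value means an item is above the menu-wide average on that dimension, which is what makes the profile plots comparable across nutrients.&lt;/p&gt;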

&lt;p&gt;&lt;strong&gt;Cluster 2 (Sauces)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s a sample from the second cluster, which contains a lot of sauces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hot Mustard Sauce&lt;/li&gt;
&lt;li&gt;Spicy Buffalo Sauce&lt;/li&gt;
&lt;li&gt;Newman&amp;#8217;s Own Low Fat Balsamic Vinaigrette&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;And looking at the nutritional profile of points in this cluster, we see that it&amp;#8217;s heavy in sodium and fat:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster2.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster2.png&quot; alt=&quot;Cluster 2&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster 3 (Burgers, Crispy Foods, High-Cholesterol)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The third cluster is very burgery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hamburger&lt;/li&gt;
&lt;li&gt;Cheeseburger&lt;/li&gt;
&lt;li&gt;Filet-O-Fish&lt;/li&gt;
&lt;li&gt;Quarter Pounder with Cheese&lt;/li&gt;
&lt;li&gt;Premium Grilled Chicken Club Sandwich&lt;/li&gt;
&lt;li&gt;Ranch Snack Wrap&lt;/li&gt;
&lt;li&gt;Premium Asian Salad with Crispy Chicken&lt;/li&gt;
&lt;li&gt;Butter Garlic Croutons&lt;/li&gt;
&lt;li&gt;Sausage McMuffin&lt;/li&gt;
&lt;li&gt;Sausage McGriddles&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;It&amp;#8217;s also high in fat and sodium, and low in carbs and sugar:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster3.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster3.png&quot; alt=&quot;Cluster 3&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster 4 (Creamy Sauces)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Interestingly, even though we already found a cluster of sauces above, we discover another one as well. These sauces appear to be much more cream-based:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creamy Ranch Sauce&lt;/li&gt;
&lt;li&gt;Newman&amp;#8217;s Own Creamy Caesar Dressing&lt;/li&gt;
&lt;li&gt;Coffee Cream&lt;/li&gt;
&lt;li&gt;Iced Coffee with Sugar Free Vanilla Syrup&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Nutritionally, these sauces are higher in calories from fat, and much lower in sodium:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster4.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster4.png&quot; alt=&quot;Cluster 4&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster 5 (Salads)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s a salad cluster. A lot of salads also appeared in the third cluster (along with hamburgers and McMuffins), but that&amp;#8217;s because those salads also all contained crispy chicken. The salads in this cluster are either crisp-free or have their chicken grilled instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Premium Southwest Salad with Grilled Chicken&lt;/li&gt;
&lt;li&gt;Premium Caesar Salad with Grilled Chicken&lt;/li&gt;
&lt;li&gt;Side Salad&lt;/li&gt;
&lt;li&gt;Premium Asian Salad without Chicken&lt;/li&gt;
&lt;li&gt;Premium Bacon Ranch Salad without Chicken&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;This is reflected in the higher content of iron, vitamin A, and fiber:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster5.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster5.png&quot; alt=&quot;Cluster 5&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster 6 (More Sauces)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Again, we find another cluster of sauces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ketchup Packet&lt;/li&gt;
&lt;li&gt;Barbeque Sauce&lt;/li&gt;
&lt;li&gt;Chipotle Barbeque Sauce&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;These are still high in sodium, but much lower in fat compared to the other sauce clusters:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster6.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster6.png&quot; alt=&quot;Cluster 6&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster 7 (Fruit and Maple Oatmeal)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amusingly, fruit and maple oatmeal is in a cluster by itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fruit &amp;amp; Maple Oatmeal&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster7.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster7.png&quot; alt=&quot;Cluster 7&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster 8 (Sugary Drinks)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We also get a cluster of sugary drinks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strawberry Banana Smoothie&lt;/li&gt;
&lt;li&gt;Wild Berry Smoothie&lt;/li&gt;
&lt;li&gt;Iced Nonfat Vanilla Latte&lt;/li&gt;
&lt;li&gt;Nonfat Hazelnut&lt;/li&gt;
&lt;li&gt;Nonfat Vanilla Cappuccino&lt;/li&gt;
&lt;li&gt;Nonfat Caramel Cappuccino&lt;/li&gt;
&lt;li&gt;Sweet Tea&lt;/li&gt;
&lt;li&gt;Frozen Strawberry Lemonade&lt;/li&gt;
&lt;li&gt;Coca-Cola&lt;/li&gt;
&lt;li&gt;Minute Maid Orange Juice&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;In addition to high sugar content, this cluster is also high in carbohydrates and calcium, and low in fat.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster8.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster8.png&quot; alt=&quot;Cluster 8&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster 9 (Breakfast Foods)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s a cluster of high-cholesterol breakfast foods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sausage McMuffin with Egg&lt;/li&gt;
&lt;li&gt;Sausage Burrito&lt;/li&gt;
&lt;li&gt;Egg McMuffin&lt;/li&gt;
&lt;li&gt;Bacon, Egg &amp;amp; Cheese Biscuit&lt;/li&gt;
&lt;li&gt;McSkillet Burrito with Sausage&lt;/li&gt;
&lt;li&gt;Big Breakfast with Hotcakes&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster9.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster9.png&quot; alt=&quot;Cluster 9&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster 10 (Coffee Drinks)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We find a group of coffee drinks next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nonfat Cappuccino&lt;/li&gt;
&lt;li&gt;Nonfat Latte&lt;/li&gt;
&lt;li&gt;Nonfat Latte with Sugar Free Vanilla Syrup&lt;/li&gt;
&lt;li&gt;Iced Nonfat Latte&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;These are much higher in calcium and protein, and lower in sugar, than the other drink cluster above:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster11.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster11.png&quot; alt=&quot;Cluster 11&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster 11 (Apples)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s a cluster of apples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apple Dippers with Low Fat Caramel Dip&lt;/li&gt;
&lt;li&gt;Apple Slices&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Vitamin C, check.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster10.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/cluster10.png&quot; alt=&quot;Cluster 10&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, here&amp;#8217;s an overview of all the clusters at once (using a different clustering run):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/all-clusters.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/dirichlet-process/all-clusters-small.png&quot; alt=&quot;All Clusters&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;No More!&lt;/h1&gt;

&lt;p&gt;I&amp;#8217;ll end with a couple notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kevin Knight has a &lt;a href=&quot;http://www.isi.edu/natural-language/people/bayes-with-tears.pdf&quot;&gt;hilarious introduction&lt;/a&gt; to Bayesian inference that describes some applications of nonparametric Bayesian techniques to computational linguistics (though I don&amp;#8217;t think he ever quite says &amp;#8220;nonparametric Bayes&amp;#8221; directly).&lt;/li&gt;
&lt;li&gt;In the Chinese Restaurant Process, each customer sits at a single table. The &lt;a href=&quot;http://en.wikipedia.org/wiki/Chinese_restaurant_process#The_Indian_buffet_process&quot;&gt;Indian Buffet Process&lt;/a&gt; is an extension that allows customers to sample food from multiple tables (i.e., belong to multiple clusters).&lt;/li&gt;
&lt;li&gt;The Chinese Restaurant Process, the Polya Urn Model, and the Stick-Breaking Process are all &lt;em&gt;sequential&lt;/em&gt; models for generating groups: to figure out table parameters in the CRP, for example, you wait for customer 1 to come in, then customer 2, then customer 3, and so on. The equivalent Dirichlet Process, on the other hand, is a &lt;em&gt;parallel&lt;/em&gt; model for generating groups: just sample $G \sim DP(G_0, \alpha)$, and then all your group parameters can be independently generated by sampling from $G$ at once. This duality is an instance of a more general phenomenon known as &lt;a href=&quot;http://en.wikipedia.org/wiki/De_Finetti&#8217;s_theorem&quot;&gt;de Finetti&amp;#8217;s theorem&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;And that&amp;#8217;s it.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Instant interactive visualization with d3 + ggplot2</title>
    <link href="http://blog.echen.me/2012/03/05/instant-interactive-visualization-with-d3-and-ggplot2/"/>
    <updated>2012-03-05T09:37:00-08:00</updated>
    <id>http://blog.echen.me/2012/03/05/instant-interactive-visualization-with-d3-and-ggplot2</id>
    <content type="html">&lt;p&gt;It&amp;#8217;s often easier to understand a chart than a table. So why is it still so hard to make a simple data graphic, and why am I still bombarded by mind-numbing reams of raw &lt;em&gt;numbers&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;(Yeah, I love &lt;a href=&quot;http://blog.echen.me/2012/01/17/quick-introduction-to-ggplot2/&quot;&gt;ggplot2&lt;/a&gt; to death. But sometimes I want a little more interaction, and sometimes all I want is to drag-and-drop and be done.)&lt;/p&gt;

&lt;p&gt;So I&amp;#8217;ve been experimenting with &lt;a href=&quot;http://minifolds.herokuapp.com/graphs/1?x=health&amp;amp;y=speed&amp;amp;size=intelligence&amp;amp;color=age&amp;amp;group=height&quot;&gt;a small, ggplot2-inspired d3 app&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Simply drop a file, and bam! Instant scatterplot:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://minifolds.herokuapp.com/graphs/1?x=health&amp;amp;y=speed&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/minifolds/swiss-roll-bw.png&quot; alt=&quot;Swiss Roll B&amp;amp;W&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But wait &amp;#8211; that&amp;#8217;s only 2 dimensions. You can add some more through color, size, and groups:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://minifolds.herokuapp.com/graphs/1?x=health&amp;amp;y=speed&amp;amp;size=intelligence&amp;amp;color=age&amp;amp;group=height&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/minifolds/swiss-roll-edit.png&quot; alt=&quot;Swiss Roll Edit&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Click &lt;a href=&quot;http://minifolds.herokuapp.com/graphs/1?x=health&amp;amp;y=speed&amp;amp;size=intelligence&amp;amp;color=age&amp;amp;group=height&quot;&gt;here&lt;/a&gt; to play with the data yourself.)&lt;/p&gt;

&lt;p&gt;And you can easily switch which variables are getting plotted, and see all the information associated with each point.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://minifolds.herokuapp.com/graphs/1?x=weight&amp;amp;y=speed&amp;amp;size=health&amp;amp;color=age&amp;amp;group=height&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/minifolds/swiss-roll-pivot.png&quot; alt=&quot;Swiss Roll Pivot&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Same dataset, different aesthetic assignments.)&lt;/p&gt;

&lt;p&gt;I&amp;#8217;m thinking of adding more kinds of charts, support for categorical variables, more interactivity (sliders to interact with other dimensions?!), and making the UI even easier (e.g., simplify column naming). In the meantime, the code is &lt;a href=&quot;https://github.com/echen/minifolds&quot;&gt;here&lt;/a&gt; on GitHub, and tips and suggestions are welcome!&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Movie recommendations and more via MapReduce and Scalding</title>
    <link href="http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/"/>
    <updated>2012-02-09T03:20:00-08:00</updated>
    <id>http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding</id>
    <content type="html">&lt;p&gt;&lt;em&gt;Scalding is an in-house MapReduce framework that Twitter recently open-sourced. Like &lt;a href=&quot;http://pig.apache.org/&quot;&gt;Pig&lt;/a&gt;, it provides an abstraction on top of MapReduce that makes it easy to write big data jobs in a syntax that&amp;#8217;s simple and concise. Unlike Pig, Scalding is written in pure Scala &amp;#8211; which means all the power of Scala and the JVM is already built-in. No more UDFs, folks!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is going to be an in-your-face introduction to &lt;a href=&quot;https://github.com/twitter/scalding&quot;&gt;Scalding&lt;/a&gt;, Twitter&amp;#8217;s (Scala + Cascading) MapReduce framework.&lt;/p&gt;

&lt;p&gt;In 140: instead of forcing you to write raw &lt;code&gt;map&lt;/code&gt; and &lt;code&gt;reduce&lt;/code&gt; functions, Scalding allows you to write &lt;em&gt;natural&lt;/em&gt; code like&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// Create a histogram of tweet lengths.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;n&quot;&gt;tweets&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;tweet&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tweet&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tweet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;length&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;Not much different from the Ruby you&amp;#8217;d write to compute tweet distributions over &lt;em&gt;small&lt;/em&gt; data? &lt;strong&gt;Exactly.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two notes before we begin:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/echen/scaldingale&quot;&gt;This GitHub repository&lt;/a&gt; contains all the code used.&lt;/li&gt;
&lt;li&gt;For a gentler introduction to Scalding, see &lt;a href=&quot;https://github.com/twitter/scalding/wiki/Getting-Started&quot;&gt;this Getting Started guide&lt;/a&gt; on the Scalding wiki.&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;Movie Similarities&lt;/h1&gt;

&lt;p&gt;Imagine you run an online movie business, and you want to generate movie recommendations. You have a rating system (people can rate movies with 1 to 5 stars), and we&amp;#8217;ll assume for simplicity that all of the ratings are stored in a TSV file somewhere.&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s start by reading the ratings into a Scalding job.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Input &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/scaldingale/blob/master/MovieSimilarities.scala&#8217;&gt;MovieSimilarities.scala&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;19&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;20&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;21&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;22&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;23&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * The input is a TSV file with three columns: (user, movie, rating).&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;INPUT_FILENAME&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;data/ratings.tsv&amp;quot;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Read in the input and give each field a type and name.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratings&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Tsv&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;INPUT_FILENAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Let&amp;#39;s also keep track of the total number of people who rated each movie.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numRaters&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;ratings&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;c1&quot;&gt;// Put the number of people who rated each movie into a field called &amp;quot;numRaters&amp;quot;.    &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rename&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;numRaters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// Merge `ratings` with `numRaters`, by joining on their movie fields.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingsWithSize&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;ratings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;joinWithSmaller&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numRaters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// ratingsWithSize now contains the following fields: (user, movie, rating, numRaters).&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;You want to calculate how similar pairs of movies are, so that if someone watches &lt;em&gt;The Lion King&lt;/em&gt;, you can recommend films like &lt;em&gt;Toy Story&lt;/em&gt;. So how should you define the similarity between two movies?&lt;/p&gt;

&lt;p&gt;One way is to use their &lt;strong&gt;correlation&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For every pair of movies A and B, find all the people who rated both A and B.&lt;/li&gt;
&lt;li&gt;Use these ratings to form a Movie A vector and a Movie B vector.&lt;/li&gt;
&lt;li&gt;Calculate the correlation between these two vectors.&lt;/li&gt;
&lt;li&gt;Whenever someone watches a movie, you can then recommend the movies most correlated with it.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Let&amp;#8217;s start with the first two steps.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Find rating pairs &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/scaldingale/blob/master/MovieSimilarities.scala&#8217;&gt;MovieSimilarities.scala&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;19&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;20&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * To get all pairs of co-rated movies, we&amp;#39;ll join `ratings` against itself.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * So first make a dummy copy of the ratings that we can join against.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratings2&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;ratingsWithSize&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rename&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;numRaters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;numRaters2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Now find all pairs of co-rated movies (pairs of movies that the same user has rated) by&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * joining the duplicate rating streams on their user fields.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingPairs&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;ratingsWithSize&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;joinWithSmaller&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratings2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;c1&quot;&gt;// De-dupe so that we don&amp;#39;t calculate similarity of both (A, B) and (B, A).&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;movies&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;movies&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;movies&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;project&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;numRaters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;numRaters2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// By grouping on (&amp;#39;movie, &amp;#39;movie2), we can now get all the people who rated any pair of movies.&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;Before using these rating pairs to calculate correlation, let&amp;#8217;s stop for a bit.&lt;/p&gt;

&lt;p&gt;Since we&amp;#8217;re explicitly thinking of movies as &lt;strong&gt;vectors&lt;/strong&gt; of ratings, it&amp;#8217;s natural to compute some very vector-y things like norms and dot products, as well as the number of elements in each vector and the sum over those elements. So let&amp;#8217;s compute these:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Vector calculations &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/scaldingale/blob/master/MovieSimilarities.scala&#8217;&gt;MovieSimilarities.scala&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;19&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;20&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;21&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; * Compute dot products, norms, sums, and sizes of the rating vectors.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt; */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vectorCalcs&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;ratingPairs&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;c1&quot;&gt;// Compute (x*y, x^2, y^2), which we need for dot products and norms.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;ratingProd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;ratingSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating2Sq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;n&quot;&gt;ratings&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ratings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ratings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ratings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;group&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;n&quot;&gt;group&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// length of each vector&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;ratingProd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;dotProduct&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;ratingSum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating2Sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;ratingSq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;ratingNormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating2Sq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating2NormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;numRaters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// Just an easy way to make sure the numRaters field stays.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;numRaters2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;                
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;c1&quot;&gt;// All of these operations chain together like in a builder object.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;To summarize, each row in &lt;code&gt;vectorCalcs&lt;/code&gt; now contains the following fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;movie, movie2&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;numRaters, numRaters2&lt;/strong&gt;: the total number of people who rated each movie&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;size&lt;/strong&gt;: the number of people who rated both movie and movie2&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;dotProduct&lt;/strong&gt;: dot product between the movie vector (a vector of ratings) and the movie2 vector (also a vector of ratings)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ratingSum, rating2Sum&lt;/strong&gt;: sum over all elements in each ratings vector&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ratingNormSq, rating2NormSq&lt;/strong&gt;: squared norm of each vector&lt;/li&gt;
&lt;/ul&gt;
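&lt;p&gt;To make these fields concrete, here is a minimal sketch in plain Scala (no Scalding) that computes them for a single made-up pair of movies. The object name &lt;code&gt;VectorCalcsSketch&lt;/code&gt; and the ratings are hypothetical; only the field names come from &lt;code&gt;vectorCalcs&lt;/code&gt;. The &lt;code&gt;numRaters&lt;/code&gt; fields are left out, since they depend on the whole dataset rather than on one pair.&lt;/p&gt;

&lt;figure class=&quot;code&quot;&gt;&lt;figcaption&gt;&lt;span&gt;Toy vector calculations (illustrative sketch)&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;object VectorCalcsSketch {
  // Hypothetical ratings from three people who rated both movies.
  val x = Seq(1.0, 2.0, 3.0) // their ratings of movie
  val y = Seq(2.0, 4.0, 5.0) // their ratings of movie2

  val size          = x.size                                        // number of co-raters
  val dotProduct    = x.zip(y).map { case (a, b) =&amp;gt; a * b }.sum  // sum of x*y
  val ratingSum     = x.sum                                         // sum of x
  val rating2Sum    = y.sum                                         // sum of y
  val ratingNormSq  = x.map(a =&amp;gt; a * a).sum                      // sum of x^2
  val rating2NormSq = y.map(b =&amp;gt; b * b).sum                      // sum of y^2

  def main(args: Array[String]): Unit =
    println((size, dotProduct, ratingSum, rating2Sum, ratingNormSq, rating2NormSq))
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/figure&gt;

&lt;p&gt;Running it should print &lt;code&gt;(3,25.0,6.0,11.0,14.0,45.0)&lt;/code&gt;.&lt;/p&gt;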


&lt;p&gt;So let&amp;#8217;s go back to calculating the correlation between movie and movie2. We could, of course, calculate correlation in the standard way: find the covariance between the movie and movie2 ratings, and divide by the product of their standard deviations.&lt;/p&gt;

&lt;p&gt;But recall that we can also write correlation in the following form:&lt;/p&gt;

&lt;p&gt;$Corr(X, Y) = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - (\sum x)^2} \sqrt{n \sum y^2 - (\sum y)^2}}$&lt;/p&gt;

&lt;p&gt;(See the &lt;a href=&quot;http://en.wikipedia.org/wiki/Correlation_and_dependence&quot;&gt;Wikipedia page&lt;/a&gt; on correlation.)&lt;/p&gt;
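
&lt;p&gt;As a quick sketch of where this form comes from (assuming the population definitions of covariance and variance, taken over the $n$ people who rated both movies), write $\bar{x} = \frac{1}{n} \sum x$ and $\bar{y} = \frac{1}{n} \sum y$. Then:&lt;/p&gt;

&lt;p&gt;$Cov(X, Y) = \frac{1}{n} \sum (x - \bar{x})(y - \bar{y}) = \frac{1}{n} \sum xy - \bar{x} \bar{y} = \frac{n \sum xy - \sum x \sum y}{n^2}$&lt;/p&gt;

&lt;p&gt;$Var(X) = \frac{1}{n} \sum (x - \bar{x})^2 = \frac{1}{n} \sum x^2 - \bar{x}^2 = \frac{n \sum x^2 - (\sum x)^2}{n^2}$&lt;/p&gt;

&lt;p&gt;The same holds for $Var(Y)$ with $y$ in place of $x$. Dividing $Cov(X, Y)$ by $\sqrt{Var(X)} \sqrt{Var(Y)}$ cancels the factors of $n^2$ and leaves the form above.&lt;/p&gt;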

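&lt;p&gt;Notice that every term in this formula is a field in &lt;code&gt;vectorCalcs&lt;/code&gt;! Here $n$ is &lt;code&gt;size&lt;/code&gt;, $\sum xy$ is &lt;code&gt;dotProduct&lt;/code&gt;, $\sum x$ and $\sum y$ are &lt;code&gt;ratingSum&lt;/code&gt; and &lt;code&gt;rating2Sum&lt;/code&gt;, and $\sum x^2$ and $\sum y^2$ are &lt;code&gt;ratingNormSq&lt;/code&gt; and &lt;code&gt;rating2NormSq&lt;/code&gt;. So instead of the standard calculation, we can use this form:&lt;/p&gt;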

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Correlation &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/scaldingale/blob/master/MovieSimilarities.scala&#8217;&gt;MovieSimilarities.scala&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correlations&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;vectorCalcs&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;dotProduct&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;ratingSum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating2Sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;ratingNormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating2NormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;correlation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;n&quot;&gt;fields&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;n&quot;&gt;correlation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fields&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fields&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fields&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fields&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_4&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fields&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_5&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fields&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_6&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correlation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dotProduct&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingSum&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;rating2Sum&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingNormSq&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2NormSq&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numerator&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dotProduct&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingSum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2Sum&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;denominator&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingNormSq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingSum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingSum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2NormSq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2Sum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2Sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;numerator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;denominator&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
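
&lt;p&gt;As a quick sanity check, here&amp;#8217;s a minimal sketch that computes the same statistics for two small in-memory rating vectors and plugs them into the correlation formula. The &lt;code&gt;correlationOf&lt;/code&gt; helper and the sample ratings are made up for illustration; they aren&amp;#8217;t part of the Scalding job.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;// Hypothetical helper (not in the original job): Pearson correlation of two
// equal-length rating vectors, using the same sufficient statistics as the job.
def correlationOf(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.size == ys.size, "vectors must have the same length")
  val size = xs.size.toDouble
  val dotProduct = xs.zip(ys).map { case (x, y) =&amp;gt; x * y }.sum
  val ratingSum = xs.sum
  val rating2Sum = ys.sum
  val ratingNormSq = xs.map(x =&amp;gt; x * x).sum
  val rating2NormSq = ys.map(y =&amp;gt; y * y).sum

  val numerator = size * dotProduct - ratingSum * rating2Sum
  val denominator =
    math.sqrt(size * ratingNormSq - ratingSum * ratingSum) *
      math.sqrt(size * rating2NormSq - rating2Sum * rating2Sum)
  numerator / denominator
}

// Made-up ratings from three users who rated both items.
// Ratings that rise together give a correlation of approximately 1.0 ...
val perfectlyAligned = correlationOf(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0))
// ... and ratings that move in opposite directions give approximately -1.0.
val perfectlyOpposed = correlationOf(Seq(1.0, 2.0, 3.0), Seq(3.0, 2.0, 1.0))
&lt;/code&gt;&lt;/pre&gt;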


&lt;p&gt;And that&amp;#8217;s it! To see the full code, check out the GitHub repository &lt;a href=&quot;https://github.com/echen/scaldingale&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;Book Similarities&lt;/h1&gt;

&lt;p&gt;Let&amp;#8217;s run this code over some real data. Unfortunately, I didn&amp;#8217;t have a clean source of movie ratings available, so instead I used &lt;a href=&quot;http://www.informatik.uni-freiburg.de/~cziegler/BX/&quot;&gt;this dataset&lt;/a&gt; of 1 million book ratings.&lt;/p&gt;

&lt;p&gt;I ran a quick command, using the handy &lt;a href=&quot;https://github.com/twitter/scalding/wiki/Scald.rb&quot;&gt;scald.rb script&lt;/a&gt; that Scalding provides&amp;#8230;&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;bash&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c&quot;&gt;# Send the job off to a Hadoop cluster&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;scald.rb MovieSimilarities.scala --input ratings.tsv --output similarities.tsv
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;&amp;#8230;and here&amp;#8217;s a sample of the top output I got:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/top-book-crossing-sims-correlation.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/top-book-crossing-sims-correlation.png&quot; alt=&quot;Top Book-Crossing Pairs&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we&amp;#8217;d expect, we see that&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Harry Potter&lt;/em&gt; books are similar to other &lt;em&gt;Harry Potter&lt;/em&gt; books&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Lord of the Rings&lt;/em&gt; books are similar to other &lt;em&gt;Lord of the Rings&lt;/em&gt; books&lt;/li&gt;
&lt;li&gt;Tom Clancy is similar to John Grisham&lt;/li&gt;
&lt;li&gt;Chick lit (&lt;em&gt;Summer Sisters&lt;/em&gt;, by Judy Blume) is similar to chick lit (&lt;em&gt;Bridget Jones&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Just for fun, let&amp;#8217;s also look at books similar to &lt;em&gt;The Great Gatsby&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/great-gatsby-correlation.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/great-gatsby-correlation.png&quot; alt=&quot;Great Gatsby&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(Schoolboy memories, exactly.)&lt;/p&gt;

&lt;h1&gt;More Similarity Measures&lt;/h1&gt;

&lt;p&gt;Of course, there are lots of other similarity measures we could use besides correlation.&lt;/p&gt;

&lt;h2&gt;Cosine Similarity&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Cosine_similarity&quot;&gt;Cosine similarity&lt;/a&gt; is another common vector-based similarity measure.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Cosine Similarity &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/scaldingale/blob/master/MovieSimilarities.scala&#8217;&gt;MovieSimilarities.scala&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cosineSimilarity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dotProduct&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingNorm&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2Norm&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;dotProduct&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ratingNorm&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2Norm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
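
&lt;p&gt;Note that this function expects the norms themselves, not the squared norms. As a rough illustration, here&amp;#8217;s a sketch that calls it on two made-up rating vectors (the sample values are assumptions, not data from the job):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;def cosineSimilarity(dotProduct: Double, ratingNorm: Double, rating2Norm: Double): Double =
  dotProduct / (ratingNorm * rating2Norm)

// Made-up ratings: the second item's ratings are exactly double the first's.
val xs = Seq(1.0, 2.0, 3.0)
val ys = Seq(2.0, 4.0, 6.0)

val dotProduct = xs.zip(ys).map { case (x, y) =&amp;gt; x * y }.sum
// Take square roots here, since the function wants norms rather than squared norms.
val ratingNorm = math.sqrt(xs.map(x =&amp;gt; x * x).sum)
val rating2Norm = math.sqrt(ys.map(y =&amp;gt; y * y).sum)

// Parallel vectors point in the same direction, so this is approximately 1.0.
val similarity = cosineSimilarity(dotProduct, ratingNorm, rating2Norm)
&lt;/code&gt;&lt;/pre&gt;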


&lt;h2&gt;Correlation, Take II&lt;/h2&gt;

&lt;p&gt;We can also add a &lt;em&gt;regularized&lt;/em&gt; correlation, by (say) adding N virtual movie pairs that have zero correlation. This helps avoid noise when a movie pair has very few raters in common (for example, &lt;em&gt;The Great Gatsby&lt;/em&gt; had an unlikely raw correlation of 1 with many other books, simply because those book pairs had very few ratings in common).&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Regularized Correlation &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/scaldingale/blob/master/MovieSimilarities.scala&#8217;&gt;MovieSimilarities.scala&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;regularizedCorrelation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dotProduct&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingSum&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;rating2Sum&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingNormSq&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2NormSq&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;virtualCount&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;priorCorrelation&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unregularizedCorrelation&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correlation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dotProduct&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingSum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2Sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingNormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2NormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;virtualCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unregularizedCorrelation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;priorCorrelation&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
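
&lt;p&gt;To get a feel for how much the prior matters, here&amp;#8217;s a small sketch of just the shrinkage step. The &lt;code&gt;shrink&lt;/code&gt; helper and the rater counts are hypothetical, chosen only to show the effect of a virtual count of 10 and a prior correlation of 0:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;// Hypothetical helper isolating the weighting step of regularizedCorrelation.
def shrink(rawCorrelation: Double, size: Double, virtualCount: Double, priorCorrelation: Double): Double = {
  val w = size / (size + virtualCount)
  w * rawCorrelation + (1 - w) * priorCorrelation
}

// A raw correlation of 1.0 backed by only 2 common raters gets weight 2 / 12,
// so it is pulled most of the way toward the prior (about 0.17).
val fewRaters = shrink(1.0, size = 2, virtualCount = 10, priorCorrelation = 0)

// The same raw correlation backed by 990 common raters gets weight 990 / 1000,
// so it barely moves (about 0.99).
val manyRaters = shrink(1.0, size = 990, virtualCount = 10, priorCorrelation = 0)
&lt;/code&gt;&lt;/pre&gt;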


&lt;h2&gt;Jaccard Similarity&lt;/h2&gt;

&lt;p&gt;Recall that &lt;a href=&quot;http://blog.echen.me/blog/2011/10/24/winning-the-netflix-prize-a-summary/&quot;&gt;one of the lessons of the Netflix prize&lt;/a&gt; was that implicit data can be quite useful &amp;#8211; the mere fact that you rate a James Bond movie, even if you rate it quite horribly, suggests that you&amp;#8217;d probably be interested in similar action films. So we can also ignore the value itself of each rating and use a &lt;em&gt;set&lt;/em&gt;-based similarity measure like &lt;a href=&quot;http://en.wikipedia.org/wiki/Jaccard_index&quot;&gt;Jaccard similarity&lt;/a&gt;.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Jaccard Similarity &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/scaldingale/blob/master/MovieSimilarities.scala&#8217;&gt;MovieSimilarities.scala&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jaccardSimilarity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;usersInCommon&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;totalUsers1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;totalUsers2&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;union&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;totalUsers1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;totalUsers2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;usersInCommon&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;usersInCommon&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;union&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
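
&lt;p&gt;For a concrete (made-up) example of the arithmetic, suppose 50 users rated the first book, 70 rated the second, and 20 rated both:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;def jaccardSimilarity(usersInCommon: Double, totalUsers1: Double, totalUsers2: Double): Double = {
  val union = totalUsers1 + totalUsers2 - usersInCommon
  usersInCommon / union
}

// Hypothetical counts, not taken from the Book-Crossing data.
// Union = 50 + 70 - 20 = 100 distinct raters, so the similarity is 20 / 100 = 0.2.
val similarity = jaccardSimilarity(usersInCommon = 20, totalUsers1 = 50, totalUsers2 = 70)
&lt;/code&gt;&lt;/pre&gt;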


&lt;h2&gt;Incorporation&lt;/h2&gt;

&lt;p&gt;Finally, let&amp;#8217;s add all these similarity measures to our output.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Similarity Measures &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/scaldingale/blob/master/MovieSimilarities.scala&#8217;&gt;MovieSimilarities.scala&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;19&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PRIOR_COUNT&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PRIOR_CORRELATION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;similarities&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;n&quot;&gt;vectorCalcs&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;dotProduct&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;ratingSum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating2Sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;ratingNormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating2NormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;numRaters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;numRaters2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;correlation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;regularizedCorrelation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;cosineSimilarity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;jaccardSimilarity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;n&quot;&gt;fields&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dotProduct&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingSum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2Sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingNormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2NormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numRaters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numRaters2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fields&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;corr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correlation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dotProduct&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingSum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2Sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingNormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2NormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;regCorr&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;regularizedCorrelation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dotProduct&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingSum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2Sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingNormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rating2NormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PRIOR_COUNT&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PRIOR_CORRELATION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cosSim&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cosineSimilarity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dotProduct&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ratingNormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rating2NormSq&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jaccard&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jaccardSimilarity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numRaters&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numRaters2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;corr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;regCorr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cosSim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jaccard&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;h1&gt;Book Similarities Revisited&lt;/h1&gt;

&lt;p&gt;Let&amp;#8217;s take another look at the book similarities above, now that we have these new fields.&lt;/p&gt;

&lt;p&gt;Here are some of the top Book-Crossing pairs, sorted by their regularized (shrunk) correlation:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/top-book-crossing-sims.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/top-book-crossing-sims.png&quot; alt=&quot;Top Book-Crossing Pairs&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice how regularization affects things: the &lt;em&gt;Dark Tower&lt;/em&gt; pair has a pretty high raw correlation, but relatively few ratings (reducing our confidence in the raw correlation), so it ends up below the others.&lt;/p&gt;

&lt;p&gt;And here are books similar to &lt;em&gt;The Great Gatsby&lt;/em&gt;, this time ordered by cosine similarity:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/great-gatsby.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/great-gatsby.png&quot; alt=&quot;Great Gatsby&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Input Abstraction&lt;/h1&gt;

&lt;p&gt;So our code right now is tied to our specific &lt;code&gt;ratings.tsv&lt;/code&gt; input. But what if we change the way we store our ratings, or what if we want to generate similarities for something entirely different?&lt;/p&gt;

&lt;p&gt;Let&amp;#8217;s abstract away our input. We&amp;#8217;ll create a &lt;a href=&quot;https://github.com/echen/scaldingale/blob/master/VectorSimilarities.scala&quot;&gt;VectorSimilarities class&lt;/a&gt; that represents input data in the following format:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Input abstraction &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/scaldingale/blob/master/VectorSimilarities.scala&#8217;&gt;VectorSimilarities.scala&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// This is an abstract method that returns a Pipe (aka, a stream of rating tuples).&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// It takes in three symbols that name the user, item, and rating fields.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;itemField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Pipe&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratings&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;item&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// &#8230;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;// The rest of the code remains essentially the same.&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;Whenever we want to define a new input format, we simply subclass &lt;code&gt;VectorSimilarities&lt;/code&gt; and provide a concrete implementation of the &lt;code&gt;input&lt;/code&gt; method.&lt;/p&gt;

&lt;h2&gt;Book-Crossings&lt;/h2&gt;

&lt;p&gt;For example, here&amp;#8217;s a class I could have used to generate the book recommendations above:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;BookCrossing similarities &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/scaldingale/blob/master/BookCrossing.scala&#8217;&gt;BookCrossing.scala&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;BookCrossing&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;args&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;VectorSimilarities&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;itemField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Pipe&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bookCrossingRatings&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;nc&quot;&gt;Tsv&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;book-crossing-ratings.tsv&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mapTo&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;itemField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fields&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fields&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;n&quot;&gt;bookCrossingRatings&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;The input method simply reads from a TSV file and lets the &lt;code&gt;VectorSimilarities&lt;/code&gt; superclass do all the work. Instant recommendations, BOOM.&lt;/p&gt;

&lt;h2&gt;Song Similarities with Twitter + iTunes&lt;/h2&gt;

&lt;p&gt;But why limit ourselves to books? We do, after all, have Twitter at our fingertips&amp;#8230;&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p&gt;rated Born This Way by Lady GaGa 5 stars &lt;a href=&quot;http://t.co/wTYAwWqm&quot; title=&quot;http://itun.es/iSg92N&quot;&gt;itun.es/iSg92N&lt;/a&gt; &lt;a href=&quot;https://twitter.com/search/%2523iTunes&quot;&gt;#iTunes&lt;/a&gt;&lt;/p&gt;&amp;mdash; gggf (@GalMusic92) &lt;a href=&quot;https://twitter.com/GalMusic92/status/167267017865428996&quot; data-datetime=&quot;2012-02-08T15:22:19+00:00&quot;&gt;February 8, 2012&lt;/a&gt;&lt;/blockquote&gt;


&lt;script src=&quot;http://blog.echen.me//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;


&lt;p&gt;Since iTunes lets you send a tweet whenever you rate a song, we can use these to generate music recommendations!&lt;/p&gt;

&lt;p&gt;Again, we create a new class that overrides the abstract &lt;code&gt;input&lt;/code&gt; defined in &lt;code&gt;VectorSimilarities&lt;/code&gt;&amp;#8230;&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Song similarities with Twitter + iTunes &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/scaldingale/blob/master/ITunes.scala&#8217;&gt;ITunes.scala&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;19&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;20&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;21&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;22&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ITunes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;args&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;VectorSimilarities&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;c1&quot;&gt;// Example tweet:&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;c1&quot;&gt;// rated New Kids On the Block: Super Hits by New Kids On the Block 5 stars http://itun.es/iSg3Fc #iTunes&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ITUNES_REGEX&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;&amp;quot;&amp;quot;rated (.+?) by (.+?) (\d) stars .*? #iTunes&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;itemField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Pipe&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;itunesRatings&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;// This is a Twitter-internal tweet source, but you could just as easily scrape &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;c1&quot;&gt;// Twitter yourself and provide your own source of tweets: https://dev.twitter.com/docs&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;nc&quot;&gt;TweetSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mapTo&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;userId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getUserId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getText&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;contains&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;#iTunes&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flatMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;song&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;artist&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;          &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;          &lt;span class=&quot;nc&quot;&gt;ITUNES_REGEX&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;findFirstMatchIn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subgroups&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rename&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;userId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;song&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;itemField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;project&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;itemField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;n&quot;&gt;itunesRatings&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;&amp;#8230;and snap! Here are some songs you might like if you recently listened to &lt;strong&gt;Beyoncé&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/beyonce.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/beyonce.png&quot; alt=&quot;Beyoncé&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And some recommended songs if you like &lt;strong&gt;Lady Gaga&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/lady-gaga.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/lady-gaga.png&quot; alt=&quot;Lady Gaga&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GG Pandora.&lt;/p&gt;
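
&lt;p&gt;To see what the regex in the &lt;code&gt;ITunes&lt;/code&gt; class actually pulls out of a tweet, here&amp;#8217;s a small standalone sketch in plain Scala (no Scalding or tweet source needed). The &lt;code&gt;ITunesRegexSketch&lt;/code&gt; object and its &lt;code&gt;parse&lt;/code&gt; helper are hypothetical names for illustration; the pattern is the same one as &lt;code&gt;ITUNES_REGEX&lt;/code&gt; above, and the first sample is the tweet shown earlier.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;object ITunesRegexSketch {
  // Same pattern as ITUNES_REGEX in the ITunes class.
  val ItunesRegex = &amp;quot;&amp;quot;&amp;quot;rated (.+?) by (.+?) (\d) stars .*? #iTunes&amp;quot;&amp;quot;&amp;quot;.r

  // Returns Some((song, artist, rating)) for a rating tweet, None otherwise.
  def parse(text: String): Option[(String, String, Int)] =
    ItunesRegex.findFirstMatchIn(text).map { m =&amp;gt;
      (m.group(1), m.group(2), m.group(3).toInt)
    }

  def main(args: Array[String]): Unit = {
    // prints Some((Born This Way,Lady GaGa,5))
    println(parse(&amp;quot;rated Born This Way by Lady GaGa 5 stars http://itun.es/iSg92N #iTunes&amp;quot;))
    // prints None
    println(parse(&amp;quot;just listening to some music&amp;quot;))
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;One caveat with this pattern: the lazy &lt;code&gt;(.+?)&lt;/code&gt; splits on the &lt;em&gt;first&lt;/em&gt; &amp;#8220; by &amp;#8221;, so a title that itself contains &amp;#8220; by &amp;#8221; gets cut short and the rest lands in the artist field.&lt;/p&gt;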

&lt;h2&gt;Location Similarities with Foursquare Check-ins&lt;/h2&gt;

&lt;p&gt;But what if we don&amp;#8217;t have explicit ratings? For example, we could be a news site that wants to generate article recommendations, and maybe all we have is which users &lt;em&gt;visited&lt;/em&gt; each story.&lt;/p&gt;

&lt;p&gt;Or what if we want to generate restaurant or tourist recommendations, when all we know is who visits each location?&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p&gt;I&amp;#8217;m at Empire State Building (350 5th Ave., btwn 33rd &amp;amp; 34th St., New York) &lt;a href=&quot;http://t.co/q6tXzf3n&quot; title=&quot;http://4sq.com/zZ5xGd&quot;&gt;4sq.com/zZ5xGd&lt;/a&gt;&lt;/p&gt;&amp;mdash; Simon Ackerman (@SimonAckerman) &lt;a href=&quot;https://twitter.com/SimonAckerman/status/167232054247956481&quot; data-datetime=&quot;2012-02-08T13:03:23+00:00&quot;&gt;February 8, 2012&lt;/a&gt;&lt;/blockquote&gt;




&lt;p&gt;Let&amp;#8217;s finally make Foursquare check-ins useful. (I kid, I kid.)&lt;/p&gt;

&lt;p&gt;Instead of using an explicit rating given to us, we can simply generate a dummy rating of 1 for each check-in. Correlation doesn&amp;#8217;t make sense any more (every rating is identical, so there&amp;#8217;s no variance to correlate), but we can still pay attention to a measure like Jaccard similarity.&lt;/p&gt;
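
&lt;p&gt;For concreteness, here&amp;#8217;s a minimal sketch of Jaccard similarity on check-in data, in plain Scala rather than Scalding. The &lt;code&gt;JaccardSketch&lt;/code&gt; object, the &lt;code&gt;jaccard&lt;/code&gt; helper, and the visitor sets are all made up for illustration; they aren&amp;#8217;t part of &lt;code&gt;VectorSimilarities&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;object JaccardSketch {
  // Hypothetical check-in data: location -&amp;gt; ids of users who checked in there.
  val visitors: Map[String, Set[Long]] = Map(
    &amp;quot;Empire State Building&amp;quot; -&amp;gt; Set(1L, 2L, 3L, 4L),
    &amp;quot;Statue of Liberty&amp;quot;     -&amp;gt; Set(3L, 4L, 5L)
  )

  // Jaccard similarity: size of the intersection divided by size of the union.
  // Defined here as 0 when both sets are empty.
  def jaccard[A](a: Set[A], b: Set[A]): Double = {
    val unionSize = (a union b).size
    if (unionSize == 0) 0.0 else (a intersect b).size.toDouble / unionSize
  }

  def main(args: Array[String]): Unit = {
    // 2 shared visitors out of 5 distinct visitors, so this prints 0.4
    println(jaccard(visitors(&amp;quot;Empire State Building&amp;quot;), visitors(&amp;quot;Statue of Liberty&amp;quot;)))
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With a dummy rating of 1, each location is effectively just the set of users who checked in there, which is all Jaccard needs.&lt;/p&gt;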

&lt;p&gt;So we simply create a new class that scrapes tweets for Foursquare check-in information&amp;#8230;&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Location similarities with Foursquare &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/scaldingale/blob/master/Foursquare.scala&#8217;&gt;Foursquare.scala&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;19&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Foursquare&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;args&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;VectorSimilarities&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;c1&quot;&gt;// Example tweet: I&amp;#39;m at The Ambassador (673 Geary St, btw Leavenworth &amp;amp; Jones, San Francisco) w/ 2 others http://4sq.com/xok3rI&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;c1&quot;&gt;// Let&amp;#39;s limit to New York for simplicity.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;FOURSQUARE_REGEX&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;&amp;quot;&amp;quot;I&amp;#39;m at (.+?) \(.*? New York&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;itemField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Pipe&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;foursquareCheckins&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;nc&quot;&gt;TweetSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mapTo&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;userId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getUserId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toLong&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getText&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flatMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;location&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;          &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;          &lt;span class=&quot;nc&quot;&gt;FOURSQUARE_REGEX&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;findFirstMatchIn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subgroups&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;l&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;l&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rename&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;userId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;location&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;itemField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unique&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;itemField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;n&quot;&gt;foursquareCheckins&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;&amp;#8230;and bam! Here are locations similar to the &lt;strong&gt;Empire State Building&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/empire-state-building.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/empire-state-building.png&quot; alt=&quot;Empire State Building&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are places you might want to check out, if you check in at &lt;strong&gt;Bergdorf Goodman&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/bergdorf-goodman.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/bergdorf-goodman.png&quot; alt=&quot;Bergdorf Goodman&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here&amp;#8217;s where to go after the &lt;strong&gt;Statue of Liberty&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/statue-of-liberty.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/statue-of-liberty.png&quot; alt=&quot;Statue of Liberty&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Power of Twitter, yo.&lt;/p&gt;
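
&lt;p&gt;As a quick sanity check on the &lt;code&gt;Foursquare&lt;/code&gt; class above, here&amp;#8217;s a rough standalone sketch of its parsing and de-duplication steps on an in-memory list of tweets. The &lt;code&gt;FoursquareRegexSketch&lt;/code&gt; object, the &lt;code&gt;checkins&lt;/code&gt; helper, and the user ids are hypothetical; &lt;code&gt;distinct&lt;/code&gt; stands in for the &lt;code&gt;unique&lt;/code&gt; call in the Scalding pipeline.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;scala&quot;&gt;object FoursquareRegexSketch {
  // Same pattern as FOURSQUARE_REGEX above: only New York check-ins match.
  val FoursquareRegex = &amp;quot;&amp;quot;&amp;quot;I&amp;#39;m at (.+?) \(.*? New York&amp;quot;&amp;quot;&amp;quot;.r

  // Turns (userId, tweetText) pairs into (userId, location, 1) triples,
  // dropping non-matching tweets and collapsing repeat check-ins.
  def checkins(tweets: Seq[(Long, String)]): Seq[(Long, String, Int)] =
    tweets.flatMap { case (userId, text) =&amp;gt;
      FoursquareRegex.findFirstMatchIn(text).map(m =&amp;gt; (userId, m.group(1), 1))
    }.distinct

  def main(args: Array[String]): Unit = {
    val nyc = &amp;quot;I&amp;#39;m at Empire State Building (350 5th Ave., btwn 33rd &amp;amp; 34th St., New York) http://4sq.com/zZ5xGd&amp;quot;
    val sf = &amp;quot;I&amp;#39;m at The Ambassador (673 Geary St, btw Leavenworth &amp;amp; Jones, San Francisco) w/ 2 others http://4sq.com/xok3rI&amp;quot;
    // User ids 1 and 2 are made up. Prints List((1,Empire State Building,1))
    println(checkins(Seq((1L, nyc), (1L, nyc), (2L, sf))))
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The de-duplication matters because a user who checks in at the same place ten times should still count as a single visitor when comparing locations.&lt;/p&gt;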

&lt;h1&gt;RottenTomatoes Similarities&lt;/h1&gt;

&lt;p&gt;UPDATE: I found some movie data after all&amp;#8230;&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p&gt;My review for &amp;#8216;How to Train Your Dragon&amp;#8217; on Rotten Tomatoes: 4 1/2 stars &amp;gt;&lt;a href=&quot;http://t.co/YTOKWLEt&quot; title=&quot;http://bit.ly/xtw3d3&quot;&gt;bit.ly/xtw3d3&lt;/a&gt;&lt;/p&gt;&amp;mdash; Benjamin West (@BenTheWest) &lt;a href=&quot;https://twitter.com/BenTheWest/status/171772890121895936&quot; data-datetime=&quot;2012-02-21T01:47:03+00:00&quot;&gt;February 21, 2012&lt;/a&gt;&lt;/blockquote&gt;


&lt;p&gt;So let&amp;#8217;s use RottenTomatoes tweets to recommend movies! Here&amp;#8217;s the code for a class that searches for RottenTomatoes tweets:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;Movie similarities with RottenTomatoes &lt;/span&gt;&lt;a href=&#8217;https://github.com/echen/scaldingale/blob/master/RottenTomatoes.scala&#8217;&gt;RottenTomatoes.scala&lt;/a&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;11&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;12&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;13&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;14&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;15&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;16&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;17&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;18&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;19&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;20&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;21&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;22&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;23&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;24&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;25&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;26&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;scala&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;RottenTomatoes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;args&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;extends&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;VectorSimilarities&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;cm&quot;&gt;/**&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;   * Example tweets:&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;   * My review for &amp;#39;Hop&amp;#39; on Rotten Tomatoes: 1 star &amp;gt; http://bit.ly/AB7Tl4&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;   * My review for &amp;#39;The Bothersome Man (Den Brysomme mannen)&amp;#39; on Rotten Tomatoes: 3 stars-A muddled Playtime in Paris,&#8230; http://tmto.es/AvPoO2&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;cm&quot;&gt;   */&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ROTTENTOMATOES_REGEX&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;&amp;quot;&amp;quot;My review for &amp;#39;(.+?)&amp;#39; on Rotten Tomatoes: (\d) star&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MIN_NUM_RATERS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MAX_NUM_RATERS&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MIN_INTERSECTION&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;k&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;itemField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingField&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Pipe&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rottenTomatoesRatings&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;      &lt;span class=&quot;nc&quot;&gt;TweetSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mapTo&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;userId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getUserId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toLong&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getText&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flatMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;          &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;          &lt;span class=&quot;nc&quot;&gt;ROTTENTOMATOES_REGEX&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;findFirstMatchIn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;subgroups&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toInt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rename&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;userId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;movie&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;rating&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;itemField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;        &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;unique&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;userField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;itemField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ratingField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;    &lt;span class=&quot;n&quot;&gt;rottenTomatoesRatings&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;And here are the most similar movies discovered:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/top-rottentomatoes-sims.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/top-rottentomatoes-sims.png&quot; alt=&quot;Top RottenTomatoes Movies&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We see that&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Lord of the Rings&lt;/em&gt;, &lt;em&gt;Harry Potter&lt;/em&gt;, and &lt;em&gt;Star Wars&lt;/em&gt; movies are similar to other &lt;em&gt;Lord of the Rings&lt;/em&gt;, &lt;em&gt;Harry Potter&lt;/em&gt;, and &lt;em&gt;Star Wars&lt;/em&gt; movies&lt;/li&gt;
&lt;li&gt;Big science fiction blockbusters (&lt;em&gt;Avatar&lt;/em&gt;) are similar to big science fiction blockbusters (&lt;em&gt;Inception&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;People who like one Justin Timberlake movie (&lt;em&gt;Bad Teacher&lt;/em&gt;) also like other Justin Timberlake movies (&lt;em&gt;In Time&lt;/em&gt;). The same goes for Michael Fassbender (&lt;em&gt;A Dangerous Method&lt;/em&gt;, &lt;em&gt;Shame&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Art house movies (&lt;em&gt;The Tree of Life&lt;/em&gt;) stick together (&lt;em&gt;Tinker Tailor Soldier Spy&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Let&amp;#8217;s also look at the movies with the most &lt;em&gt;negative&lt;/em&gt; correlation:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/bottom-rottentomatoes-sims.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/scaldingale/bottom-rottentomatoes-sims.png&quot; alt=&quot;Negative RottenTomatoes Movies&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(The more you like loud and dirty popcorn movies (&lt;em&gt;Thor&lt;/em&gt;) and vamp romance (&lt;em&gt;Twilight&lt;/em&gt;), the less you like arthouse? SGTM.)&lt;/p&gt;

&lt;h1&gt;Next Steps&lt;/h1&gt;

&lt;p&gt;Hopefully I gave you a taste of the awesomeness of Scalding. To learn even more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check out &lt;a href=&quot;https://github.com/twitter/scalding&quot;&gt;Scalding on GitHub&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Read &lt;a href=&quot;https://github.com/twitter/scalding/wiki/Getting-Started&quot;&gt;this Getting Started Guide&lt;/a&gt; on the Scalding wiki.&lt;/li&gt;
&lt;li&gt;Run through &lt;a href=&quot;https://github.com/twitter/scalding/tree/master/tutorial&quot;&gt;this code-based introduction&lt;/a&gt;, complete with Scalding jobs that you can run in local mode.&lt;/li&gt;
&lt;li&gt;Browse &lt;a href=&quot;https://github.com/twitter/scalding/wiki/API-Reference&quot;&gt;the API reference&lt;/a&gt;, which also contains many code snippets illustrating different Scalding functions (e.g., &lt;code&gt;map&lt;/code&gt;, &lt;code&gt;filter&lt;/code&gt;, &lt;code&gt;flatMap&lt;/code&gt;, &lt;code&gt;groupBy&lt;/code&gt;, &lt;code&gt;count&lt;/code&gt;, &lt;code&gt;join&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;And all the code for this post is &lt;a href=&quot;https://github.com/echen/scaldingale&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Watch out for more documentation soon, and you should most definitely &lt;a href=&quot;https://twitter.com/#!/scalding&quot;&gt;follow @Scalding&lt;/a&gt; on Twitter for updates or to ask any questions.&lt;/p&gt;

&lt;h1&gt;Mad Props&lt;/h1&gt;

&lt;p&gt;And finally, a huge shoutout to &lt;a href=&quot;https://twitter.com/argyris&quot;&gt;Argyris Zymnis&lt;/a&gt;, &lt;a href=&quot;https://twitter.com/avibryant&quot;&gt;Avi Bryant&lt;/a&gt;, and &lt;a href=&quot;https://twitter.com/posco&quot;&gt;Oscar Boykin&lt;/a&gt;, the mastermind hackers who have spent (and continue spending!) unimaginable hours making Scalding a joy to use.&lt;/p&gt;

&lt;p&gt;@argyris, @avibryant, @posco: Thanks for it all. #awesomejobguys #loveit&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Quick Introduction to ggplot2</title>
    <link href="http://blog.echen.me/2012/01/17/quick-introduction-to-ggplot2/"/>
    <updated>2012-01-17T10:28:55-08:00</updated>
    <id>http://blog.echen.me/2012/01/17/quick-introduction-to-ggplot2</id>
    <content type="html">&lt;p&gt;This is a bare-bones introduction to &lt;a href=&quot;http://had.co.nz/ggplot2/&quot;&gt;ggplot2&lt;/a&gt;, a visualization package in R. It assumes no knowledge of R.&lt;/p&gt;

&lt;p&gt;For a better-looking version of this post, see &lt;a href=&quot;https://github.com/echen/ggplot2-tutorial&quot;&gt;this GitHub repository&lt;/a&gt;, which also contains some of the &lt;a href=&quot;https://github.com/echen/ggplot2-tutorial/tree/master/data&quot;&gt;example datasets&lt;/a&gt; I use and a &lt;a href=&quot;https://github.com/echen/ggplot2-tutorial/blob/master/ggplot2-tutorial.R&quot;&gt;literate programming version&lt;/a&gt; of this tutorial.&lt;/p&gt;

&lt;h1&gt;Preview&lt;/h1&gt;

&lt;p&gt;Let&amp;#8217;s start with a preview of what ggplot2 can do.&lt;/p&gt;

&lt;p&gt;Given Fisher&amp;#8217;s &lt;a href=&quot;http://en.wikipedia.org/wiki/Iris_flower_data_set&quot;&gt;iris&lt;/a&gt; data set and one simple command&amp;#8230;&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;qplot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;Sepal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; Petal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; iris&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; color &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; Species&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;&amp;#8230;we can produce this plot of sepal length vs. petal length, colored by species.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal-specied.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal-specied.png&quot; alt=&quot;Sepal vs. Petal, Colored by Species&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Installation&lt;/h1&gt;

&lt;p&gt;You can download R &lt;a href=&quot;http://cran.opensourceresources.org/&quot;&gt;here&lt;/a&gt;. After installation, you can launch R in interactive mode by either typing &lt;code&gt;R&lt;/code&gt; on the command line or opening the standard GUI (which should have been included in the download).&lt;/p&gt;

&lt;h1&gt;R Basics&lt;/h1&gt;

&lt;h2&gt;Vectors&lt;/h2&gt;

&lt;p&gt;Vectors are a core data structure in R, and are created with &lt;code&gt;c()&lt;/code&gt;. Elements in a vector must be of the same type.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;numbers &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;23&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;31&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;names &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;edwin&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;alice&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;bob&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
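
&lt;p&gt;As a quick illustration of the same-type rule (a minimal sketch using only base R; the variable names &lt;code&gt;mixed&lt;/code&gt; and &lt;code&gt;only_numbers&lt;/code&gt; are made up for this example): mixing types in &lt;code&gt;c()&lt;/code&gt; does not raise an error. R instead converts every element to one common type.&lt;/p&gt;

```r
# Mixing a number, a string, and a logical: everything becomes a string.
mixed = c(23, "alice", TRUE)
class(mixed)         # "character"
mixed                # "23" "alice" "TRUE"

# A vector built only from numbers stays numeric.
only_numbers = c(23, 13, 5)
class(only_numbers)  # "numeric"
```

&lt;p&gt;So if a column of numbers unexpectedly behaves like text, a stray string in the vector is a likely culprit.&lt;/p&gt;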


&lt;p&gt;Elements are indexed starting at 1, and are accessed with &lt;code&gt;[]&lt;/code&gt; notation.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;numbers&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# 23&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;names&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# edwin&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
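
&lt;p&gt;The &lt;code&gt;[]&lt;/code&gt; notation also accepts more than a single position. Here is a short sketch, using the same &lt;code&gt;numbers&lt;/code&gt; vector as above, of a few common forms in base R:&lt;/p&gt;

```r
numbers = c(23, 13, 5, 7, 31)

numbers[2:4]      # 13 5 7      (a range of positions)
numbers[c(1, 5)]  # 23 31       (an arbitrary set of positions)
numbers[-1]       # 13 5 7 31   (a negative index drops that position)
length(numbers)   # 5
```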


&lt;h2&gt;Data frames&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://www.r-tutor.com/r-introduction/data-frame&quot;&gt;Data frames&lt;/a&gt; are like matrices, but with named columns of different types (similar to &lt;a href=&quot;http://code.google.com/p/sqldf/&quot;&gt;database tables&lt;/a&gt;).&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;books &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; data.frame&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  title &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;harry potter&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;war and peace&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;lord of the rings&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# column named &amp;quot;title&amp;quot;&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  author &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;rowling&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;tolstoy&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;tolkien&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  num_pages &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;350&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;875&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;500&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# numeric column, unlike the string columns above&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;You can access columns of a data frame with &lt;code&gt;$&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;books&lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;title &lt;span class=&quot;c1&quot;&gt;# c(&amp;quot;harry potter&amp;quot;, &amp;quot;war and peace&amp;quot;, &amp;quot;lord of the rings&amp;quot;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;books&lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;author&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# &amp;quot;rowling&amp;quot;&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
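
&lt;p&gt;Besides &lt;code&gt;$&lt;/code&gt;, base R offers a few other ways to reach into a data frame. The sketch below rebuilds a two-column version of &lt;code&gt;books&lt;/code&gt; so that it runs on its own. Passing &lt;code&gt;stringsAsFactors = FALSE&lt;/code&gt; is my addition: R versions before 4.0 convert strings to factors by default, which changes how values such as &lt;code&gt;books$author[1]&lt;/code&gt; print.&lt;/p&gt;

```r
books = data.frame(
  title = c("harry potter", "war and peace", "lord of the rings"),
  author = c("rowling", "tolstoy", "tolkien"),
  stringsAsFactors = FALSE  # keep the columns as plain strings
)

books[["title"]][2]  # "war and peace"  (column by name, then element)
books[2, "author"]   # "tolstoy"        (row 2, column "author")
books[2, ]           # the whole second row, as a one-row data frame
nrow(books)          # 3
names(books)         # "title" "author"
```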


&lt;p&gt;You can also create new columns with &lt;code&gt;$&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;books&lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;num_bought_today &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;books&lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;num_bought_yesterday &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;18&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;books&lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;total_num_bought &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; books&lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;num_bought_today &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; books&lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;num_bought_yesterday
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
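
&lt;p&gt;The last line above works because arithmetic on vectors is applied element by element. A small sketch of that behavior, with made-up names (&lt;code&gt;today&lt;/code&gt;, &lt;code&gt;yesterday&lt;/code&gt;, &lt;code&gt;sales&lt;/code&gt;) and base R&amp;#8217;s &lt;code&gt;transform()&lt;/code&gt; as one alternative way to add a derived column:&lt;/p&gt;

```r
# Hypothetical sales counts, matching the numbers used above.
today = c(10, 5, 8)
yesterday = c(18, 13, 20)

today + yesterday  # 28 18 28  (added position by position)
today * 2          # 20 10 16  (the single value 2 is reused for every element)

# transform() builds a derived column from existing ones in a single call.
sales = data.frame(today = today, yesterday = yesterday)
sales = transform(sales, total = today + yesterday)
sales$total        # 28 18 28
```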


&lt;h2&gt;read.table&lt;/h2&gt;

&lt;p&gt;Suppose you want to import a TSV file into R as a data frame.&lt;/p&gt;

&lt;h3&gt;TSV file without header&lt;/h3&gt;

&lt;p&gt;For example, consider the &lt;a href=&quot;https://github.com/echen/r-tutorial/blob/master/data/students.tsv&quot;&gt;&lt;code&gt;data/students.tsv&lt;/code&gt;&lt;/a&gt; file (with columns describing each student&amp;#8217;s age, test score, and name).&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;m&quot;&gt;13&lt;/span&gt;   &lt;span class=&quot;m&quot;&gt;100&lt;/span&gt; alice
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;m&quot;&gt;14&lt;/span&gt;   &lt;span class=&quot;m&quot;&gt;95&lt;/span&gt;  bob
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;m&quot;&gt;13&lt;/span&gt;   &lt;span class=&quot;m&quot;&gt;82&lt;/span&gt;  eve
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;We can import this file into R using &lt;a href=&quot;http://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html&quot;&gt;&lt;code&gt;read.table()&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;students &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; read.table&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;data/students.tsv&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  header &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k-Variable&quot;&gt;F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# file does not contain a header (`F` is short for `FALSE`), so we must manually specify column names                    &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  sep &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;\t&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# file is tab-delimited        &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  col.names &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;age&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;score&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;name&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# column names&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;We can now access the different columns in the data frame with &lt;code&gt;students$age&lt;/code&gt;, &lt;code&gt;students$score&lt;/code&gt;, and &lt;code&gt;students$name&lt;/code&gt;.&lt;/p&gt;
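
&lt;p&gt;To try this without the file on disk, here is a minimal sketch that feeds three rows to &lt;code&gt;read.table&lt;/code&gt; through its &lt;code&gt;text&lt;/code&gt; argument. The inline rows and the variable name &lt;code&gt;raw&lt;/code&gt; are assumptions made for illustration (the rows are assumed to mirror &lt;code&gt;data/students.tsv&lt;/code&gt;), and &lt;code&gt;stringsAsFactors = FALSE&lt;/code&gt; is added so that the names stay plain character strings on older versions of R.&lt;/p&gt;

&lt;figure class=&quot;code&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;r&quot;&gt;# Inline copy of the rows (assumed to mirror data/students.tsv).
raw = &amp;quot;13\t100\talice\n14\t95\tbob\n13\t82\teve&amp;quot;

students = read.table(text = raw,
  header = FALSE,            # no header line in the data
  sep = &amp;quot;\t&amp;quot;,                # tab-delimited
  col.names = c(&amp;quot;age&amp;quot;, &amp;quot;score&amp;quot;, &amp;quot;name&amp;quot;),
  stringsAsFactors = FALSE   # keep names as character strings
)

students$age          # the age column
mean(students$score)  # average of the score column
str(students)         # column types at a glance
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/figure&gt;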

&lt;h3&gt;csv file with header&lt;/h3&gt;

&lt;p&gt;For an example of a file in a different format, look at the &lt;a href=&quot;https://github.com/echen/r-tutorial/blob/master/data/studentsWithHeader.tsv&quot;&gt;&lt;code&gt;data/studentsWithHeader.tsv&lt;/code&gt;&lt;/a&gt; file.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;age&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;score&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;name
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;m&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;alice
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;m&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;95&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;bob
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;m&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;82&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;eve
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;Here we have the same data, but now the file is comma-delimited and contains a header. We can import this file with:&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;students &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; read.table&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;data/studentsWithHeader.tsv&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  sep &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;,&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  header &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k-Variable&quot;&gt;T&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# first line contains column names, so we can immediately call `students$age`        &lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;(Note: there is also a &lt;code&gt;read.csv&lt;/code&gt; function that uses &lt;code&gt;sep = &quot;,&quot;&lt;/code&gt; and &lt;code&gt;header = TRUE&lt;/code&gt; by default.)&lt;/p&gt;
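
&lt;p&gt;As a quick check of that note, the sketch below reads the same comma-delimited text once with &lt;code&gt;read.table&lt;/code&gt; and once with &lt;code&gt;read.csv&lt;/code&gt;, then compares the results. The inline text and the names &lt;code&gt;raw&lt;/code&gt;, &lt;code&gt;a&lt;/code&gt;, and &lt;code&gt;b&lt;/code&gt; are assumptions made for illustration.&lt;/p&gt;

&lt;figure class=&quot;code&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;r&quot;&gt;# Inline copy of the comma-delimited file shown above.
raw = &amp;quot;age,score,name\n13,100,alice\n14,95,bob\n13,82,eve&amp;quot;

a = read.table(text = raw, sep = &amp;quot;,&amp;quot;, header = TRUE)
b = read.csv(text = raw)  # relies on read.csv&amp;#39;s defaults for sep and header

all.equal(a, b)  # compare the two data frames
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/figure&gt;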

&lt;h2&gt;help&lt;/h2&gt;

&lt;p&gt;There are many more options that &lt;code&gt;read.table&lt;/code&gt; can take. For a list of these, just type &lt;code&gt;help(read.table)&lt;/code&gt; (or &lt;code&gt;?read.table&lt;/code&gt;) at the prompt to access documentation.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# These work for other functions as well.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;help&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;read.table&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;?read.table
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
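
&lt;p&gt;A few other base R helpers can complement the help pages. This is only a small sketch of calls that may be handy, not an exhaustive list.&lt;/p&gt;

&lt;figure class=&quot;code&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;r&quot;&gt;args(read.table)             # print the argument list with its defaults
names(formals(read.table))   # just the argument names
help.search(&amp;quot;read.table&amp;quot;)    # search the installed documentation
example(head)                # run the examples from a help page
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/figure&gt;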


&lt;h1&gt;ggplot2&lt;/h1&gt;

&lt;p&gt;With these R basics in place, let&amp;#8217;s dive into the ggplot2 package.&lt;/p&gt;

&lt;h2&gt;Installation&lt;/h2&gt;

&lt;p&gt;One of R&amp;#8217;s greatest strengths is its excellent set of &lt;a href=&quot;http://cran.r-project.org/web/packages/available_packages_by_name.html&quot;&gt;packages&lt;/a&gt;. To install a package, you can use the &lt;code&gt;install.packages()&lt;/code&gt; function.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;install.packages&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;ggplot2&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;To load a package into your current R session, use &lt;code&gt;library()&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;library&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;ggplot2&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;
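
&lt;p&gt;In a script that gets run more than once, it can be convenient to install the package only when it is missing. One possible sketch, using base R&amp;#8217;s &lt;code&gt;requireNamespace()&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;code&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;r&quot;&gt;# Install ggplot2 only if it is not already available, then load it.
if (!requireNamespace(&amp;quot;ggplot2&amp;quot;, quietly = TRUE)) {
  install.packages(&amp;quot;ggplot2&amp;quot;)
}
library(ggplot2)

packageVersion(&amp;quot;ggplot2&amp;quot;)  # which version got loaded
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/figure&gt;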


&lt;h2&gt;Scatterplots with qplot()&lt;/h2&gt;

&lt;p&gt;Let&amp;#8217;s look at how to create a scatterplot in ggplot2. We&amp;#8217;ll use the &lt;code&gt;iris&lt;/code&gt; data frame that&amp;#8217;s automatically loaded into R.&lt;/p&gt;

&lt;p&gt;What does the data frame contain? We can use the &lt;code&gt;head&lt;/code&gt; function to look at the first few rows.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;10&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;head&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;iris&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# by default, head displays the first 6 rows. see `?head`&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;head&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;iris&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; n &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# we can also explicitly set the number of rows to display&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;Sepal.Length Sepal.Width Petal.Length Petal.Width Species
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;         &lt;span class=&quot;m&quot;&gt;5.1&lt;/span&gt;         &lt;span class=&quot;m&quot;&gt;3.5&lt;/span&gt;          &lt;span class=&quot;m&quot;&gt;1.4&lt;/span&gt;         &lt;span class=&quot;m&quot;&gt;0.2&lt;/span&gt;  setosa
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;         &lt;span class=&quot;m&quot;&gt;4.9&lt;/span&gt;         &lt;span class=&quot;m&quot;&gt;3.0&lt;/span&gt;          &lt;span class=&quot;m&quot;&gt;1.4&lt;/span&gt;         &lt;span class=&quot;m&quot;&gt;0.2&lt;/span&gt;  setosa
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;         &lt;span class=&quot;m&quot;&gt;4.7&lt;/span&gt;         &lt;span class=&quot;m&quot;&gt;3.2&lt;/span&gt;          &lt;span class=&quot;m&quot;&gt;1.3&lt;/span&gt;         &lt;span class=&quot;m&quot;&gt;0.2&lt;/span&gt;  setosa
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;         &lt;span class=&quot;m&quot;&gt;4.6&lt;/span&gt;         &lt;span class=&quot;m&quot;&gt;3.1&lt;/span&gt;          &lt;span class=&quot;m&quot;&gt;1.5&lt;/span&gt;         &lt;span class=&quot;m&quot;&gt;0.2&lt;/span&gt;  setosa
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;         &lt;span class=&quot;m&quot;&gt;5.0&lt;/span&gt;         &lt;span class=&quot;m&quot;&gt;3.6&lt;/span&gt;          &lt;span class=&quot;m&quot;&gt;1.4&lt;/span&gt;         &lt;span class=&quot;m&quot;&gt;0.2&lt;/span&gt;  setosa
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;         &lt;span class=&quot;m&quot;&gt;5.4&lt;/span&gt;         &lt;span class=&quot;m&quot;&gt;3.9&lt;/span&gt;          &lt;span class=&quot;m&quot;&gt;1.7&lt;/span&gt;         &lt;span class=&quot;m&quot;&gt;0.4&lt;/span&gt;  setosa
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;(The first rows are all setosa, but the data frame actually contains three species: setosa, versicolor, and virginica.)&lt;/p&gt;
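
&lt;p&gt;Since &lt;code&gt;head&lt;/code&gt; only shows the top of the data frame, a few more base R calls help confirm what the rest of it looks like. A short sketch:&lt;/p&gt;

&lt;figure class=&quot;code&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;r&quot;&gt;str(iris)             # column names and types
nrow(iris)            # number of rows
levels(iris$Species)  # the distinct species
table(iris$Species)   # how many rows each species has
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/figure&gt;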

&lt;p&gt;Let&amp;#8217;s plot &lt;code&gt;Sepal.Length&lt;/code&gt; against &lt;code&gt;Petal.Length&lt;/code&gt; using ggplot2&amp;#8217;s &lt;code&gt;qplot()&lt;/code&gt; function.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;qplot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;Sepal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; Petal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; iris&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# Plot Sepal.Length vs. Petal.Length, using data from the `iris` data frame.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# * First argument `Sepal.Length` goes on the x-axis.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# * Second argument `Petal.Length` goes on the y-axis.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# * `data = iris` means to look for this data in the `iris` data frame.    &lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;



&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal.png&quot; alt=&quot;Sepal Length vs. Petal Length&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To see where each species is located in this graph, we can color each point by adding a &lt;code&gt;color = Species&lt;/code&gt; argument.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;qplot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;Sepal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; Petal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; iris&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; color &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; Species&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# dude!&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal-specied.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal-specied.png&quot; alt=&quot;Sepal vs. Petal, Colored by Species&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, we can let the size of each point denote petal width, by adding a &lt;code&gt;size = Petal.Width&lt;/code&gt; argument.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;qplot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;Sepal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; Petal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; iris&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; color &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; Species&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; Petal.Width&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# We see that Iris setosa flowers have the narrowest petals.&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal-sized.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal-sized.png&quot; alt=&quot;Sepal vs. Petal, Sized by Petal Width&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;qplot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;Sepal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; Petal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; iris&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; color &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; Species&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; size &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; Petal.Width&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; alpha &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; I&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0.7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# Setting alpha to 0.7 makes each point partly transparent, which reduces the effects of overplotting. I() marks 0.7 as a fixed value rather than a variable to map.&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal-alpha.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal-alpha.png&quot; alt=&quot;Sepal vs. Petal, with Transparency&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, let&amp;#8217;s fix the axis labels and add a title to the plot.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;qplot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;Sepal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; Petal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; iris&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; color &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; Species&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  xlab &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Sepal Length&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; ylab &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Petal Length&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  main &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Sepal vs. Petal Length in Fisher&amp;#39;s Iris data&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal-titled.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal-titled.png&quot; alt=&quot;Sepal vs. Petal, Titled&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
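
&lt;p&gt;For comparison, the same titled scatterplot can also be written with ggplot2&amp;#8217;s layered &lt;code&gt;ggplot()&lt;/code&gt; interface. This is a sketch of one equivalent spelling, not a replacement for the &lt;code&gt;qplot()&lt;/code&gt; calls above; the variable name &lt;code&gt;p&lt;/code&gt; is just an illustrative choice.&lt;/p&gt;

&lt;figure class=&quot;code&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;r&quot;&gt;library(ggplot2)

# Same scatterplot, built layer by layer instead of through qplot().
p = ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  labs(x = &amp;quot;Sepal Length&amp;quot;, y = &amp;quot;Petal Length&amp;quot;,
       title = &amp;quot;Sepal vs. Petal Length in Fisher&amp;#39;s Iris data&amp;quot;)

print(p)  # draw the plot
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/figure&gt;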

&lt;h2&gt;Other common geoms&lt;/h2&gt;

&lt;p&gt;In the scatterplot examples above, we implicitly used a &lt;em&gt;point&lt;/em&gt; &lt;strong&gt;geom&lt;/strong&gt;, the default when you supply both x and y values to &lt;code&gt;qplot()&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# These two invocations are equivalent.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;qplot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;Sepal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; Petal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; iris&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; geom &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;point&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;qplot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;Sepal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; Petal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; iris&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;But we can also easily use other types of geoms to create more kinds of plots.&lt;/p&gt;

&lt;h3&gt;Barcharts: geom = &amp;#8220;bar&amp;#8221;&lt;/h3&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;5&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;6&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;7&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;8&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;9&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;movies &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; data.frame&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  director &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;spielberg&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;spielberg&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;spielberg&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;jackson&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;jackson&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  movie &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;jaws&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;avatar&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;schindler&amp;#39;s list&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;lotr&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;king kong&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  minutes &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;124&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;163&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;195&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;600&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;187&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# Plot the number of movies each director has.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;qplot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;director&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; movies&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; geom &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;bar&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; ylab &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;# movies&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# By default, the height of each bar is simply a count.&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/num-movies.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/num-movies.png&quot; alt=&quot;# Movies&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# But we can also supply a different weight.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# Here the height of each bar is the total running time of the director&amp;#39;s movies.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;qplot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;director&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; weight &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; minutes&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; movies&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; geom &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;bar&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; ylab &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;total length (min.)&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/total-length.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/total-length.png&quot; alt=&quot;Total Running Time&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
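
&lt;p&gt;To see which numbers the two bar charts are drawing, we can compute them directly in base R. This sketch rebuilds only the two columns it needs from the &lt;code&gt;movies&lt;/code&gt; data frame above; the name &lt;code&gt;totals&lt;/code&gt; is an illustrative choice.&lt;/p&gt;

&lt;figure class=&quot;code&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;r&quot;&gt;movies = data.frame(
  director = c(&amp;quot;spielberg&amp;quot;, &amp;quot;spielberg&amp;quot;, &amp;quot;spielberg&amp;quot;, &amp;quot;jackson&amp;quot;, &amp;quot;jackson&amp;quot;),
  minutes = c(124, 163, 195, 600, 187)
)

# Rows per director: the default bar height.
table(movies$director)

# Sum of minutes per director: the bar height when weight = minutes.
totals = tapply(movies$minutes, movies$director, sum)
totals
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/figure&gt;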

&lt;h3&gt;Line charts: geom = &amp;#8220;line&amp;#8221;&lt;/h3&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;qplot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;Sepal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; Petal.Length&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; iris&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; geom &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;line&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; color &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; Species&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# Using a line geom doesn&amp;#39;t really make sense here, but hey.&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal-lined.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/sepal-vs-petal-lined.png&quot; alt=&quot;Sepal vs. Petal, Lined&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;3&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;4&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# `Orange` is another built-in data frame that describes the growth of orange trees.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;qplot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;age&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; circumference&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; Orange&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; geom &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;line&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  colour &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; Tree&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;  main &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;How does orange tree circumference vary with age?&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/orange-tree-growth.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/orange-tree-growth.png&quot; alt=&quot;Orange Tree Growth&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;figure class=&#8217;code&#8217;&gt;&lt;figcaption&gt;&lt;span&gt;&lt;/span&gt;&lt;/figcaption&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;table&gt;&lt;tr&gt;&lt;td class=&quot;gutter&quot;&gt;&lt;pre class=&quot;line-numbers&quot;&gt;&lt;span class=&#8217;line-number&#8217;&gt;1&lt;/span&gt;
&lt;span class=&#8217;line-number&#8217;&gt;2&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&#8217;code&#8217;&gt;&lt;pre&gt;&lt;code class=&#8217;r&#8217;&gt;&lt;span class=&#8217;line&#8217;&gt;&lt;span class=&quot;c1&quot;&gt;# We can also plot both points and lines.&lt;/span&gt;
&lt;/span&gt;&lt;span class=&#8217;line&#8217;&gt;qplot&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;age&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; circumference&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; data &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; Orange&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; geom &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; c&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;point&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;line&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; colour &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; Tree&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;/figure&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/orange-tree-pointed.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/r/ggplot2/orange-tree-pointed.png&quot; alt=&quot;Orange Tree with Points&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that&amp;#8217;s it with what I&amp;#8217;ll cover.&lt;/p&gt;

&lt;h1&gt;Next Steps&lt;/h1&gt;

&lt;p&gt;I skipped over a lot of aspects of R and ggplot2 in this intro.&lt;/p&gt;

&lt;p&gt;For example,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are many geoms (and other functionalities) in ggplot2 that I didn&amp;#8217;t cover, e.g., &lt;a href=&quot;http://had.co.nz/ggplot2/geom_boxplot.html&quot;&gt;boxplots&lt;/a&gt; and &lt;a href=&quot;http://had.co.nz/ggplot2/geom_histogram.html&quot;&gt;histograms&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I didn&amp;#8217;t talk about ggplot2&amp;#8217;s layering system, or the &lt;a href=&quot;http://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448&quot;&gt;grammar of graphics&lt;/a&gt; it&amp;#8217;s based on.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So I&amp;#8217;ll end with some additional resources on R and ggplot2.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I don&amp;#8217;t use it myself, but &lt;a href=&quot;http://rstudio.org/&quot;&gt;RStudio&lt;/a&gt; is a popular IDE for R.&lt;/li&gt;
&lt;li&gt;The &lt;a href=&quot;http://had.co.nz/ggplot2/&quot;&gt;official ggplot2 documentation&lt;/a&gt; is great and has lots of examples. There&amp;#8217;s also an excellent &lt;a href=&quot;http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/dp/0387981403&quot;&gt;book&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://plyr.had.co.nz/&quot;&gt;plyr&lt;/a&gt; is another fantastic R package that&amp;#8217;s also by Hadley Wickham (the author of ggplot2).&lt;/li&gt;
&lt;li&gt;The &lt;a href=&quot;http://cran.r-project.org/doc/manuals/R-intro.html&quot;&gt;official R introduction&lt;/a&gt; is okay, but definitely not great. I haven&amp;#8217;t found any R tutorials I really like, but I&amp;#8217;ve heard good things about &lt;a href=&quot;http://www.amazon.com/Art-Programming-Statistical-Software-Design/dp/1593273843&quot;&gt;The Art of R Programming&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

</content>
  </entry>
  
  <entry>
    <title>Introduction to Conditional Random Fields</title>
    <link href="http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/"/>
    <updated>2012-01-03T16:02:25-08:00</updated>
    <id>http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields</id>
    <content type="html">&lt;p&gt;Imagine you have a sequence of snapshots from a day in Justin Bieber&amp;#8217;s life, and you want to label each image with the activity it represents (eating, sleeping, driving, etc.). How can you do this?&lt;/p&gt;

&lt;p&gt;One way is to ignore the sequential nature of the snapshots, and build a &lt;em&gt;per-image&lt;/em&gt; classifier. For example, given a month&amp;#8217;s worth of labeled snapshots, you might learn that dark images taken at 6am tend to be about sleeping, images with lots of bright colors tend to be about dancing, images of cars are about driving, and so on.&lt;/p&gt;

&lt;p&gt;By ignoring this sequential aspect, however, you lose a lot of information. For example, what happens if you see a close-up picture of a mouth &amp;#8211; is it about singing or eating? If you know that the &lt;em&gt;previous&lt;/em&gt; image is a picture of Justin Bieber eating or cooking, then it&amp;#8217;s more likely this picture is about eating; if, however, the previous image contains Justin Bieber singing or dancing, then this one probably shows him singing as well.&lt;/p&gt;

&lt;p&gt;Thus, to increase the accuracy of our labeler, we should incorporate the labels of nearby photos, and this is precisely what a &lt;strong&gt;conditional random field&lt;/strong&gt; does.&lt;/p&gt;

&lt;h1&gt;Part-of-Speech Tagging&lt;/h1&gt;

&lt;p&gt;Let&amp;#8217;s go into some more detail, using the more common example of &lt;strong&gt;part-of-speech tagging&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In POS tagging, the goal is to label a sentence (a sequence of words or tokens) with tags like ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE.&lt;/p&gt;

&lt;p&gt;For example, given the sentence &amp;#8220;Bob drank coffee at Starbucks&amp;#8221;, the labeling might be &amp;#8220;Bob (NOUN) drank (VERB) coffee (NOUN) at (PREPOSITION) Starbucks (NOUN)&amp;#8221;.&lt;/p&gt;

&lt;p&gt;So let&amp;#8217;s build a conditional random field to label sentences with their parts of speech. Just like any classifier, we&amp;#8217;ll first need to decide on a set of feature functions $f_i$.&lt;/p&gt;

&lt;h2&gt;Feature Functions in a CRF&lt;/h2&gt;

&lt;p&gt;In a CRF, each &lt;strong&gt;feature function&lt;/strong&gt; is a function that takes in as input:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a sentence s&lt;/li&gt;
&lt;li&gt;the position i of a word in the sentence&lt;/li&gt;
&lt;li&gt;the label $l_i$ of the current word&lt;/li&gt;
&lt;li&gt;the label $l_{i-1}$ of the previous word&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;and outputs a real-valued number (though the numbers are often just either 0 or 1).&lt;/p&gt;

&lt;p&gt;(Note: by restricting our features to depend on only the &lt;em&gt;current&lt;/em&gt; and &lt;em&gt;previous&lt;/em&gt; labels, rather than arbitrary labels throughout the sentence, I&amp;#8217;m actually building the special case of a &lt;strong&gt;linear-chain CRF&lt;/strong&gt;. For simplicity, I&amp;#8217;m going to ignore general CRFs in this post.)&lt;/p&gt;

&lt;p&gt;For example, one possible feature function could measure how much we suspect that the current word should be labeled as an adjective given that the previous word is &amp;#8220;very&amp;#8221;.&lt;/p&gt;

&lt;h2&gt;Features to Probabilities&lt;/h2&gt;

&lt;p&gt;Next, assign each feature function $f_j$ a &lt;strong&gt;weight&lt;/strong&gt; $\lambda_j$ (I&amp;#8217;ll talk below about how to learn these weights from the data). Given a sentence s, we can now score a labeling l of s by adding up the weighted features over all words in the sentence:&lt;/p&gt;

&lt;p&gt;$score(l | s) = \sum_{j = 1}^m \sum_{i = 1}^n \lambda_j f_j(s, i, l_i, l_{i-1})$&lt;/p&gt;

&lt;p&gt;(The first sum runs over each feature function $j$, and the inner sum runs over each position $i$ of the sentence.)&lt;/p&gt;

&lt;p&gt;Finally, we can transform these scores into probabilities $p(l | s)$ between 0 and 1 by exponentiating and normalizing:&lt;/p&gt;

&lt;p&gt;$p(l | s) = \frac{exp[score(l|s)]}{\sum_{l&amp;#8217;} exp[score(l&amp;#8217;|s)]} = \frac{exp[\sum_{j = 1}^m \sum_{i = 1}^n \lambda_j f_j(s, i, l_i, l_{i-1})]}{\sum_{l&amp;#8217;} exp[\sum_{j = 1}^m \sum_{i = 1}^n \lambda_j f_j(s, i, l&amp;#8217;_i, l&amp;#8217;_{i-1})]}$&lt;/p&gt;

&lt;h2&gt;Example Feature Functions&lt;/h2&gt;

&lt;p&gt;So what do these feature functions look like? Examples of POS tagging features could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$f_1(s, i, l_i, l_{i-1}) = 1$ if $l_i =$ ADVERB and the ith word ends in &amp;#8220;-ly&amp;#8221;; 0 otherwise.

&lt;ul&gt;
&lt;li&gt;If the weight $\lambda_1$ associated with this feature is large and positive, then this feature is essentially saying that we prefer labelings where words ending in -ly get labeled as ADVERB.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;$f_2(s, i, l_i, l_{i-1}) = 1$ if $i = 1$, $l_i =$ VERB, and the sentence ends in a question mark; 0 otherwise.

&lt;ul&gt;
&lt;li&gt;Again, if the weight $\lambda_2$ associated with this feature is large and positive, then labelings that assign VERB to the first word in a question (e.g., &amp;#8220;Is this a sentence beginning with a verb?&amp;#8221;) are preferred.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;$f_3(s, i, l_i, l_{i-1}) = 1$ if $l_{i-1} =$ ADJECTIVE and $l_i =$ NOUN; 0 otherwise.

&lt;ul&gt;
&lt;li&gt;Again, a positive weight for this feature means that adjectives tend to be followed by nouns.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;$f_4(s, i, l_i, l_{i-1}) = 1$ if $l_{i-1} =$ PREPOSITION and $l_i =$ PREPOSITION; 0 otherwise.

&lt;ul&gt;
&lt;li&gt;A &lt;em&gt;negative&lt;/em&gt; weight $\lambda_4$ for this function would mean that prepositions don&amp;#8217;t tend to follow prepositions, so we should avoid labelings where this happens.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;And that&amp;#8217;s it! To sum up: to build a conditional random field, you just define a bunch of feature functions (which can depend on the entire sentence, a current position, and nearby labels), assign them weights, and add them all together, transforming at the end to a probability if necessary.&lt;/p&gt;
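
&lt;p&gt;Here is a minimal Python sketch of the recipe above, under a few assumptions of mine: a sentence is a list of tokens, positions are 0-based (so the first word is at index 0 rather than 1), the label before the first word is a placeholder &lt;code&gt;"START"&lt;/code&gt;, and the weights are made-up numbers rather than learned ones. The normalizer is computed by brute force over every labeling, which is only feasible for tiny sentences.&lt;/p&gt;

```python
import itertools
import math

TAGS = ["NOUN", "VERB", "ADJECTIVE", "ADVERB", "PREPOSITION"]


def f1(s, i, cur, prev):
    # Fires when a word ending in "ly" is labeled ADVERB.
    return 1.0 if cur == "ADVERB" and s[i].endswith("ly") else 0.0


def f2(s, i, cur, prev):
    # Fires when the first word of a question is labeled VERB.
    return 1.0 if i == 0 and cur == "VERB" and s[-1] == "?" else 0.0


def f3(s, i, cur, prev):
    # Fires when an ADJECTIVE is followed by a NOUN.
    return 1.0 if prev == "ADJECTIVE" and cur == "NOUN" else 0.0


def f4(s, i, cur, prev):
    # Fires when a PREPOSITION is followed by a PREPOSITION.
    return 1.0 if prev == "PREPOSITION" and cur == "PREPOSITION" else 0.0


FEATURES = [f1, f2, f3, f4]
WEIGHTS = [2.0, 1.5, 1.0, -3.0]  # illustrative values, not learned from data


def score(labels, s):
    # Sum of weight * feature over every position and every feature function.
    total = 0.0
    for i in range(len(s)):
        prev = labels[i - 1] if i != 0 else "START"
        for weight, feature in zip(WEIGHTS, FEATURES):
            total += weight * feature(s, i, labels[i], prev)
    return total


def probability(labels, s):
    # Exponentiate the score and normalize over all possible labelings.
    z = sum(
        math.exp(score(list(candidate), s))
        for candidate in itertools.product(TAGS, repeat=len(s))
    )
    return math.exp(score(labels, s)) / z


sentence = ["run", "quickly"]
p_adverb = probability(["VERB", "ADVERB"], sentence)
p_noun = probability(["VERB", "NOUN"], sentence)
```

&lt;p&gt;With these weights, the labeling that tags &amp;#8220;quickly&amp;#8221; as an ADVERB gets a higher probability than the one that tags it as a NOUN, because only the former triggers $f_1$.&lt;/p&gt;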

&lt;p&gt;Now let&amp;#8217;s step back and compare CRFs to some other common machine learning techniques.&lt;/p&gt;

&lt;h1&gt;Smells like Logistic Regression&amp;#8230;&lt;/h1&gt;

&lt;p&gt;The form of the CRF probabilities
$p(l | s) = \frac{exp[\sum_{j = 1}^m \sum_{i = 1}^n \lambda_j f_j(s, i, l_i, l_{i-1})]}{\sum_{l&amp;#8217;} exp[\sum_{j = 1}^m \sum_{i = 1}^n \lambda_j f_j(s, i, l&amp;#8217;_i, l&amp;#8217;_{i-1})]}$
might look &lt;a href=&quot;http://en.wikipedia.org/wiki/Logistic_regression&quot;&gt;familiar&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That&amp;#8217;s because CRFs are indeed basically the sequential version of &lt;strong&gt;logistic regression&lt;/strong&gt;: whereas logistic regression is a log-linear model for &lt;em&gt;classification&lt;/em&gt;, CRFs are a log-linear model for &lt;em&gt;sequential labels&lt;/em&gt;.&lt;/p&gt;

&lt;h1&gt;Looks like HMMs&amp;#8230;&lt;/h1&gt;

&lt;p&gt;Recall that &lt;strong&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Hidden_Markov_model&quot;&gt;Hidden Markov Models&lt;/a&gt;&lt;/strong&gt; are another model for part-of-speech tagging (and sequential labeling in general). Whereas CRFs throw any bunch of functions together to get a label score, HMMs take a &lt;em&gt;generative&lt;/em&gt; approach to labeling, defining&lt;/p&gt;

&lt;p&gt;$p(l,s) = p(l_1) \prod_i p(l_i | l_{i-1}) p(w_i | l_i)$&lt;/p&gt;

&lt;p&gt;where&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$p(l_i | l_{i-1})$ are &lt;strong&gt;transition&lt;/strong&gt; probabilities (e.g., the probability that a preposition is followed by a noun);&lt;/li&gt;
&lt;li&gt;$p(w_i | l_i)$ are &lt;strong&gt;emission&lt;/strong&gt; probabilities (e.g., the probability that a noun emits the word &amp;#8220;dad&amp;#8221;).&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So how do HMMs compare to CRFs? CRFs are more powerful &amp;#8211; they can model everything HMMs can and more. One way of seeing this is as follows.&lt;/p&gt;

&lt;p&gt;Note that the log of the HMM probability is $\log p(l,s) = \log p(l_1) + \sum_i \log p(l_i | l_{i-1}) + \sum_i \log p(w_i | l_i)$. This has exactly the log-linear form of a CRF if we consider these log-probabilities to be the weights associated to binary transition and emission indicator features.&lt;/p&gt;

&lt;p&gt;That is, we can build a CRF equivalent to any HMM by&amp;#8230;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For each HMM &lt;em&gt;transition&lt;/em&gt; probability $ p(l_i = y | l_{i-1} = x) $, define a set of CRF transition features of the form $f_{x,y}(s, i, l_i, l_{i-1}) = 1$ if $l_i = y$ and $l_{i-1} = x$. Give each feature a weight of $w_{x,y} = \log p(l_i = y | l_{i-1} = x)$.&lt;/li&gt;
&lt;li&gt;Similarly, for each HMM &lt;em&gt;emission&lt;/em&gt; probability $p(w_i = z | l_{i} = x)$, define a set of CRF emission features of the form $g_{x,z}(s, i, l_i, l_{i-1}) = 1$ if $w_i = z$ and $l_i = x$. Give each feature a weight of $w_{x,z} = \log p(w_i = z | l_i = x)$.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Thus, the score $p(l|s)$ computed by a CRF using these feature functions is precisely proportional to the score computed by the associated HMM, and so every HMM is equivalent to some CRF.&lt;/p&gt;
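
&lt;p&gt;To make the equivalence concrete, here is a small numerical check in Python. The two-tag HMM and all of its probabilities are invented for illustration, and I treat the initial distribution as one more indicator feature with a log-probability weight (an assumption the construction above leaves implicit). The CRF score is just the sum of the log-probability weights of the indicators that fire, and both models are normalized by brute force.&lt;/p&gt;

```python
import itertools
import math

TAGS = ["N", "V"]
# Hypothetical HMM parameters.
START = {"N": 0.7, "V": 0.3}
# Key (x, y) holds p(l_i = y | l_{i-1} = x).
TRANS = {("N", "N"): 0.4, ("N", "V"): 0.6, ("V", "N"): 0.8, ("V", "V"): 0.2}
# Key (x, z) holds p(w_i = z | l_i = x).
EMIT = {("N", "dogs"): 0.6, ("N", "bark"): 0.4, ("V", "dogs"): 0.1, ("V", "bark"): 0.9}


def hmm_joint(labels, words):
    # Joint probability p(l, s) under the HMM.
    p = START[labels[0]] * EMIT[(labels[0], words[0])]
    for i in range(1, len(words)):
        p *= TRANS[(labels[i - 1], labels[i])] * EMIT[(labels[i], words[i])]
    return p


def crf_unnormalized(labels, words):
    # exp(score), where each indicator feature that fires adds its weight,
    # and each weight is the log of the matching HMM probability.
    total = math.log(START[labels[0]]) + math.log(EMIT[(labels[0], words[0])])
    for i in range(1, len(words)):
        total += math.log(TRANS[(labels[i - 1], labels[i])])
        total += math.log(EMIT[(labels[i], words[i])])
    return math.exp(total)


def conditional(unnormalized, labels, words):
    # Normalize over every labeling of the sentence to get p(l | s).
    z = sum(
        unnormalized(list(candidate), words)
        for candidate in itertools.product(TAGS, repeat=len(words))
    )
    return unnormalized(labels, words) / z


words = ["dogs", "bark"]
hmm_posterior = conditional(hmm_joint, ["N", "V"], words)
crf_posterior = conditional(crf_unnormalized, ["N", "V"], words)
```

&lt;p&gt;The two posteriors agree up to floating-point error, which is what the argument above predicts.&lt;/p&gt;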

&lt;p&gt;However, CRFs can model a much richer set of label distributions as well, for two main reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CRFs can define a much larger set of features.&lt;/strong&gt; Whereas HMMs are necessarily &lt;em&gt;local&lt;/em&gt; in nature (because they&amp;#8217;re constrained to binary transition and emission feature functions, which force each word to depend only on the current label and each label to depend only on the previous label), CRFs can use more &lt;em&gt;global&lt;/em&gt; features. For example, one of the features in our POS tagger above increased the probability of labelings that tagged the &lt;em&gt;first&lt;/em&gt; word of a sentence as a VERB if the &lt;em&gt;end&lt;/em&gt; of the sentence contained a question mark.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CRFs can have arbitrary weights.&lt;/strong&gt; Whereas the probabilities of an HMM must satisfy certain constraints (e.g., $0 \leq p(w_i | l_i) \leq 1$ and $\sum_w p(w_i = w | l_i) = 1$), the weights of a CRF are unrestricted (e.g., $\log p(w_i | l_i)$ can be anything it wants).&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;Learning Weights&lt;/h1&gt;

&lt;p&gt;Let&amp;#8217;s go back to the question of how to learn the feature weights in a CRF. One way is (surprise) to use &lt;strong&gt;gradient ascent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Assume we have a bunch of training examples (sentences and associated part-of-speech labels). Randomly initialize the weights of our CRF model.
To shift these randomly initialized weights to the correct ones, for each training example&amp;#8230;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go through each feature function $f_i$, and calculate the gradient of the log probability of the training example with respect to $\lambda_i$: $\frac{\partial}{\partial \lambda_i} \log p(l | s) = \sum_{j = 1}^n f_i(s, j, l_j, l_{j-1}) - \sum_{l&amp;#8217;} p(l&amp;#8217; | s) \sum_{j = 1}^n f_i(s, j, l&amp;#8217;_j, l&amp;#8217;_{j-1})$ (here $j$ runs over the $n$ positions of the sentence).&lt;/li&gt;
&lt;li&gt;Note that the first term in the gradient is the contribution of feature $f_i$ under the &lt;em&gt;true&lt;/em&gt; label, and the second term in the gradient is the &lt;em&gt;expected&lt;/em&gt; contribution of feature $f_i$ under the current model. This is exactly the form you&amp;#8217;d expect gradient ascent to take.&lt;/li&gt;
&lt;li&gt;Move $\lambda_i$ in the direction of the gradient: $\lambda_i = \lambda_i + \alpha [\sum_{j = 1}^n f_i(s, j, l_j, l_{j-1}) - \sum_{l&amp;#8217;} p(l&amp;#8217; | s) \sum_{j = 1}^n f_i(s, j, l&amp;#8217;_j, l&amp;#8217;_{j-1})]$ where $\alpha$ is some learning rate.&lt;/li&gt;
&lt;li&gt;Repeat the previous steps until some stopping condition is reached (e.g., the updates fall below some threshold).&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;In other words, every step takes the difference between what we want the model to learn and the model&amp;#8217;s current state, and moves $\lambda_i$ in the direction of this difference.&lt;/p&gt;
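
&lt;p&gt;The update rule can be sketched in a few lines of Python. Everything here is a toy of my own making: two tags, two invented feature functions, two training sentences, and weights initialized to zero rather than randomly. The expected feature count is computed by enumerating every labeling, which a real implementation would replace with a forward-backward pass.&lt;/p&gt;

```python
import itertools
import math

TAGS = ["NOUN", "VERB"]


def f_noun_then_verb(s, j, cur, prev):
    return 1.0 if prev == "NOUN" and cur == "VERB" else 0.0


def f_plural_noun(s, j, cur, prev):
    return 1.0 if cur == "NOUN" and s[j].endswith("s") else 0.0


FEATURES = [f_noun_then_verb, f_plural_noun]


def feature_total(f, s, labels):
    # Sum of one feature function over all positions of the sentence.
    return sum(
        f(s, j, labels[j], labels[j - 1] if j != 0 else "START")
        for j in range(len(s))
    )


def probability(weights, s, labels):
    def unnormalized(candidate):
        return math.exp(
            sum(w * feature_total(f, s, candidate) for w, f in zip(weights, FEATURES))
        )

    z = sum(unnormalized(c) for c in itertools.product(TAGS, repeat=len(s)))
    return unnormalized(labels) / z


def gradient(weights, s, labels):
    # Observed feature count minus the count expected under the current model.
    grads = []
    for f in FEATURES:
        observed = feature_total(f, s, labels)
        expected = sum(
            probability(weights, s, c) * feature_total(f, s, c)
            for c in itertools.product(TAGS, repeat=len(s))
        )
        grads.append(observed - expected)
    return grads


def train(examples, alpha=0.5, epochs=50):
    weights = [0.0] * len(FEATURES)
    for _ in range(epochs):
        for s, labels in examples:
            grads = gradient(weights, s, labels)
            weights = [w + alpha * g for w, g in zip(weights, grads)]
    return weights


examples = [
    (["dogs", "bark"], ("NOUN", "VERB")),
    (["cats", "sleep"], ("NOUN", "VERB")),
]
learned = train(examples)
p_after = probability(learned, ["dogs", "bark"], ("NOUN", "VERB"))
```

&lt;p&gt;Before training, every one of the four labelings of &amp;#8220;dogs bark&amp;#8221; has probability 0.25; after training, the true labeling takes most of the probability mass.&lt;/p&gt;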

&lt;h1&gt;Finding the Optimal Labeling&lt;/h1&gt;

&lt;p&gt;Suppose we&amp;#8217;ve trained our CRF model, and now a new sentence comes in. How do we label it?&lt;/p&gt;

&lt;p&gt;The naive way is to calculate $p(l | s)$ for every possible labeling l, and then choose the labeling that maximizes this probability. However, since there are $k^n$ possible labelings for a tag set of size k and a sentence of length n, this approach would have to check an exponential number of labelings.&lt;/p&gt;

&lt;p&gt;A better way is to realize that (linear-chain) CRFs satisfy an &lt;a href=&quot;http://en.wikipedia.org/wiki/Optimal_substructure&quot;&gt;optimal substructure&lt;/a&gt; property that allows us to use a (polynomial-time) dynamic programming algorithm to find the optimal labeling, similar to the &lt;a href=&quot;http://en.wikipedia.org/wiki/Viterbi_algorithm&quot;&gt;Viterbi algorithm&lt;/a&gt; for HMMs.&lt;/p&gt;
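
&lt;p&gt;A rough Python sketch of that dynamic program follows. It assumes a non-empty sentence given as a list of tokens, and a hypothetical &lt;code&gt;local_score&lt;/code&gt; function that returns the weighted sum of all features at one position (the weights and rules inside it are invented). For each position and tag, it keeps the best score of any labeling ending in that tag, so the work is proportional to the sentence length times the square of the number of tags.&lt;/p&gt;

```python
def viterbi(s, tags, local_score):
    # best[i][t] = (best score of a labeling of words 0..i ending in tag t,
    #               the previous tag on that best path)
    best = [{t: (local_score(s, 0, t, "START"), None) for t in tags}]
    for i in range(1, len(s)):
        column = {}
        for cur in tags:
            column[cur] = max(
                (best[i - 1][prev][0] + local_score(s, i, cur, prev), prev)
                for prev in tags
            )
        best.append(column)

    # Pick the best final tag, then follow the back-pointers.
    last = max(tags, key=lambda t: best[-1][t][0])
    labels = [last]
    for i in range(len(s) - 1, 0, -1):
        labels.append(best[i][labels[-1]][1])
    return list(reversed(labels))


def local_score(s, i, cur, prev):
    # Hypothetical weighted features, summed at a single position.
    total = 0.0
    if cur == "ADVERB" and s[i].endswith("ly"):
        total += 2.0
    if prev == "NOUN" and cur == "VERB":
        total += 1.0
    if i == 0 and cur == "NOUN":
        total += 0.5
    return total


tags = ["NOUN", "VERB", "ADVERB"]
sentence = ["Bob", "ran", "quickly"]
best_labels = viterbi(sentence, tags, local_score)
```

&lt;p&gt;Because the normalizer is the same for every labeling of a given sentence, maximizing the score is enough; there is no need to compute $p(l | s)$ itself.&lt;/p&gt;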

&lt;h1&gt;A More Interesting Application&lt;/h1&gt;

&lt;p&gt;Okay, so part-of-speech tagging is kind of boring, and there are plenty of existing POS taggers out there. When might you use a CRF in real life?&lt;/p&gt;

&lt;p&gt;Suppose you want to mine Twitter for the types of presents people received for Christmas:&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet&quot;&gt;&lt;p&gt;What people on Twitter wanted for Christmas, and what they got: &lt;a href=&quot;http://t.co/EGeKTBgF&quot; title=&quot;http://twitter.com/edchedch/status/153683967315419136/photo/1&quot;&gt;twitter.com/edchedch/statu…&lt;/a&gt;&lt;/p&gt;— Edwin Chen (@edchedch) &lt;a href=&quot;https://twitter.com/edchedch/status/153683967315419136&quot; data-datetime=&quot;2012-01-02T03:48:10+00:00&quot;&gt;January 2, 2012&lt;/a&gt;&lt;/blockquote&gt;


&lt;script src=&quot;http://blog.echen.me//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;


&lt;p&gt;(Yes, I just embedded a tweet. BOOM.)&lt;/p&gt;

&lt;p&gt;How can you figure out which words refer to gifts?&lt;/p&gt;

&lt;p&gt;To gather data for the graphs above, I simply looked for phrases of the form &amp;#8220;I want XXX for Christmas&amp;#8221; and &amp;#8220;I got XXX for Christmas&amp;#8221;. However, a more sophisticated CRF variant could use a GIFT part-of-speech-like tag (even adding other tags like GIFT-GIVER and GIFT-RECEIVER, to get even more information on who got what from whom) and treat this like a POS tagging problem. Features could be based around things like &amp;#8220;this word is a GIFT if the previous word was a GIFT-RECEIVER and the word before that was &amp;#8216;gave&amp;#8217;&amp;#8221; or &amp;#8220;this word is a GIFT if the next two words are &amp;#8216;for Christmas&amp;#8217;&amp;#8221;.&lt;/p&gt;

&lt;h1&gt;Fin&lt;/h1&gt;

&lt;p&gt;I&amp;#8217;ll end with some more random thoughts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I explicitly skipped over the graphical models framework that conditional random fields sit in, because I don&amp;#8217;t think it adds much to an initial understanding of CRFs. But if you&amp;#8217;re interested in learning more, Daphne Koller is teaching a free, online course on &lt;a href=&quot;http://www.pgm-class.org/&quot;&gt;graphical models&lt;/a&gt; starting in January.&lt;/li&gt;
&lt;li&gt;Or, if you&amp;#8217;re more interested in the many NLP applications of CRFs (like part-of-speech tagging or &lt;a href=&quot;http://en.wikipedia.org/wiki/Named-entity_recognition&quot;&gt;named entity extraction&lt;/a&gt;), Manning and Jurafsky are teaching an &lt;a href=&quot;http://www.nlp-class.org/&quot;&gt;NLP class&lt;/a&gt; in the same spirit.&lt;/li&gt;
&lt;li&gt;I also glossed a bit over the analogy between CRFs:HMMs and Logistic Regression:Naive Bayes. This image (from &lt;a href=&quot;http://arxiv.org/pdf/1011.4088v1&quot;&gt;Sutton and McCallum&amp;#8217;s introduction to conditional random fields&lt;/a&gt;) sums it up, and shows the graphical model nature of CRFs as well:&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/crfs/crf-diagram.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/crfs/crf-diagram.png&quot; alt=&quot;CRF Diagram&quot; /&gt;&lt;/a&gt;&lt;/p&gt;</content>
  </entry>
  
  <entry>
    <title>Winning the Netflix Prize: A Summary</title>
    <link href="http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/"/>
    <updated>2011-10-24T16:27:01-07:00</updated>
    <id>http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary</id>
    <content type="html">&lt;p&gt;How was the &lt;a href=&quot;http://en.wikipedia.org/wiki/Netflix_Prize&quot;&gt;Netflix Prize&lt;/a&gt; won? I went through a lot of the Netflix Prize papers a couple years ago, so I&amp;#8217;ll try to give an overview of the techniques that went into the winning solution here.&lt;/p&gt;

&lt;h1&gt;Normalization of Global Effects&lt;/h1&gt;

&lt;p&gt;Suppose Alice rates Inception 4 stars. We can think of this rating as composed of several parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;baseline rating&lt;/strong&gt; (e.g., maybe the mean over all user-movie ratings is 3.1 stars).&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;Alice-specific effect&lt;/strong&gt; (e.g., maybe Alice tends to rate movies lower than the average user, so her ratings are 0.5 stars lower than we normally expect).&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;Inception-specific effect&lt;/strong&gt; (e.g., Inception is a pretty awesome movie, so its ratings are 0.7 stars higher than we normally expect).&lt;/li&gt;
&lt;li&gt;A less predictable effect based on the &lt;strong&gt;specific interaction&lt;/strong&gt; between Alice and Inception that accounts for the remainder of the stars (e.g., Alice really liked Inception because of its particular combination of Leonardo DiCaprio and neuroscience, so this rating gets an additional 0.7 stars).&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;In other words, we&amp;#8217;ve decomposed the 4-star rating into:
4 = [3.1 (the baseline rating) - 0.5 (the Alice effect) + 0.7 (the Inception effect)] + 0.7 (the specific interaction)&lt;/p&gt;

&lt;p&gt;So instead of having our models predict the 4-star rating itself, we could first try to remove the effect of the baseline predictors (the first three components) and have them predict the specific 0.7 stars. (I guess you can also think of this as a simple kind of boosting.)&lt;/p&gt;

&lt;p&gt;More generally, additional baseline predictors include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A factor that allows Alice&amp;#8217;s rating to (linearly) depend on the (square root of the) &lt;strong&gt;number of days since her first rating&lt;/strong&gt;. (For example, have you ever noticed that you become a harsher critic over time?)&lt;/li&gt;
&lt;li&gt;A factor that allows Alice&amp;#8217;s rating to depend on the &lt;strong&gt;number of days since the movie&amp;#8217;s first rating by anyone&lt;/strong&gt;. (If you&amp;#8217;re one of the first people to watch it, maybe it&amp;#8217;s because you&amp;#8217;re a huge fan and really excited to see it on DVD, so you&amp;#8217;ll tend to rate it higher.)&lt;/li&gt;
&lt;li&gt;A factor that allows Alice&amp;#8217;s rating to depend on the &lt;strong&gt;number of people who have rated Inception&lt;/strong&gt;. (Maybe Alice is a hipster who hates being part of the crowd.)&lt;/li&gt;
&lt;li&gt;A factor that allows Alice&amp;#8217;s rating to &lt;strong&gt;depend on the movie&amp;#8217;s overall rating&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;(Plus a bunch of others.)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;And, in fact, modeling these biases turned out to be fairly important: in their paper describing their final solution to the Netflix Prize, Bell and Koren write that&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Of the numerous new algorithmic contributions, I would like to highlight one &amp;#8211; those humble baseline predictors (or biases), which capture main effects in the data. While the literature mostly concentrates on the more sophisticated algorithmic aspects, we have learned that an accurate treatment of main effects is probably at least as significant as coming up with modeling breakthroughs.&lt;/p&gt;&lt;/blockquote&gt;


&lt;p&gt;(For a perhaps more concrete example of why removing these biases is useful, suppose you know that Bob likes the same kinds of movies that Alice does. To predict Bob&amp;#8217;s rating of Inception, instead of simply predicting the same 4 stars that Alice rated, if we know that Bob tends to rate movies 0.3 stars higher than average, then we could first remove Alice&amp;#8217;s bias and then add in Bob&amp;#8217;s: 4 + 0.5 + 0.3 = 4.8.)&lt;/p&gt;
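
&lt;p&gt;The arithmetic in this section is easy to spell out in code. This is only a sketch using the made-up numbers from the examples above; the function names are mine.&lt;/p&gt;

```python
def baseline(global_mean, user_bias, movie_bias):
    # The part of a rating explained by the first three components.
    return global_mean + user_bias + movie_bias


def residual(rating, global_mean, user_bias, movie_bias):
    # What is left over for the more sophisticated models to predict.
    return rating - baseline(global_mean, user_bias, movie_bias)


GLOBAL_MEAN = 3.1
ALICE_BIAS = -0.5
BOB_BIAS = 0.3
INCEPTION_BIAS = 0.7

# Alice rated Inception 4 stars, so roughly 0.7 stars are left unexplained.
alice_residual = residual(4.0, GLOBAL_MEAN, ALICE_BIAS, INCEPTION_BIAS)

# Bob shares Alice's taste: reuse her residual, but apply his own bias.
# This comes to roughly 4.8 stars, the same as 4 + 0.5 + 0.3.
bob_prediction = baseline(GLOBAL_MEAN, BOB_BIAS, INCEPTION_BIAS) + alice_residual
```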

&lt;h1&gt;Neighborhood Models&lt;/h1&gt;

&lt;p&gt;Let&amp;#8217;s now look at some slightly more sophisticated models. As alluded to in the section above, one of the standard approaches to collaborative filtering is to use neighborhood models.&lt;/p&gt;

&lt;p&gt;Briefly, a neighborhood model works as follows. To predict Alice&amp;#8217;s rating of Titanic, you could do two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Item-item approach&lt;/strong&gt;: find a set of items similar to Titanic that Alice has also rated, and take the (weighted) mean of Alice&amp;#8217;s ratings on them.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User-user approach&lt;/strong&gt;: find a set of users similar to Alice who rated Titanic, and again take the mean of their ratings of Titanic.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;(See also my post on &lt;a href=&quot;http://blog.echen.me/2011/02/15/an-overview-of-item-to-item-collaborative-filtering-with-amazons-recommendation-system/&quot;&gt;item-to-item collaborative filtering on Amazon&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;The main questions, then, are (let&amp;#8217;s stick to the item-item approach for simplicity):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do we find the set of similar items?&lt;/li&gt;
&lt;li&gt;How do we weight these items when taking their mean?&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The standard approach is to take some similarity metric (e.g., correlation or a Jaccard index) to define similarities between pairs of movies, take the K most similar movies under this metric (where K is perhaps chosen via cross-validation), and then use the same similarity metric when computing the weighted mean.&lt;/p&gt;
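
&lt;p&gt;As a sketch of this standard item-item approach in Python: assume the similarities between the target movie and every other movie have already been computed (the numbers below are invented), keep the K most similar movies the user has rated, and take their similarity-weighted mean.&lt;/p&gt;

```python
def predict_item_item(user_ratings, similarities, k=2):
    # user_ratings: movie name to this user's rating.
    # similarities: movie name to its similarity with the target movie.
    rated = [movie for movie in user_ratings if movie in similarities]
    neighbors = sorted(rated, key=lambda movie: similarities[movie], reverse=True)[:k]
    total_weight = sum(similarities[movie] for movie in neighbors)
    weighted_sum = sum(similarities[movie] * user_ratings[movie] for movie in neighbors)
    return weighted_sum / total_weight


# Hypothetical data: Alice's ratings, and each movie's similarity to Titanic.
alice_ratings = {"Avatar": 4.0, "The Notebook": 5.0, "Saw": 1.0}
similarity_to_titanic = {"Avatar": 0.6, "The Notebook": 0.9, "Saw": 0.1}

# Uses The Notebook and Avatar: (0.9 * 5.0 + 0.6 * 4.0) / (0.9 + 0.6).
alice_titanic = predict_item_item(alice_ratings, similarity_to_titanic, k=2)
```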

&lt;p&gt;This has a couple problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Neighbors aren&amp;#8217;t independent&lt;/strong&gt;, so using a standard similarity metric to define a weighted mean overcounts information. For example, suppose you ask five friends where you should eat tonight. Three of them went to Mexico last week and are sick of burritos, so they strongly recommend against a taqueria. Thus, your friends&amp;#8217; recommendations have a stronger bias than what you&amp;#8217;d get if you asked five friends who didn&amp;#8217;t know each other at all. (Compare with the situation where all three Lord of the Rings movies are neighbors of Harry Potter.)&lt;/li&gt;
&lt;li&gt;Different movies should perhaps be using &lt;strong&gt;different numbers of neighbors&lt;/strong&gt;. Some movies may be predicted well by only one neighbor (e.g., Harry Potter 2 could be predicted well by Harry Potter 1 alone), some movies may require more, and some movies may have no good neighbors (so you should ignore your neighborhood algorithms entirely and let your other ratings models stand on their own).&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So another approach is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can still use a similarity metric like correlation or cosine similarity to choose the set of similar items.&lt;/li&gt;
&lt;li&gt;But instead of using the similarity metric to define the interpolation weights in the mean calculations, you essentially perform a (sparse) &lt;strong&gt;linear regression to find the weights&lt;/strong&gt; that minimize the squared error between an item&amp;#8217;s rating and a linear combination of the ratings of its neighbors. Note that these weights are no longer constrained, so that if all neighbors are weak, then their weights will be close to zero and the neighborhood model will have a low effect.&lt;/li&gt;
&lt;/ul&gt;
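
&lt;p&gt;To illustrate the idea of learning the interpolation weights, here is a deliberately simplified Python sketch: plain least squares fit by batch gradient descent, with no sparsity handling or regularization, on a tiny invented dataset of baseline-removed ratings. It is not the method from the Netflix Prize papers, just the bare regression step.&lt;/p&gt;

```python
def fit_neighbor_weights(neighbor_ratings, target_ratings, lr=0.1, steps=500):
    # Minimize the mean squared error between the target movie's ratings and
    # a linear combination of its neighbors' ratings.
    n_neighbors = len(neighbor_ratings[0])
    n_rows = len(target_ratings)
    weights = [0.0] * n_neighbors
    for _ in range(steps):
        grads = [0.0] * n_neighbors
        for row, target in zip(neighbor_ratings, target_ratings):
            err = sum(w * x for w, x in zip(weights, row)) - target
            for k, x in enumerate(row):
                grads[k] += 2.0 * err * x / n_rows
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# One row per user; columns are residual ratings of two neighbors:
# [Harry Potter 1, some unrelated movie].
neighbor_ratings = [[1.0, 0.5], [-1.0, 0.5], [0.5, -1.0], [-0.5, 0.0]]
# Residual ratings of Harry Potter 2, which here simply match Harry Potter 1.
target_ratings = [1.0, -1.0, 0.5, -0.5]

weights = fit_neighbor_weights(neighbor_ratings, target_ratings)
```

&lt;p&gt;On this toy data the fitted weight for Harry Potter 1 is close to 1 and the weight for the unrelated movie is close to 0, so the weak neighbor has almost no effect on the prediction.&lt;/p&gt;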


&lt;p&gt;(A slightly more complicated user-user approach, similar to this item-item neighborhood approach, is also useful.)&lt;/p&gt;

&lt;h1&gt;Implicit Data&lt;/h1&gt;

&lt;p&gt;Adding on to the neighborhood approach, we can also let &lt;strong&gt;implicit data influence our predictions&lt;/strong&gt;. The mere fact that a user rated lots of science fiction movies but no westerns suggests that the user likes science fiction better than cowboys. So using a framework similar to the one in the neighborhood ratings model, we can learn for Inception a set of &lt;strong&gt;offset weights&lt;/strong&gt; associated to Inception&amp;#8217;s movie neighbors.&lt;/p&gt;

&lt;p&gt;Whenever we want to predict how Bob rates Inception, we look at whether Bob rated each of Inception&amp;#8217;s neighbors. If he did, we add in the corresponding offset; if not, then we add nothing (and, thus, Bob&amp;#8217;s rating is implicitly penalized by the missing weight).&lt;/p&gt;
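
&lt;p&gt;In code, the prediction step described above might look like this. The offsets and the base prediction are invented, and I&amp;#8217;ve left out any normalization by the number of neighbors that a real model might apply.&lt;/p&gt;

```python
def predict_with_offsets(base_prediction, rated_movies, offsets):
    # Add the learned offset of every neighbor the user has rated;
    # neighbors the user has not rated contribute nothing.
    return base_prediction + sum(
        offset for movie, offset in offsets.items() if movie in rated_movies
    )


# Hypothetical offsets learned for Inception's neighbors.
inception_offsets = {"The Matrix": 0.3, "Blade Runner": 0.2, "True Grit": -0.1}

# Movies Bob has rated (the ratings themselves are not used here).
bob_rated = {"The Matrix", "True Grit", "Titanic"}

# 3.3 + 0.3 (The Matrix) - 0.1 (True Grit); Blade Runner is skipped.
bob_prediction = predict_with_offsets(3.3, bob_rated, inception_offsets)
```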

&lt;h1&gt;Matrix Factorization&lt;/h1&gt;

&lt;p&gt;Complementing the neighborhood approach to collaborative filtering is the matrix factorization approach. Whereas the neighborhood approach takes a very local approach to ratings (if you liked Harry Potter 1, then you&amp;#8217;ll like Harry Potter 2!), the factorization approach takes a more global view (we know that you like fantasy movies and that Harry Potter has a strong fantasy element, so we think that you&amp;#8217;ll like Harry Potter) that &lt;strong&gt;decomposes users and movies into a set of latent factors&lt;/strong&gt; (which we can think of as categories like &amp;#8220;fantasy&amp;#8221; or &amp;#8220;violence&amp;#8221;).&lt;/p&gt;

&lt;p&gt;In fact, matrix factorization methods were probably the most important class of techniques for winning the Netflix Prize. In their 2008 Progress Prize paper, Bell and Koren write&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;It seems that models based on matrix-factorization were found to be most accurate (and thus popular), as evident by recent publications and discussions on the Netflix Prize forum. We definitely agree to that, and would like to add that those matrix-factorization models also offer the important flexibility needed for modeling temporal effects and the binary view. Nonetheless, neighborhood models, which have been dominating most of the collaborative filtering literature, are still expected to be popular due to their practical characteristics - being able to handle new users/ratings without re-training and offering direct explanations to the recommendations.&lt;/p&gt;&lt;/blockquote&gt;


&lt;p&gt;The typical way to perform matrix factorizations is to perform a &lt;strong&gt;singular value decomposition&lt;/strong&gt; on the (sparse) ratings matrix (using stochastic gradient descent and regularizing the weights of the factors, possibly constraining the weights to be positive to get a type of non-negative matrix factorization). (Note that this &amp;#8220;SVD&amp;#8221; is a little different from the standard SVD learned in linear algebra, since not every user has rated every movie and so the ratings matrix contains many missing elements that we don&amp;#8217;t want to simply treat as 0.)&lt;/p&gt;

&lt;p&gt;Some SVD-inspired methods used in the Netflix Prize include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Standard SVD&lt;/strong&gt;: Once you&amp;#8217;ve represented users and movies as factor vectors, you can dot product Alice&amp;#8217;s vector with Inception&amp;#8217;s vector to get Alice&amp;#8217;s predicted rating of Inception.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Asymmetric SVD&lt;/strong&gt;: Instead of users having their own notion of factor vectors, we can represent users as a bag of items they have rated (or provided implicit feedback for). So Alice is now represented as a (possibly weighted) sum of the factor vectors of the items she has rated, and to get her predicted rating of Titanic, we can dot product this representation with the factor vector of Titanic. From a practical perspective, this model has an added benefit in that no user parameterizations are needed, so we can use this approach to generate recommendations as soon as a user provides some feedback (which could just be views or clicks on an item, and not necessarily ratings), without needing to retrain the model to factorize the user.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SVD++&lt;/strong&gt;: Incorporate both the standard SVD and the asymmetric SVD model by representing users both by their own factor representation and as a bag of item vectors.&lt;/li&gt;
&lt;/ul&gt;
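
&lt;p&gt;The asymmetric idea above can be sketched in a few lines. This is a simplified illustration: the item vectors are made-up numbers, the same vectors are used both to describe the user and to score the target item, and the normalization by the square root of the number of rated items is an assumption borrowed from common formulations rather than something stated in this post.&lt;/p&gt;

```python
def user_vector(rated_items, item_factors, weights=None):
    """Represent a user as a normalised, optionally weighted sum of item vectors."""
    dim = len(next(iter(item_factors.values())))
    vec = [0.0] * dim
    for item in rated_items:
        w = 1.0 if weights is None else weights[item]
        for f in range(dim):
            vec[f] += w * item_factors[item][f]
    # Scale by |rated items| ** -0.5 so heavy raters do not get huge vectors.
    scale = len(rated_items) ** -0.5 if rated_items else 0.0
    return [scale * v for v in vec]

def predict(rated_items, target, item_factors, weights=None):
    """Dot the bag-of-items user vector with the target item's factor vector."""
    u = user_vector(rated_items, item_factors, weights)
    return sum(a * b for a, b in zip(u, item_factors[target]))

# Hypothetical 2-factor item vectors (say, "sci-fi" and "romance").
item_factors = {
    "Inception": [1.0, 0.0],
    "The Notebook": [0.0, 1.0],
    "Titanic": [0.5, 0.5],
}
alice_rated = ["Inception", "The Notebook"]
alice_titanic = predict(alice_rated, "Titanic", item_factors)
print(alice_titanic)
```

&lt;p&gt;Note that nothing here is specific to Alice except the list of items she touched, which is why a brand-new user can be scored without retraining.&lt;/p&gt;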


&lt;h1&gt;Regression&lt;/h1&gt;

&lt;p&gt;Some regression models were also used in the predictions. The models are fairly standard, I think, so I won&amp;#8217;t spend too long here. Basically, just as with the neighborhood models, we can take a user-centric approach and a movie-centric approach to regression:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;User-centric approach&lt;/strong&gt;: We learn a regression model for each user, using all the movies that the user rated as the dataset. The response is the movie&amp;#8217;s rating, and the predictor variables are attributes associated with that movie (which can be derived from, say, PCA, MDS, or an SVD).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Movie-centric approach&lt;/strong&gt;: Similarly, we can learn a regression model for each movie, using all the users that rated the movie as the dataset.&lt;/li&gt;
&lt;/ul&gt;
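
&lt;p&gt;As a rough sketch of the user-centric variant, the snippet below fits a small ridge regression for a single user, with the movie attributes as predictors. The attribute values and ratings are invented for illustration, and the choice of ridge (rather than some other regression) is an assumption.&lt;/p&gt;

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_user_ridge(features, ratings, lam=0.1):
    """Ridge regression for one user: rating ~ intercept + w . movie_features."""
    X = [[1.0] + list(f) for f in features]
    k = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    for a in range(1, k):  # leave the intercept unpenalised
        XtX[a][a] += lam
    Xty = [sum(row[a] * y for row, y in zip(X, ratings)) for a in range(k)]
    return solve(XtX, Xty)

def predict_rating(weights, movie_features):
    """Apply one user's fitted weights to a movie's attribute vector."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], movie_features))

# Hypothetical movie attributes (e.g. two SVD-derived factors) for movies
# that one user has rated, together with that user's ratings.
movie_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
user_ratings = [2.0, 4.0, 1.0, 3.0]
weights = fit_user_ridge(movie_features, user_ratings)
print(predict_rating(weights, [1.0, 0.5]))
```

&lt;p&gt;The movie-centric version is the same code with the roles swapped: one model per movie, with user attributes as the predictors.&lt;/p&gt;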


&lt;h1&gt;Restricted Boltzmann Machines&lt;/h1&gt;

&lt;p&gt;Restricted Boltzmann Machines provide another kind of &lt;strong&gt;latent factor approach&lt;/strong&gt; that can be used. See &lt;a href=&quot;http://www.machinelearning.org/proceedings/icml2007/papers/407.pdf&quot;&gt;this paper&lt;/a&gt; for a description of how to apply them to the Netflix Prize. (In case the paper&amp;#8217;s a little difficult to read, I wrote an &lt;a href=&quot;http://blog.echen.me/2011/07/18/introduction-to-restricted-boltzmann-machines&quot;&gt;introduction to RBMs&lt;/a&gt; a little while ago.)&lt;/p&gt;

&lt;h1&gt;Temporal Effects&lt;/h1&gt;

&lt;p&gt;Many of the models incorporate temporal effects. For example, when describing the baseline predictors above, we used a few temporal predictors that allowed a user&amp;#8217;s rating to (linearly) depend on the time since the first rating that user ever made and on the time since a movie&amp;#8217;s first rating. We can also get more fine-grained temporal effects by, say, binning ratings into periods of a couple of months each, and allowing each movie&amp;#8217;s bias to differ from bin to bin. (For example, maybe in May 2006, Time Magazine nominated Titanic as the best movie ever made, which caused a spurt in glowing ratings around that time.)&lt;/p&gt;

&lt;p&gt;In the matrix factorization approach, user factors were also allowed to be time-dependent (e.g., maybe Bob comes to like comedy movies more and more over time). We can also give more weight to recent user actions.&lt;/p&gt;
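
&lt;p&gt;Here is a small sketch of the binning idea for movie biases. The bin width of 60 days, the tuple layout, and the sample ratings are all assumptions made for illustration.&lt;/p&gt;

```python
from collections import defaultdict

def binned_movie_biases(ratings, global_mean, bin_days=60):
    """Average residual per (movie, time bin).

    ratings is a list of (movie, day, rating) triples, where day counts
    days since some fixed start date.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for movie, day, rating in ratings:
        key = (movie, day // bin_days)
        totals[key] += rating - global_mean
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Hypothetical ratings: Titanic gets a burst of glowing ratings later on.
ratings = [
    ("Titanic", 5, 3), ("Titanic", 20, 3),
    ("Titanic", 100, 5), ("Titanic", 110, 5),
]
biases = binned_movie_biases(ratings, global_mean=4.0)
print(biases)
```

&lt;p&gt;A model can then look up the bias for the bin that a rating&amp;#8217;s date falls into instead of using a single bias per movie.&lt;/p&gt;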

&lt;h1&gt;Regularization&lt;/h1&gt;

&lt;p&gt;Regularization was also applied throughout pretty much all the models learned, to &lt;strong&gt;prevent overfitting&lt;/strong&gt; on the dataset. Ridge regression was heavily used in the factorization models to penalize large weights, and lasso regression (though less effective) was useful as well. Many other parameters (e.g., the baseline predictors, similarity weights and interpolation weights in the neighborhood models) were also estimated using fairly standard shrinkage techniques.&lt;/p&gt;
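
&lt;p&gt;One simple form of shrinkage, shown here as an illustrative sketch, pulls an estimated bias toward zero when only a few ratings support it. The penalty value &lt;code&gt;lam&lt;/code&gt; is a made-up default, not a tuned constant from the papers.&lt;/p&gt;

```python
def shrunk_mean(residuals, lam=25.0):
    """Mean of the residuals, shrunk toward 0 by adding lam pseudo-observations."""
    return sum(residuals) / (lam + len(residuals))

# A movie with 2 ratings, each 1 star above the global mean, is barely trusted...
print(shrunk_mean([1.0, 1.0]))
# ...while the same signal from 100 ratings is mostly kept.
print(shrunk_mean([1.0] * 100))
```

&lt;p&gt;As the number of ratings grows, the estimate approaches the plain average, so the penalty mainly affects sparsely rated movies and users.&lt;/p&gt;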

&lt;h1&gt;Ensemble Methods&lt;/h1&gt;

&lt;p&gt;Finally, let&amp;#8217;s talk about how all of these different algorithms were combined to provide a single rating that &lt;strong&gt;exploits the strengths of each model&lt;/strong&gt;. (Note that, as mentioned above, many of these models were not trained on the raw ratings data directly, but rather on the residuals of other models.)&lt;/p&gt;

&lt;p&gt;In the paper detailing their final solution, the winners describe using &lt;strong&gt;gradient boosted decision trees to combine over 500 models&lt;/strong&gt;; previous solutions instead used a &lt;strong&gt;linear regression&lt;/strong&gt; to combine the predictors.&lt;/p&gt;
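
&lt;p&gt;For intuition about the linear-regression style of blending, here is a minimal sketch that finds least-squares weights for combining just two predictors. The prediction lists and held-out ratings are invented, and a real blend would use many more models and a held-out set.&lt;/p&gt;

```python
def blend_two(pred_a, pred_b, targets):
    """Least-squares weights (w_a, w_b) for targets ~ w_a * a + w_b * b."""
    saa = sum(a * a for a in pred_a)
    sbb = sum(b * b for b in pred_b)
    sab = sum(a * b for a, b in zip(pred_a, pred_b))
    sat = sum(a * t for a, t in zip(pred_a, targets))
    sbt = sum(b * t for b, t in zip(pred_b, targets))
    # Solve the 2x2 normal equations with Cramer's rule.
    det = saa * sbb - sab * sab
    return (sat * sbb - sbt * sab) / det, (saa * sbt - sab * sat) / det

# Hypothetical predictions from two models on four held-out ratings.
model_a = [3.0, 4.0, 2.0, 5.0]
model_b = [4.0, 2.0, 3.0, 5.0]
actual = [3.75, 2.5, 2.75, 5.0]
w_a, w_b = blend_two(model_a, model_b, actual)
print(w_a, w_b)
```

&lt;p&gt;The fitted weights say how much to trust each model; the blended prediction is just the weighted sum of the individual predictions.&lt;/p&gt;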

&lt;p&gt;Briefly, gradient boosted decision trees work by sequentially fitting a series of decision trees to the data; each tree is asked to predict the error made by the previous trees, and is often trained on slightly perturbed versions of the data. (For a longer description of a similar technique, see &lt;a href=&quot;http://blog.echen.me/2011/03/14/laymans-introduction-to-random-forests/&quot;&gt;my introduction to random forests&lt;/a&gt;.)&lt;/p&gt;
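
&lt;p&gt;The sequential fitting described above can be sketched with the simplest possible trees: one-split &amp;#8220;stumps&amp;#8221; on a single feature, with squared error as the loss. Everything here (names, learning rate, toy data) is an illustrative assumption, and the random perturbation of the training data is left out for brevity.&lt;/p&gt;

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump (needs two or more distinct xs)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if threshold >= x]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or best[0] > sse:
            best = (sse, threshold, lmean, rmean)
    return best[1:]

def stump_predict(stump, x):
    """Return the left-leaf mean if x is at or below the threshold."""
    threshold, lmean, rmean = stump
    return lmean if threshold >= x else rmean

def gradient_boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Each new stump is fit to the errors left by the stumps before it."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def boosted_predict(model, x):
    """Sum the base value and every stump's scaled correction."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

# Hypothetical data: the target jumps from 1 to 5 once x passes 3.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = gradient_boost(xs, ys)
print(boosted_predict(model, 2), boosted_predict(model, 5))
```

&lt;p&gt;In a blending setting, the features would be the predictions of the individual models (plus any extra predictors), and the target would be the true rating.&lt;/p&gt;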

&lt;p&gt;Since GBDTs have a built-in ability to apply different methods to different slices of the data, we can add in some predictors that help the trees make useful clusterings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of movies each user rated&lt;/li&gt;
&lt;li&gt;Number of users that rated each movie&lt;/li&gt;
&lt;li&gt;Factor vectors of users and movies&lt;/li&gt;
&lt;li&gt;Hidden units of a restricted Boltzmann Machine&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;(For example, one thing that Bell and Koren found (when using an earlier ensemble method) was that RBMs are more useful when the movie or the user has a low number of ratings, and that matrix factorization methods are more useful when the movie or user has a high number of ratings.)&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s a graph of the effect of ensemble size from early on in the competition (in 2007), and the authors&amp;#8217; take on it:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www2.research.att.com/~volinsky/netflix/newensemble.gif&quot;&gt;&lt;img src=&quot;http://www2.research.att.com/~volinsky/netflix/newensemble.gif&quot; alt=&quot;Ensemble Size vs. RMSE&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;However, we would like to stress that it is not necessary to have such a large number of models to do well. The plot below shows RMSE as a function of the number of methods used. One can achieve our winning score (RMSE=0.8712) with less than 50 methods, using the best 3 methods can yield RMSE &amp;lt; 0.8800, which would land in the top 10. Even just using our single best method puts us on the leaderboard with an RMSE of 0.8890. The lesson here is that having lots of models is useful for the incremental results needed to win competitions, but practically, excellent systems can be built with just a few well-selected models.&lt;/p&gt;&lt;/blockquote&gt;

</content>
  </entry>
  
  <entry>
    <title>Stuff Harvard People Like</title>
    <link href="http://blog.echen.me/2011/09/29/stuff-harvard-people-like/"/>
    <updated>2011-09-29T10:48:09-07:00</updated>
    <id>http://blog.echen.me/2011/09/29/stuff-harvard-people-like</id>
    <content type="html">&lt;p&gt;What types of students go to which schools? There are, of course, the classic stereotypes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MIT&lt;/strong&gt; has the hacker engineers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stanford&lt;/strong&gt; has the laid-back, social folks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Harvard&lt;/strong&gt; has the prestigious leaders of the world.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Berkeley&lt;/strong&gt; has the activist hippies.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Caltech&lt;/strong&gt; has the hardcore science nerds.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;But how well do these perceptions match reality? What are students at Stanford, Harvard, MIT, Caltech, and Berkeley &lt;em&gt;really&lt;/em&gt; interested in? Following the path of my previous data-driven post on &lt;a href=&quot;http://blog.echen.me/2011/04/18/twifferences-between-californians-and-new-yorkers/&quot;&gt;differences between Silicon Valley and NYC&lt;/a&gt;, I scraped the Quora profiles of a couple hundred followers of each school to find out.&lt;/p&gt;

&lt;h1&gt;Topics&lt;/h1&gt;

&lt;p&gt;So let&amp;#8217;s look at what kinds of topics followers of each school are interested in*. (Skip past the lists for a discussion.)&lt;/p&gt;

&lt;h2&gt;MIT&lt;/h2&gt;

&lt;p&gt;Topics are followed by p(school = MIT|topic).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;MIT Media Lab&lt;/strong&gt;                             0.893&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ksplice&lt;/strong&gt;                                   0.69&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lisp (programming language)&lt;/strong&gt;               0.677&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nokia&lt;/strong&gt;                                     0.659&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Public Speaking&lt;/strong&gt;                           0.65&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Storage&lt;/strong&gt;                              0.65&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Voice&lt;/strong&gt;                              0.609&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hacking&lt;/strong&gt;                                   0.602&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Startups in Europe&lt;/strong&gt;                        0.597&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Startup Names&lt;/strong&gt;                             0.572&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mechanical Engineering&lt;/strong&gt;                    0.563&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Engineering&lt;/strong&gt;                               0.563&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed Databases&lt;/strong&gt;                     0.544&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;StackOverflow&lt;/strong&gt;                             0.536&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Boston&lt;/strong&gt;                                    0.513&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Learning&lt;/strong&gt;                                  0.507&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open Source&lt;/strong&gt;                               0.498&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cambridge&lt;/strong&gt;                                 0.496&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Public Relations&lt;/strong&gt;                          0.493&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Visualization&lt;/strong&gt;                             0.492&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Semantic Web&lt;/strong&gt;                              0.486&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Andreessen-Horowitz&lt;/strong&gt;                       0.483&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nature&lt;/strong&gt;                                    0.475&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cryptography&lt;/strong&gt;                              0.474&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Startups in Boston&lt;/strong&gt;                        0.452&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Adobe Photoshop&lt;/strong&gt;                           0.451&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Computer Security&lt;/strong&gt;                         0.447&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sachin Tendulkar&lt;/strong&gt;                          0.443&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hacker News&lt;/strong&gt;                               0.442&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Games&lt;/strong&gt;                                     0.429&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Android Applications&lt;/strong&gt;                      0.428&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Best Engineers and Programmers&lt;/strong&gt;            0.427&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;College Admissions &amp;amp; Getting Into College&lt;/strong&gt; 0.422&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Co-Founders&lt;/strong&gt;                               0.419&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Big Data&lt;/strong&gt;                                  0.41&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;System Administration&lt;/strong&gt;                     0.4&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Biotechnology&lt;/strong&gt;                             0.398&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Higher Education&lt;/strong&gt;                          0.394&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NoSQL&lt;/strong&gt;                                     0.387&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User Experience&lt;/strong&gt;                           0.386&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Career Advice&lt;/strong&gt;                             0.377&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Artificial Intelligence&lt;/strong&gt;                   0.375&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;                               0.37&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Taylor Swift&lt;/strong&gt;                              0.368&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Search&lt;/strong&gt;                             0.368&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Functional Programming&lt;/strong&gt;                    0.365&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bing&lt;/strong&gt;                                      0.363&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bioinformatics&lt;/strong&gt;                            0.361&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How I Met Your Mother (TV series)&lt;/strong&gt;         0.361&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Operating Systems&lt;/strong&gt;                         0.356&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compilers&lt;/strong&gt;                                 0.355&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Chrome&lt;/strong&gt;                             0.354&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Management &amp;amp; Organizational Leadership&lt;/strong&gt;    0.35&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Literary Fiction&lt;/strong&gt;                          0.35&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Intelligence&lt;/strong&gt;                              0.348&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fight Club (1999 movie)&lt;/strong&gt;                   0.344&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hip Hop Music&lt;/strong&gt;                             0.34&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UX Design&lt;/strong&gt;                                 0.337&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Web Application Frameworks&lt;/strong&gt;                0.336&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Startups in New York City&lt;/strong&gt;                 0.333&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Book Recommendations&lt;/strong&gt;                      0.33&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Engineering Recruiting&lt;/strong&gt;                    0.33&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Search Engines&lt;/strong&gt;                            0.329&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Social Search&lt;/strong&gt;                             0.329&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Science&lt;/strong&gt;                              0.328&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;History&lt;/strong&gt;                                   0.328&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interaction Design&lt;/strong&gt;                        0.326&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Classification (machine learning)&lt;/strong&gt;         0.322&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Startup Incubators and Seed Programs&lt;/strong&gt;      0.321&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Graphic Design&lt;/strong&gt;                            0.321&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Product Design (software)&lt;/strong&gt;                 0.319&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The College Experience&lt;/strong&gt;                    0.319&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Writing&lt;/strong&gt;                                   0.319&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MapReduce&lt;/strong&gt;                                 0.318&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database Systems&lt;/strong&gt;                          0.315&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User Interfaces&lt;/strong&gt;                           0.314&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Literature&lt;/strong&gt;                                0.314&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;C (programming language)&lt;/strong&gt;                  0.314&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Television&lt;/strong&gt;                                0.314&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reading&lt;/strong&gt;                                   0.313&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Usability&lt;/strong&gt;                                 0.312&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Books&lt;/strong&gt;                                     0.312&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Computers&lt;/strong&gt;                                 0.311&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stealth Startups&lt;/strong&gt;                          0.311&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Daft Punk&lt;/strong&gt;                                 0.31&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Healthy Eating&lt;/strong&gt;                            0.309&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Innovation&lt;/strong&gt;                                0.309&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skiing&lt;/strong&gt;                                    0.305&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;JavaScript&lt;/strong&gt;                                0.304&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rock Music&lt;/strong&gt;                                0.304&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mozilla Firefox&lt;/strong&gt;                           0.304&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Self-Improvement&lt;/strong&gt;                          0.303&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;McKinsey &amp;amp; Company&lt;/strong&gt;                        0.302&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AngelList&lt;/strong&gt;                                 0.301&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Visualization&lt;/strong&gt;                        0.301&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cassandra (database)&lt;/strong&gt;                      0.301&lt;/li&gt;
&lt;/ul&gt;
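
&lt;p&gt;The scores in these lists are conditional probabilities of the form p(school | topic). As a rough sketch of how such a score could be estimated from scraped follower data, the snippet below applies Bayes&amp;#8217; rule with a uniform prior over schools. Both the uniform prior and the counts are assumptions for illustration; this is not necessarily the exact estimator behind the numbers in this post.&lt;/p&gt;

```python
def school_given_topic(followers_by_school, topic_followers_by_school):
    """Estimate p(school | topic), assuming every school is equally likely a priori.

    With a uniform prior, p(school | topic) is proportional to
    p(topic | school), the share of a school's followers who follow the topic.
    """
    rates = {
        school: topic_followers_by_school.get(school, 0) / total
        for school, total in followers_by_school.items()
    }
    norm = sum(rates.values())
    return {school: rate / norm for school, rate in rates.items()} if norm else {}

# Hypothetical counts: profiles sampled per school, and how many of
# those profiles follow a given topic.
sampled = {"MIT": 200, "Stanford": 100}
follow_topic = {"MIT": 60, "Stanford": 10}
scores = school_given_topic(sampled, follow_topic)
print(scores)
```

&lt;p&gt;Normalizing by the number of sampled followers per school keeps a school with a larger sample from dominating every topic.&lt;/p&gt;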


&lt;h2&gt;Stanford&lt;/h2&gt;

&lt;p&gt;Topics are followed by p(school = Stanford|topic).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Stanford Computer Science&lt;/strong&gt;                 0.951&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stanford Graduate School of Business&lt;/strong&gt;      0.939&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stanford&lt;/strong&gt;                                  0.896&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stanford Football&lt;/strong&gt;                         0.896&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stanford Cardinal&lt;/strong&gt;                         0.896&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Social Dance&lt;/strong&gt;                              0.847&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stanford University Courses&lt;/strong&gt;               0.847&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Romance&lt;/strong&gt;                                   0.769&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Instagram&lt;/strong&gt;                                 0.745&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;College Football&lt;/strong&gt;                          0.665&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mobile Location Applications&lt;/strong&gt;              0.634&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Online Communities&lt;/strong&gt;                        0.621&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interpersonal Relationships&lt;/strong&gt;               0.585&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Food &amp;amp; Restaurants in Palo Alto&lt;/strong&gt;           0.572&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Your 20s&lt;/strong&gt;                                  0.566&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Men&amp;#8217;s Fashion&lt;/strong&gt;                             0.548&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flipboard&lt;/strong&gt;                                 0.537&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inception (2010 movie)&lt;/strong&gt;                    0.535&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tumblr&lt;/strong&gt;                                    0.531&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;People Skills&lt;/strong&gt;                             0.522&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exercise&lt;/strong&gt;                                  0.52&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Joel Spolsky&lt;/strong&gt;                              0.516&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Valuations&lt;/strong&gt;                                0.515&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Social Network (2010 movie)&lt;/strong&gt;           0.513&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LeBron James&lt;/strong&gt;                              0.506&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Northern California&lt;/strong&gt;                       0.506&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evernote&lt;/strong&gt;                                  0.5&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quora Community&lt;/strong&gt;                           0.5&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Blogging&lt;/strong&gt;                                  0.49&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Downtown Palo Alto&lt;/strong&gt;                        0.487&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The College Experience&lt;/strong&gt;                    0.485&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consumer Internet&lt;/strong&gt;                         0.477&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Restaurants in San Francisco&lt;/strong&gt;              0.477&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chad Hurley&lt;/strong&gt;                               0.47&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Meditation&lt;/strong&gt;                                0.468&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Yishan Wong&lt;/strong&gt;                               0.466&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Arrested Development (TV series)&lt;/strong&gt;          0.463&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;fbFund&lt;/strong&gt;                                    0.457&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Best Engineers at X Company&lt;/strong&gt;               0.451&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Language&lt;/strong&gt;                                  0.45&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Words&lt;/strong&gt;                                     0.448&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Happiness&lt;/strong&gt;                                 0.447&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Path (company)&lt;/strong&gt;                            0.446&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Color Labs (startup)&lt;/strong&gt;                      0.446&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Palo Alto&lt;/strong&gt;                                 0.445&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Woot.com&lt;/strong&gt;                                  0.442&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Beer&lt;/strong&gt;                                      0.442&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PayPal&lt;/strong&gt;                                    0.441&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Women in Startups&lt;/strong&gt;                         0.438&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Techmeme&lt;/strong&gt;                                  0.433&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Women in Engineering&lt;/strong&gt;                      0.428&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Mission (San Francisco neighborhood)&lt;/strong&gt;  0.427&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;iPhone Applications&lt;/strong&gt;                       0.416&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Asana&lt;/strong&gt;                                     0.413&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monetization&lt;/strong&gt;                              0.412&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Repetitive Strain Injury (RSI)&lt;/strong&gt;            0.4&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;IDEO&lt;/strong&gt;                                      0.398&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Spotify&lt;/strong&gt;                                   0.397&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;San Francisco Giants&lt;/strong&gt;                      0.396&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fortune Magazine&lt;/strong&gt;                          0.389&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Love&lt;/strong&gt;                                      0.387&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Human-Computer Interaction&lt;/strong&gt;                0.382&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hip Hop Music&lt;/strong&gt;                             0.378&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Self-Improvement&lt;/strong&gt;                          0.378&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Food in San Francisco&lt;/strong&gt;                     0.375&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quora (company)&lt;/strong&gt;                           0.374&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quora Infrastructure&lt;/strong&gt;                      0.373&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;iPhone&lt;/strong&gt;                                    0.371&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Square (company)&lt;/strong&gt;                          0.369&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Social Psychology&lt;/strong&gt;                         0.369&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network Effects&lt;/strong&gt;                           0.366&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chris Sacca&lt;/strong&gt;                               0.365&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Walt Mossberg&lt;/strong&gt;                             0.364&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Salesforce.com&lt;/strong&gt;                            0.362&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sex&lt;/strong&gt;                                       0.361&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Etiquette&lt;/strong&gt;                                 0.361&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;David Pogue&lt;/strong&gt;                               0.361&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gowalla&lt;/strong&gt;                                   0.36&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;iOS Development&lt;/strong&gt;                           0.354&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Palantir Technologies&lt;/strong&gt;                     0.353&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mobile Computing&lt;/strong&gt;                          0.347&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sports&lt;/strong&gt;                                    0.346&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Video Games&lt;/strong&gt;                               0.345&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Burning Man&lt;/strong&gt;                               0.345&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Engineering Management&lt;/strong&gt;                    0.343&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cognitive Science&lt;/strong&gt;                         0.342&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dating &amp;amp; Relationships&lt;/strong&gt;                    0.341&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fred Wilson (venture investor)&lt;/strong&gt;            0.337&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Taiwan&lt;/strong&gt;                                    0.333&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Natural Language Processing&lt;/strong&gt;               0.33&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eric Schmidt&lt;/strong&gt;                              0.329&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Social Advice&lt;/strong&gt;                             0.329&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Engineering Recruiting&lt;/strong&gt;                    0.328&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Job Interviews&lt;/strong&gt;                            0.325&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mobile Phones&lt;/strong&gt;                             0.324&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Twitter Inc. (company)&lt;/strong&gt;                    0.321&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Engineering in Silicon Valley&lt;/strong&gt;             0.321&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;San Francisco Bay Area&lt;/strong&gt;                    0.321&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Analytics&lt;/strong&gt;                          0.32&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fashion&lt;/strong&gt;                                   0.315&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Interaction Design&lt;/strong&gt;                        0.314&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Open Graph&lt;/strong&gt;                                0.313&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Drugs &amp;amp; Pharmaceuticals&lt;/strong&gt;                   0.312&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Electronic Music&lt;/strong&gt;                          0.312&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Facebook Inc. (company)&lt;/strong&gt;                   0.309&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fitness&lt;/strong&gt;                                   0.309&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;YouTube&lt;/strong&gt;                                   0.308&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TED Talks&lt;/strong&gt;                                 0.308&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Freakonomics (2005 Book)&lt;/strong&gt;                  0.307&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Jack Dorsey&lt;/strong&gt;                               0.306&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Nutrition&lt;/strong&gt;                                 0.305&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Puzzles&lt;/strong&gt;                                   0.305&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Silicon Valley Mergers &amp;amp; Acquisitions&lt;/strong&gt;     0.304&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Viral Growth &amp;amp; Analytics&lt;/strong&gt;                  0.304&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Amazon Web Services&lt;/strong&gt;                       0.304&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;StumbleUpon&lt;/strong&gt;                               0.303&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Exceptional Comment Threads&lt;/strong&gt;               0.303&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;Harvard&lt;/h2&gt;

&lt;p&gt;Topics are followed by p(school = Harvard|topic).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Harvard Business School&lt;/strong&gt;                   0.968&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Harvard Business Review&lt;/strong&gt;                   0.922&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Harvard Square&lt;/strong&gt;                            0.912&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Harvard Law School&lt;/strong&gt;                        0.912&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Jimmy Fallon&lt;/strong&gt;                              0.899&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Boston Red Sox&lt;/strong&gt;                            0.658&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Klout&lt;/strong&gt;                                     0.644&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Oprah Winfrey&lt;/strong&gt;                             0.596&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ivanka Trump&lt;/strong&gt;                              0.587&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dalai Lama&lt;/strong&gt;                                0.569&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Food in New York City&lt;/strong&gt;                     0.565&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;U2&lt;/strong&gt;                                        0.562&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TwitPic&lt;/strong&gt;                                   0.534&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;37signals&lt;/strong&gt;                                 0.522&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;David Lynch (director)&lt;/strong&gt;                    0.512&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Al Gore&lt;/strong&gt;                                   0.508&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TechStars&lt;/strong&gt;                                 0.49&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Baseball&lt;/strong&gt;                                  0.487&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Private Equity&lt;/strong&gt;                            0.471&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Classical Music&lt;/strong&gt;                           0.46&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Startups in New York City&lt;/strong&gt;                 0.458&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HootSuite&lt;/strong&gt;                                 0.449&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kiva&lt;/strong&gt;                                      0.442&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ultimate Frisbee&lt;/strong&gt;                          0.441&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Huffington Post&lt;/strong&gt;                           0.436&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New York City&lt;/strong&gt;                             0.433&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Charlie Cheever&lt;/strong&gt;                           0.433&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The New York Times&lt;/strong&gt;                        0.431&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Technology Journalism&lt;/strong&gt;                     0.431&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;McKinsey &amp;amp; Company&lt;/strong&gt;                        0.427&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TweetDeck&lt;/strong&gt;                                 0.422&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How Does X Work?&lt;/strong&gt;                          0.417&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ashton Kutcher&lt;/strong&gt;                            0.414&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Coldplay&lt;/strong&gt;                                  0.402&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conan O&amp;#8217;Brien&lt;/strong&gt;                             0.397&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fast Company&lt;/strong&gt;                              0.397&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WikiLeaks&lt;/strong&gt;                                 0.394&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Michael Jackson&lt;/strong&gt;                           0.389&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Guy Kawasaki&lt;/strong&gt;                              0.389&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Journalism&lt;/strong&gt;                                0.384&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wall Street Journal&lt;/strong&gt;                       0.384&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cambridge&lt;/strong&gt;                                 0.371&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seattle&lt;/strong&gt;                                   0.37&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cities &amp;amp; Metro Areas&lt;/strong&gt;                      0.357&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Boston&lt;/strong&gt;                                    0.353&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tim Ferriss (author)&lt;/strong&gt;                      0.35&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The New Yorker&lt;/strong&gt;                            0.343&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Law&lt;/strong&gt;                                       0.34&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mashable&lt;/strong&gt;                                  0.338&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Politics&lt;/strong&gt;                                  0.335&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Economist&lt;/strong&gt;                             0.334&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Barack Obama&lt;/strong&gt;                              0.333&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skiing&lt;/strong&gt;                                    0.329&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;McKinsey Quarterly&lt;/strong&gt;                        0.325&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wired (magazine)&lt;/strong&gt;                          0.316&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Bill Gates&lt;/strong&gt;                                0.31&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mad Men (TV series)&lt;/strong&gt;                       0.308&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;India&lt;/strong&gt;                                     0.306&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TED Talks&lt;/strong&gt;                                 0.306&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Netflix&lt;/strong&gt;                                   0.304&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wine&lt;/strong&gt;                                      0.303&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Angel Investors&lt;/strong&gt;                           0.302&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Facebook Ads&lt;/strong&gt;                              0.301&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;UC Berkeley&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Berkeley&lt;/strong&gt;                                  0.978&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;California Golden Bears&lt;/strong&gt;                   0.91&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Internships&lt;/strong&gt;                               0.717&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Web Marketing&lt;/strong&gt;                             0.484&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Social Strategy&lt;/strong&gt;                    0.453&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Southwest Airlines&lt;/strong&gt;                        0.451&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;WordPress&lt;/strong&gt;                                 0.429&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stock Market&lt;/strong&gt;                              0.429&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;BMW (automobile)&lt;/strong&gt;                          0.428&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Web Applications&lt;/strong&gt;                          0.423&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flickr&lt;/strong&gt;                                    0.422&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Snowboarding&lt;/strong&gt;                              0.42&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Electronic Music&lt;/strong&gt;                          0.404&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MySQL&lt;/strong&gt;                                     0.401&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Internet Advertising&lt;/strong&gt;                      0.399&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Search Engine Optimization (SEO)&lt;/strong&gt;          0.398&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Yelp&lt;/strong&gt;                                      0.396&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Groupon&lt;/strong&gt;                                   0.393&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;In-N-Out Burger&lt;/strong&gt;                           0.391&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Matrix (1999 movie)&lt;/strong&gt;                   0.389&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trading (finance)&lt;/strong&gt;                         0.385&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;jQuery&lt;/strong&gt;                                    0.381&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hedge Funds&lt;/strong&gt;                               0.378&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Social Media Marketing&lt;/strong&gt;                    0.377&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;San Francisco&lt;/strong&gt;                             0.376&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stealth Startups&lt;/strong&gt;                          0.362&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Yahoo!&lt;/strong&gt;                                    0.36&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cascading Style Sheets&lt;/strong&gt;                    0.359&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Angel Investors&lt;/strong&gt;                           0.355&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UX Design&lt;/strong&gt;                                 0.35&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;StarCraft&lt;/strong&gt;                                 0.348&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Los Angeles Lakers&lt;/strong&gt;                        0.347&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mountain View&lt;/strong&gt;                             0.345&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How I Met Your Mother (TV series)&lt;/strong&gt;         0.338&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google+&lt;/strong&gt;                                   0.337&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ruby on Rails&lt;/strong&gt;                             0.333&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reading&lt;/strong&gt;                                   0.333&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Social Media&lt;/strong&gt;                              0.326&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;China&lt;/strong&gt;                                     0.322&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Palantir Technologies&lt;/strong&gt;                     0.319&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Facebook Platform&lt;/strong&gt;                         0.315&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Basketball&lt;/strong&gt;                                0.315&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Education&lt;/strong&gt;                                 0.314&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Business Development&lt;/strong&gt;                      0.312&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Online &amp;amp; Mobile Payments&lt;/strong&gt;                  0.305&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Restaurants in San Francisco&lt;/strong&gt;              0.302&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Technology Companies&lt;/strong&gt;                      0.302&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Seth Godin&lt;/strong&gt;                                0.3&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;Caltech&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Pasadena&lt;/strong&gt;                                  0.969&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chess&lt;/strong&gt;                                     0.748&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Table Tennis&lt;/strong&gt;                              0.671&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UCLA&lt;/strong&gt;                                      0.67&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MacBook Pro&lt;/strong&gt;                               0.618&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Physics&lt;/strong&gt;                                   0.618&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Haskell&lt;/strong&gt;                                   0.582&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Los Angeles&lt;/strong&gt;                               0.58&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Electrical Engineering&lt;/strong&gt;                    0.567&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Star Trek (movie)&lt;/strong&gt;                         0.561&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Disruptive Technology&lt;/strong&gt;                     0.545&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Science&lt;/strong&gt;                                   0.53&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Biology&lt;/strong&gt;                                   0.526&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantum Mechanics&lt;/strong&gt;                         0.521&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LaTeX&lt;/strong&gt;                                     0.514&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mathematics&lt;/strong&gt;                               0.488&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;xkcd&lt;/strong&gt;                                      0.488&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Genetics &amp;amp; Heredity&lt;/strong&gt;                       0.487&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chemistry&lt;/strong&gt;                                 0.47&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Medicine &amp;amp; Healthcare&lt;/strong&gt;                     0.448&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Poker&lt;/strong&gt;                                     0.445&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;C++ (programming language)&lt;/strong&gt;                0.442&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Structures&lt;/strong&gt;                           0.434&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Emacs&lt;/strong&gt;                                     0.428&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;                                   0.423&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Neuroscience&lt;/strong&gt;                              0.404&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Science Fiction&lt;/strong&gt;                           0.4&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mac OS X&lt;/strong&gt;                                  0.394&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Board Games&lt;/strong&gt;                               0.387&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Computers&lt;/strong&gt;                                 0.386&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Research&lt;/strong&gt;                                  0.385&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Finance&lt;/strong&gt;                                   0.385&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Future&lt;/strong&gt;                                0.379&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Linux&lt;/strong&gt;                                     0.378&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Colbert Report&lt;/strong&gt;                        0.376&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Beatles&lt;/strong&gt;                               0.374&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The Onion&lt;/strong&gt;                                 0.365&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ruby&lt;/strong&gt;                                      0.363&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cars &amp;amp; Automobiles&lt;/strong&gt;                        0.361&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quantitative Finance&lt;/strong&gt;                      0.359&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Academia&lt;/strong&gt;                                  0.359&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Law&lt;/strong&gt;                                       0.355&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cooking&lt;/strong&gt;                                   0.354&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Psychology&lt;/strong&gt;                                0.349&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eminem&lt;/strong&gt;                                    0.347&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Football (Soccer)&lt;/strong&gt;                         0.346&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Computer Programming&lt;/strong&gt;                      0.343&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Algorithms&lt;/strong&gt;                                0.343&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evolutionary Biology&lt;/strong&gt;                      0.337&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Behavioral Economics&lt;/strong&gt;                      0.335&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;California&lt;/strong&gt;                                0.329&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Machine Learning&lt;/strong&gt;                          0.326&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Futurama&lt;/strong&gt;                                  0.324&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Social Advice&lt;/strong&gt;                             0.324&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;StarCraft II&lt;/strong&gt;                              0.319&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Job Interview Questions&lt;/strong&gt;                   0.318&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Game Theory&lt;/strong&gt;                               0.316&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;This American Life&lt;/strong&gt;                        0.315&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Economics&lt;/strong&gt;                                 0.314&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vim&lt;/strong&gt;                                       0.31&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Graduate School&lt;/strong&gt;                           0.309&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Git (revision control)&lt;/strong&gt;                    0.306&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Computer Science&lt;/strong&gt;                          0.303&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;What do we see?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, in a nice validation of this approach, we find that each school is interested in exactly the &lt;strong&gt;locations&lt;/strong&gt; we&amp;#8217;d expect: Caltech is interested in &lt;em&gt;Pasadena&lt;/em&gt; and &lt;em&gt;Los Angeles&lt;/em&gt;; MIT and Harvard are both interested in &lt;em&gt;Boston&lt;/em&gt; and &lt;em&gt;Cambridge&lt;/em&gt; (Harvard is interested in &lt;em&gt;New York City&lt;/em&gt; as well); Stanford is interested in &lt;em&gt;Palo Alto&lt;/em&gt;, &lt;em&gt;Northern California&lt;/em&gt;, and &lt;em&gt;San Francisco Bay Area&lt;/em&gt;; and Berkeley is interested in &lt;em&gt;Berkeley&lt;/em&gt;, &lt;em&gt;San Francisco&lt;/em&gt;, and &lt;em&gt;Mountain View&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;More interestingly, let&amp;#8217;s look at where each school likes to &lt;strong&gt;eat&lt;/strong&gt;. Stereotypically, we expect Harvard, Stanford, and Berkeley students to be more outgoing and social, and MIT and Caltech students to be more introverted. This is indeed what we find:

&lt;ul&gt;
&lt;li&gt;Harvard follows &lt;em&gt;Food in New York City&lt;/em&gt;; Stanford follows &lt;em&gt;Food &amp;amp; Restaurants in Palo Alto&lt;/em&gt;, &lt;em&gt;Restaurants in San Francisco&lt;/em&gt;, and &lt;em&gt;Food in San Francisco&lt;/em&gt;; and Berkeley follows &lt;em&gt;Restaurants in San Francisco&lt;/em&gt; and &lt;em&gt;In-N-Out Burger&lt;/em&gt;. In other words, Harvard, Stanford, and Berkeley love eating out.&lt;/li&gt;
&lt;li&gt;Caltech, on the other hand, loves &lt;em&gt;Cooking&lt;/em&gt;, and MIT loves &lt;em&gt;Healthy Eating&lt;/em&gt; &amp;#8211; both signs, perhaps, of a preference for eating in.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;And what does each university use to quench its &lt;strong&gt;thirst&lt;/strong&gt;? Harvard students like to drink &lt;em&gt;wine&lt;/em&gt; (classy!), while Stanford students prefer &lt;em&gt;beer&lt;/em&gt; (the social drink of choice).&lt;/li&gt;
&lt;li&gt;What about &lt;strong&gt;sports teams&lt;/strong&gt;? MIT and Caltech couldn&amp;#8217;t care less, though Harvard follows the &lt;em&gt;Boston Red Sox&lt;/em&gt;, Stanford follows the &lt;em&gt;San Francisco Giants&lt;/em&gt; (as well as their own &lt;em&gt;Stanford Football&lt;/em&gt; and &lt;em&gt;Stanford Cardinal&lt;/em&gt;), and Berkeley follows the &lt;em&gt;Los Angeles Lakers&lt;/em&gt; (and the &lt;em&gt;California Golden Bears&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;sports&lt;/strong&gt; themselves, MIT students like &lt;em&gt;skiing&lt;/em&gt;; Stanford students like &lt;em&gt;general exercise&lt;/em&gt;, &lt;em&gt;fitness&lt;/em&gt;, and &lt;em&gt;sports&lt;/em&gt;; Harvard students like &lt;em&gt;baseball&lt;/em&gt;, &lt;em&gt;ultimate frisbee&lt;/em&gt;, and &lt;em&gt;skiing&lt;/em&gt;; and Berkeley students like &lt;em&gt;snowboarding&lt;/em&gt;. Caltech, in a league of its own, enjoys &lt;em&gt;table tennis&lt;/em&gt; and &lt;em&gt;chess&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;What does each school think of &lt;strong&gt;social&lt;/strong&gt;? Caltech students look for Social &lt;em&gt;Advice&lt;/em&gt;. Berkeley students are interested in Social &lt;em&gt;Media&lt;/em&gt; and Social Media Marketing. MIT, on the more technical side, wants Social &lt;em&gt;Search&lt;/em&gt;. Stanford students, predictably, love the whole spectrum of social offerings, from Social &lt;em&gt;Dance&lt;/em&gt; and &lt;em&gt;The Social Network&lt;/em&gt;, to Social &lt;em&gt;Psychology&lt;/em&gt; and Social &lt;em&gt;Advice&lt;/em&gt;. (Interestingly, Caltech and Stanford are both interested in Social Advice, though I wonder if it&amp;#8217;s for slightly different reasons.)&lt;/li&gt;
&lt;li&gt;What&amp;#8217;s each school&amp;#8217;s relationship with &lt;strong&gt;computers&lt;/strong&gt;? Caltech students are interested in Computer &lt;em&gt;Science&lt;/em&gt;, MIT hackers are interested in Computer &lt;em&gt;Security&lt;/em&gt;, and Stanford students are interested in Human-Computer &lt;em&gt;Interaction&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Digging into the &lt;strong&gt;MIT vs. Caltech&lt;/strong&gt; divide a little, we see that Caltech students really are more interested in the pure sciences (&lt;em&gt;Physics, Science, Biology, Quantum Mechanics, Mathematics, Chemistry&lt;/em&gt;, etc.), while MIT students are more on the applied and engineering sides (&lt;em&gt;Mechanical Engineering, Engineering, Distributed Databases, Cryptography, Computer Security, Biotechnology, Operating Systems, Compilers&lt;/em&gt;, etc.).&lt;/li&gt;
&lt;li&gt;Regarding &lt;strong&gt;programming languages&lt;/strong&gt;, Caltech students love &lt;em&gt;Haskell&lt;/em&gt; (hardcore purity!), while MIT students love &lt;em&gt;Lisp&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;What does each school like to &lt;strong&gt;read&lt;/strong&gt;, both offline and online? Caltech loves &lt;em&gt;science fiction&lt;/em&gt;, &lt;em&gt;xkcd&lt;/em&gt;, and &lt;em&gt;The Onion&lt;/em&gt;; MIT likes &lt;em&gt;Hacker News&lt;/em&gt;; Harvard loves journals, newspapers, and magazines (&lt;em&gt;Huffington Post&lt;/em&gt;, the &lt;em&gt;&lt;a href=&quot;http://stuffwhitepeoplelike.com/2008/01/31/45-the-sunday-new-york-times/&quot;&gt;New York Times&lt;/a&gt;&lt;/em&gt;, &lt;em&gt;Fortune, Wall Street Journal, the New Yorker, the Economist&lt;/em&gt;, and so on); and Stanford likes &lt;em&gt;TechMeme&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;What &lt;strong&gt;movies and television shows&lt;/strong&gt; does each school like to watch? Caltech likes &lt;em&gt;Star Trek&lt;/em&gt;, the &lt;em&gt;Colbert Report&lt;/em&gt;, and &lt;em&gt;Futurama&lt;/em&gt;. MIT likes &lt;em&gt;Fight Club&lt;/em&gt; (I don&amp;#8217;t know what this has to do with MIT, though I will note that on my first day as a freshman in a new dorm, Fight Club was precisely the movie we all went to a lecture hall to see). Stanford likes &lt;em&gt;The Social Network&lt;/em&gt; and &lt;em&gt;Inception&lt;/em&gt;. Harvard, rather fittingly, likes &lt;em&gt;&lt;a href=&quot;http://stuffwhitepeoplelike.com/2009/03/11/123-mad-men/&quot;&gt;Mad Men&lt;/a&gt;&lt;/em&gt; and &lt;em&gt;&lt;a href=&quot;http://stuffwhitepeoplelike.com/2010/09/08/134-the-ted-conference/&quot;&gt;TED Talks&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Let&amp;#8217;s look at the &lt;strong&gt;startups&lt;/strong&gt; each school follows. MIT, of course, likes &lt;em&gt;Ksplice&lt;/em&gt;. Berkeley likes &lt;em&gt;Yelp&lt;/em&gt; and &lt;em&gt;Groupon&lt;/em&gt;. Stanford likes just about every startup under the sun (&lt;em&gt;Instagram, Flipboard, Tumblr, Path, Color Labs&lt;/em&gt;, etc.). And Harvard, that bastion of hard-won influence and prestige? To the surprise of precisely no one, Harvard enjoys &lt;em&gt;Klout&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Let&amp;#8217;s end with a summarized view of each school:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Caltech&lt;/strong&gt; is very much into the sciences (&lt;em&gt;Physics, Biology, Quantum Mechanics, Mathematics&lt;/em&gt;, etc.), as well as many pretty nerdy topics (&lt;em&gt;Star Trek, Science Fiction, xkcd, Futurama, StarCraft II&lt;/em&gt;, etc.).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MIT&lt;/strong&gt; is dominated by everything engineering and tech.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stanford&lt;/strong&gt; loves relationships (&lt;em&gt;interpersonal relationships, people skills, love, network effects, sex, etiquette, dating and relationships, romance&lt;/em&gt;), health and appearance (&lt;em&gt;fashion, fitness, nutrition, happiness&lt;/em&gt;), and startups (&lt;em&gt;Instagram, Flipboard, Path, Color Labs&lt;/em&gt;, etc.).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Berkeley&lt;/strong&gt;, sadly, is perhaps too large and diverse for an overall characterization.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Harvard&lt;/strong&gt; students are fascinated by famous figures (&lt;em&gt;Jimmy Fallon, Oprah Winfrey, Ivanka Trump, Dalai Lama, David Lynch, Al Gore, Bill Gates, Barack Obama&lt;/em&gt;), and by prestigious newspapers, journals, and magazines (&lt;em&gt;Fortune, the New York Times, the Wall Street Journal, the Economist&lt;/em&gt;, and so on). Other very fitting interests include &lt;em&gt;&lt;a href=&quot;http://stuffwhitepeoplelike.com/2008/01/21/12-non-profit-organizations/&quot;&gt;Kiva&lt;/a&gt;, &lt;a href=&quot;http://stuffwhitepeoplelike.com/2008/09/01/108-appearing-to-enjoy-classical-music/&quot;&gt;classical music&lt;/a&gt;&lt;/em&gt;, and &lt;em&gt;&lt;a href=&quot;http://www.vanityfair.com/online/daily/2008/06/coldplay&quot;&gt;Coldplay&lt;/a&gt;&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;*I pulled about 400 followers from each school and added a couple of filters to try to ensure that followers were actual attendees of the schools rather than people simply interested in them. Topics are sorted using a naive Bayes score and filtered to those with a count of at least 5. Also, a word of warning: my dataset was fairly small, and users on Quora are almost certainly not representative of their schools as a whole (though I tried to be rigorous with what I had).&lt;/p&gt;
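
&lt;p&gt;A minimal Python sketch of how a score like this could be computed. The exact formula is not spelled out above, so this is only one plausible reading: it assumes the score is the posterior P(school | topic) under a uniform prior over schools, with add-one smoothing. The function name, the &lt;code&gt;followers_by_school&lt;/code&gt; input, and the smoothing constant are all hypothetical.&lt;/p&gt;

```python
from collections import Counter


def naive_bayes_topic_scores(followers_by_school, school, min_count=5, smoothing=1.0):
    """Rank topics for one school by an assumed score P(school | topic).

    followers_by_school maps a school name to a list of topic sets,
    one set per sampled follower (a hypothetical input format).
    """
    # Count how many sampled followers of each school follow each topic.
    counts = {
        s: Counter(topic for topics in users for topic in topics)
        for s, users in followers_by_school.items()
    }
    sizes = {s: len(users) for s, users in followers_by_school.items()}

    scores = {}
    for topic, count in counts[school].items():
        # Keep only topics followed by at least min_count of this school's users.
        if count >= min_count:
            # Smoothed likelihood P(topic | s) for every school s.
            likelihood = {
                s: (counts[s][topic] + smoothing) / (sizes[s] + 2 * smoothing)
                for s in counts
            }
            # With a uniform prior, the posterior is the normalized likelihood.
            scores[topic] = likelihood[school] / sum(likelihood.values())

    # Highest-scoring topics first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

&lt;p&gt;Under this reading, a topic followed almost exclusively by one school scores close to 1, while a topic followed equally by every school scores 1 divided by the number of schools.&lt;/p&gt;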
</content>
  </entry>
  
  <entry>
    <title>Information Transmission in a Social Network: Dissecting the Spread of a Quora Post</title>
    <link href="http://blog.echen.me/2011/09/07/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/"/>
    <updated>2011-09-07T11:15:11-07:00</updated>
    <id>http://blog.echen.me/2011/09/07/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post</id>
    <content type="html">&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; See &lt;a href=&quot;http://www.youtube.com/watch?v=cZ4Ntg4jQHw&quot;&gt;this movie visualization&lt;/a&gt; for a case study on how a post propagates through Quora.&lt;/p&gt;

&lt;p&gt;How does information spread through a network? Much of Quora&amp;#8217;s appeal, after all, lies in its social graph &amp;#8211; and when you&amp;#8217;ve got a network of users, all broadcasting their activities to their neighbors, information can cascade in multiple ways. How do these social designs affect which users see what?&lt;/p&gt;

&lt;p&gt;Think, for example, of what happens when your kid learns a new slang word at school. He doesn&amp;#8217;t confine his use of the word to McKinley Elementary&amp;#8217;s particular boundaries, between 9am and 3pm &amp;#8211; he introduces it to his friends from other schools at soccer practice as well. A couple months later, he even says it at home for the first time; you like the word so much, you then start using it at work. Eventually, Justin Bieber uses the word in a song, at which point the word&amp;#8217;s popularity really starts to explode.&lt;/p&gt;

&lt;p&gt;So how does information propagate through a social network? What types of people does an answer on Quora reach, and how does it reach them? (Do users discover new answers individually, or are hubs of connectors more key?) How does the activity of a post on Quora rise and fall? (Submissions on other sites have limited lifetimes, fading into obscurity soon after an initial spike; how does that change when users are connected and every upvote can revive a post for someone else&amp;#8217;s eyes?)&lt;/p&gt;

&lt;p&gt;(I looked at Quora since I already had some data from there, but I hope the lessons are fairly applicable to other social networks like Facebook, Twitter, and LinkedIn as well.)&lt;/p&gt;

&lt;p&gt;To give an initial answer to some of these questions, I dug into one of my more popular posts, on &lt;a href=&quot;http://www.quora.com/Random-Forests/How-do-random-forests-work-in-laymans-terms&quot;&gt;a layman&amp;#8217;s introduction to random forests&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;Users, Topics&lt;/h1&gt;

&lt;p&gt;Before looking deeper into the voting dynamics of the post, let&amp;#8217;s first get some background on what kinds of users the answer reached.&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s a graph of the topics that question upvoters follow. (Each node is a topic, and every time upvoter X follows both topics A and B, I add an edge between A and B.)&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/social-network-transmission/rf-upvoter-topics-unlabeled.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/social-network-transmission/rf-upvoter-topics-unlabeled.png&quot; alt=&quot;Upvoters&#8217; Topics - Unlabeled&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/social-network-transmission/rf-upvoter-topics-labeled.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/social-network-transmission/rf-upvoter-topics-labeled.png&quot; alt=&quot;Upvoters&#8217; Topics - Labeled&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see from the graph that upvoters tend to be interested in three kinds of topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Machine learning and other technical matters&lt;/strong&gt; (the green cluster): Classification, Data Mining, Big Data, Information Retrieval, Analytics, Probability, Support Vector Machines, R, Data Science, &amp;#8230;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Startups/Silicon Valley&lt;/strong&gt; (the red cluster): Facebook, Lean Startups, Investing, Seed Funding, Angel Investing, Technology Trends, Product Management, Silicon Valley Mergers and Acquisitions, Asana, Social Games, Quora, Mark Zuckerberg, User Experience, Founders and Entrepreneurs, &amp;#8230;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;General Intellectual Topics&lt;/strong&gt; (the purple cluster): TED, Science, Book Recommendations, Philosophy, Politics, Self-Improvement, Travel, Life Hacks, &amp;#8230;&lt;/li&gt;
&lt;/ul&gt;
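
&lt;p&gt;The topic graph described above (one edge between topics A and B for every upvoter who follows both) can be sketched in a few lines of Python. This is only an illustration: the function name and the &lt;code&gt;topics_by_upvoter&lt;/code&gt; input format are hypothetical, and edge weights are simply counts of shared followers.&lt;/p&gt;

```python
from collections import Counter
from itertools import combinations


def topic_cooccurrence_edges(topics_by_upvoter):
    """Map each (topic_a, topic_b) pair to the number of upvoters following both.

    topics_by_upvoter maps an upvoter to the topics they follow
    (a hypothetical input format).
    """
    edges = Counter()
    for topics in topics_by_upvoter.values():
        # Sort so each unordered pair is always stored as the same tuple.
        for topic_a, topic_b in combinations(sorted(set(topics)), 2):
            edges[(topic_a, topic_b)] += 1
    return edges
```

&lt;p&gt;The resulting weighted edge list can then be handed to whatever graph layout or clustering tool is used to draw the picture.&lt;/p&gt;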


&lt;p&gt;Also, here&amp;#8217;s the network of the upvoters themselves (there&amp;#8217;s an edge between users A and B if A follows B):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/social-network-transmission/rf-upvoters-unlabeled.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/social-network-transmission/rf-upvoters-unlabeled.png&quot; alt=&quot;Upvote Network - Unlabeled&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/social-network-transmission/rf-upvoters-labeled.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/social-network-transmission/rf-upvoters-labeled.png&quot; alt=&quot;Upvote Network - Labeled&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see three main clusters of users:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A large group in &lt;strong&gt;green&lt;/strong&gt; centered around a lot of power users and Quora employees.&lt;/li&gt;
&lt;li&gt;A machine learning group of folks in &lt;strong&gt;orange&lt;/strong&gt; centered around people like Olivier Grisel, Christian Langreiter, and Joseph Turian.&lt;/li&gt;
&lt;li&gt;A group of people following me, in &lt;strong&gt;purple&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Plus some smaller clusters in blue and yellow. (There were also a bunch of isolated users, connected to no one, that I filtered out of the picture.)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Digging into how these topic and user graphs are related:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The orange cluster of users is more heavily into machine learning: 79% of users in that cluster follow more green topics (machine learning and technical topics) than red and purple topics (startups and general intellectual matters).&lt;/li&gt;
&lt;li&gt;The green cluster of users is reversed: 77% of users follow more of the red and purple clusters of topics (on startups and general intellectual matters) than machine learning and technical topics.&lt;/li&gt;
&lt;/ul&gt;
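
&lt;p&gt;A percentage like the ones above can be computed by checking, for each user in a cluster, whether they follow more topics from one group than from the other. A minimal Python sketch, with a hypothetical function name and input format:&lt;/p&gt;

```python
def share_preferring(users_topics, preferred_topics, other_topics):
    """Fraction of users who follow more preferred_topics than other_topics.

    users_topics is a list of topic collections, one per user in the cluster
    (a hypothetical input format).
    """
    preferred, other = set(preferred_topics), set(other_topics)
    wins = sum(
        1
        for topics in users_topics
        # Compare how many topics the user follows from each group.
        if len(preferred.intersection(topics)) > len(other.intersection(topics))
    )
    return wins / len(users_topics)
```

&lt;p&gt;Here ties count against the preferred group, which is one possible convention; the text above does not say how ties were handled.&lt;/p&gt;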


&lt;p&gt;More interestingly, though, we can ask: how do the connections between upvoters relate to the way the post spread?&lt;/p&gt;

&lt;h1&gt;Social Voting Dynamics&lt;/h1&gt;

&lt;p&gt;So let&amp;#8217;s take a look. Here&amp;#8217;s a visualization I made of upvotes on my answer across time (click &lt;a href=&quot;http://www.youtube.com/watch?v=cZ4Ntg4jQHw&quot;&gt;here&lt;/a&gt; for a larger view).&lt;/p&gt;

&lt;iframe width=&quot;640&quot; height=&quot;510&quot; src=&quot;http://www.youtube.com/embed/cZ4Ntg4jQHw&quot; frameborder=&quot;0&quot; allowfullscreen&gt;&lt;/iframe&gt;



&lt;p&gt;To represent the social dynamics of these upvotes, I drew an edge from user A to user B if user A transmitted the post to user B through an upvote. (Specifically, I drew an edge from Alice to Bob if Bob follows Alice and Bob&amp;#8217;s upvote appeared within five days of Alice&amp;#8217;s upvote; this is meant to simulate the idea that Alice was the key intermediary between my post and Bob.)&lt;/p&gt;

&lt;p&gt;Also,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Green nodes are users with at least one upvote edge.&lt;/li&gt;
&lt;li&gt;Blue nodes are users who follow at least one of the topics the post is categorized under (i.e., users who probably discovered the answer by themselves).&lt;/li&gt;
&lt;li&gt;Red nodes are users with no connections who also do not follow any of the post&amp;#8217;s topics (i.e., users whose path to the post remains mysterious).&lt;/li&gt;
&lt;li&gt;Users increase in size when they produce more connections.&lt;/li&gt;
&lt;/ul&gt;
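&lt;p&gt;The edge and coloring rules above can be sketched in a few lines of Python. This is only an illustration, not the script I actually used: the function names and data shapes are hypothetical, and I&amp;#8217;m assuming that green takes precedence when a user both has an edge and follows one of the post&amp;#8217;s topics.&lt;/p&gt;

```python
from datetime import datetime, timedelta


def build_upvote_edges(upvotes, follows, window=timedelta(days=5)):
    """Return (source, target) pairs following the five-day rule.

    upvotes: dict mapping each user to the datetime of their upvote.
    follows: set of (follower, followee) pairs.
    """
    edges = []
    for follower, followee in sorted(follows):
        if follower not in upvotes or followee not in upvotes:
            continue
        gap = upvotes[follower] - upvotes[followee]
        # The follower must upvote after the followee, within the window.
        if gap > timedelta(0) and window >= gap:
            edges.append((followee, follower))
    return edges


def node_color(user, edges, topic_followers):
    """Color a node; green wins over blue (an assumption, see above)."""
    if any(user in edge for edge in edges):
        return "green"
    if user in topic_followers:
        return "blue"
    return "red"


# Made-up data: Bob follows Alice and upvotes two days after she does.
start = datetime(2011, 2, 14)
example_upvotes = {"alice": start, "bob": start + timedelta(days=2)}
example_edges = build_upvote_edges(example_upvotes, {("bob", "alice")})
```

&lt;p&gt;Node sizes would then simply be the number of edges leaving each user.&lt;/p&gt;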


&lt;p&gt;Here&amp;#8217;s a play-by-play of the video:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On Feb 14 (the day I wrote the answer), there&amp;#8217;s a flurry of activity.&lt;/li&gt;
&lt;li&gt;A couple of days later, Tracy Chou gives an upvote, leading to another spike in activity.&lt;/li&gt;
&lt;li&gt;Then all&amp;#8217;s quiet until&amp;#8230; bam! Alex Kamil leads to a surge of upvotes, and his upvote finds Ludi Rehak, who starts a small surge of her own. They&amp;#8217;re quickly followed by Christian Langreiter, who starts a small revolution among a bunch of machine learning folks a couple days later.&lt;/li&gt;
&lt;li&gt;Then all is pretty calm again, until a couple months later when&amp;#8230; bam! Aditya Sengupta brings in a smattering of his own followers, and his upvote makes its way to Marc Bodnick, who sets off a veritable storm of activity.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;(Already we can see some relationships between the graph of user connections and the way the post propagated. Many of the users from the orange cluster, for example, come from Alex Kamil and Christian Langreiter&amp;#8217;s upvotes, and many of the users from the green cluster come from Aditya Sengupta and Marc Bodnick&amp;#8217;s upvotes. What&amp;#8217;s interesting, though, is this: why didn&amp;#8217;t the cluster of green users appear all at once, like the orange cluster did? People like Kah Seng Tay, Tracy Chou, Venkatesh Rao, and Chad Little upvoted the answer pretty early on, but it wasn&amp;#8217;t until Aditya Sengupta&amp;#8217;s upvote a couple months later that people like Marc Bodnick, Edmond Lau, and many of the other green users (who do indeed follow that first set of folks) discovered the answer. Did the post simply get lost in users&amp;#8217; feeds the first time around? Was the post perhaps ignored until it received enough upvotes to be considered worth reading? Are some users&amp;#8217; upvotes just trusted more than others&amp;#8217;?)&lt;/p&gt;

&lt;p&gt;For another view of the upvote dynamics, here&amp;#8217;s a static visualization, where we can again easily see the clusters of activity:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/social-network-transmission/upvote-clusters-labeled-v2.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/social-network-transmission/upvote-clusters-labeled-v2.png&quot; alt=&quot;Upvote Temporal Clusters&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Fin&lt;/h1&gt;

&lt;p&gt;There are still many questions it would be interesting to look at; for example,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What differentiates users who sparked spikes of activity from users who didn&amp;#8217;t? I don&amp;#8217;t believe it&amp;#8217;s simply the number of followers, as many well-connected upvoters did &lt;em&gt;not&lt;/em&gt; lead to cascades of shares. Does authority matter?&lt;/li&gt;
&lt;li&gt;How far can a post reach? Clearly, the post reached people more than one degree of separation away from me (where one degree of separation is a follower); what does the distribution of degrees look like? Is there any relationship between degree of separation and time of upvote?&lt;/li&gt;
&lt;li&gt;What can we say about the people who started following me after reading my answer? Are they fewer degrees of separation away? Are they more interested in machine learning? Have they upvoted any of my answers before? (Perhaps there&amp;#8217;s a certain &amp;#8220;threshold&amp;#8221; of interestingness people need to cross before they&amp;#8217;re considered acceptable followees.)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;But to summarize a bit what we&amp;#8217;ve seen so far, here are some statistics on the role the social graph played in spreading the post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are 5 clusters of activity after the initial post, sparked both by power users and less-connected folks. In an interesting cascade of information, some of these sparks led to further spikes in activity as well (as when Aditya Sengupta&amp;#8217;s upvote found its way to Marc Bodnick, who set off even more activity).&lt;/li&gt;
&lt;li&gt;35% of users made their way to my answer because of someone else&amp;#8217;s upvote.&lt;/li&gt;
&lt;li&gt;Through these connections, the post reached a fair variety of users: 32% of upvoters don&amp;#8217;t even follow any of the post&amp;#8217;s topics.&lt;/li&gt;
&lt;li&gt;77% of upvotes came more than two weeks &lt;em&gt;after&lt;/em&gt; my answer appeared.&lt;/li&gt;
&lt;li&gt;If we look only at the upvoters who follow at least one of the post&amp;#8217;s topics, 33% didn&amp;#8217;t see my answer until someone else showed it to them. In other words, a full one-third of people who presumably would have been interested in my post anyway only found it because of their social network.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So it looks like the social graph played quite a large part in the post&amp;#8217;s propagation, and I&amp;#8217;ll end with a big shoutout to Stormy Shippy, who provided an awesome set of scripts I used to collect a lot of this data.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Introduction to Latent Dirichlet Allocation</title>
    <link href="http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/"/>
    <updated>2011-08-22T10:50:49-07:00</updated>
    <id>http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation</id>
    <content type="html">&lt;h1&gt;Introduction&lt;/h1&gt;

&lt;p&gt;Suppose you have the following set of sentences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I like to eat broccoli and bananas.&lt;/li&gt;
&lt;li&gt;I ate a banana and spinach smoothie for breakfast.&lt;/li&gt;
&lt;li&gt;Chinchillas and kittens are cute.&lt;/li&gt;
&lt;li&gt;My sister adopted a kitten yesterday.&lt;/li&gt;
&lt;li&gt;Look at this cute hamster munching on a piece of broccoli.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;What is latent Dirichlet allocation? It&amp;#8217;s a way of automatically discovering &lt;strong&gt;topics&lt;/strong&gt; that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sentences 1 and 2&lt;/strong&gt;: 100% Topic A&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sentences 3 and 4&lt;/strong&gt;: 100% Topic B&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sentence 5&lt;/strong&gt;: 60% Topic A, 40% Topic B&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Topic A&lt;/strong&gt;: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, &amp;#8230; (at which point, you could interpret topic A to be about food)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Topic B&lt;/strong&gt;: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, &amp;#8230; (at which point, you could interpret topic B to be about cute animals)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The question, of course, is: how does LDA perform this discovery?&lt;/p&gt;

&lt;h1&gt;LDA Model&lt;/h1&gt;

&lt;p&gt;In more detail, LDA represents documents as &lt;strong&gt;mixtures of topics&lt;/strong&gt; that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decide on the number of words N the document will have (say, according to a Poisson distribution).&lt;/li&gt;
&lt;li&gt;Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.&lt;/li&gt;
&lt;li&gt;Generate each word w_i in the document by:

&lt;ul&gt;
&lt;li&gt;First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).&lt;/li&gt;
&lt;li&gt;Using the topic to generate the word itself (according to the topic&amp;#8217;s multinomial distribution). For example, if we selected the food topic, we might generate the word &amp;#8220;broccoli&amp;#8221; with 30% probability, &amp;#8220;bananas&amp;#8221; with 15% probability, and so on.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.&lt;/p&gt;

&lt;h2&gt;Example&lt;/h2&gt;

&lt;p&gt;Let&amp;#8217;s work through an example. According to the above process, when generating some particular document D, you might&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick 5 to be the number of words in D.&lt;/li&gt;
&lt;li&gt;Decide that D will be 1/2 about food and 1/2 about cute animals.&lt;/li&gt;
&lt;li&gt;Pick the first word to come from the food topic, which then gives you the word &amp;#8220;broccoli&amp;#8221;.&lt;/li&gt;
&lt;li&gt;Pick the second word to come from the cute animals topic, which gives you &amp;#8220;panda&amp;#8221;.&lt;/li&gt;
&lt;li&gt;Pick the third word to come from the cute animals topic, giving you &amp;#8220;adorable&amp;#8221;.&lt;/li&gt;
&lt;li&gt;Pick the fourth word to come from the food topic, giving you &amp;#8220;cherries&amp;#8221;.&lt;/li&gt;
&lt;li&gt;Pick the fifth word to come from the food topic, giving you &amp;#8220;eating&amp;#8221;.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So the document generated under the LDA model will be &amp;#8220;broccoli panda adorable cherries eating&amp;#8221; (note that LDA is a bag-of-words model).&lt;/p&gt;
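&lt;p&gt;For the code-minded, here&amp;#8217;s a rough sketch of that generative story using NumPy. The function name, the toy vocabulary, and the parameter values are all made up for illustration, and forcing every document to have at least one word is my own simplification.&lt;/p&gt;

```python
import numpy as np


def generate_document(topics, vocab, alpha, mean_length, rng):
    """Sample one bag-of-words document from the LDA generative story.

    topics: array of shape (K, V); row k is topic k's word distribution.
    vocab: list of V words.
    alpha: Dirichlet concentration parameters, one per topic.
    """
    # Decide on the number of words (Poisson), keeping at least one.
    n_words = max(1, rng.poisson(mean_length))
    # Choose the topic mixture for this document (Dirichlet).
    mixture = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        k = rng.choice(len(mixture), p=mixture)   # pick a topic
        w = rng.choice(len(vocab), p=topics[k])   # pick a word from it
        words.append(vocab[w])
    return mixture, words


# Toy food and cute-animal topics (probabilities invented for the example).
toy_vocab = ["broccoli", "bananas", "cherries", "eating",
             "panda", "adorable", "kitten", "cute"]
toy_topics = np.array([
    [0.3, 0.3, 0.2, 0.2, 0.0, 0.0, 0.0, 0.0],      # food
    [0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.25, 0.25],  # cute animals
])
toy_mixture, toy_words = generate_document(
    toy_topics, toy_vocab, alpha=[1.0, 1.0], mean_length=5,
    rng=np.random.default_rng(0),
)
```

&lt;p&gt;Since word order plays no role in how the words are drawn, shuffling the output gives an equally likely document, which is exactly the bag-of-words assumption.&lt;/p&gt;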

&lt;h1&gt;Learning&lt;/h1&gt;

&lt;p&gt;So now suppose you have a set of documents. You&amp;#8217;ve chosen some fixed number K of topics to discover, and want to use LDA to learn the topic representation of each document and the words associated with each topic. How do you do this? One way (known as collapsed Gibbs sampling) is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go through each document, and randomly assign each word in the document to one of the K topics.&lt;/li&gt;
&lt;li&gt;Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).&lt;/li&gt;
&lt;li&gt;So to improve on them, for each document d&amp;#8230;

&lt;ul&gt;
&lt;li&gt;Go through each word w in d&amp;#8230;

&lt;ul&gt;
&lt;li&gt;And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where we choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word&amp;#8217;s topic with this probability). (Also, I&amp;#8217;m glossing over a couple of things here, in particular the use of priors/pseudocounts in these probabilities.)&lt;/li&gt;
&lt;li&gt;In other words, in this step, we&amp;#8217;re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;After repeating the previous step a large number of times, you&amp;#8217;ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated with each topic (by counting the proportion of words assigned to each topic overall).&lt;/li&gt;
&lt;/ul&gt;
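&lt;p&gt;Here&amp;#8217;s a minimal sketch of the procedure above in Python with NumPy. Treat it as a toy rather than a reference implementation: the function name is hypothetical, the pseudocounts alpha and beta (the priors I glossed over) are arbitrary placeholder values, and each word&amp;#8217;s own assignment is removed from the counts before it is resampled.&lt;/p&gt;

```python
import numpy as np


def gibbs_lda(docs, n_topics, n_vocab, n_iters=200,
              alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of integer word ids.
    alpha, beta: pseudocounts (placeholder values, chosen arbitrarily).
    Returns (theta, phi): per-document topic mixtures and per-topic
    word distributions.
    """
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))  # words in doc d on topic t
    topic_word = np.zeros((n_topics, n_vocab))   # word w assigned to topic t
    topic_total = np.zeros(n_topics)             # all words on topic t

    # Randomly assign each word in each document to one of the K topics.
    assignments = []
    for d, doc in enumerate(docs):
        z = rng.integers(n_topics, size=len(doc))
        assignments.append(z)
        for w, t in zip(doc, z):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
            topic_total[t] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current word out of the counts.
                t = assignments[d][i]
                doc_topic[d, t] -= 1
                topic_word[t, w] -= 1
                topic_total[t] -= 1

                # p(topic t | document d) * p(word w | topic t), smoothed.
                # The first factor is left unnormalized because its
                # denominator is the same for every topic.
                p_topic = doc_topic[d] + alpha
                p_word = (topic_word[:, w] + beta) / (topic_total + beta * n_vocab)
                weights = p_topic * p_word
                t = rng.choice(n_topics, p=weights / weights.sum())

                # Put the word back under its newly sampled topic.
                assignments[d][i] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
                topic_total[t] += 1

    # Estimate mixtures and word distributions from the final counts.
    theta = (doc_topic + alpha) / (doc_topic + alpha).sum(axis=1, keepdims=True)
    phi = (topic_word + beta) / (topic_word + beta).sum(axis=1, keepdims=True)
    return theta, phi
```

&lt;p&gt;Each row of theta is a document&amp;#8217;s topic mixture and each row of phi is a topic&amp;#8217;s word distribution. Real implementations add burn-in, average over multiple samples, and tune the priors, none of which is shown here.&lt;/p&gt;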


&lt;h1&gt;Layman&amp;#8217;s Explanation&lt;/h1&gt;

&lt;p&gt;In case the discussion above was a little eye-glazing, here&amp;#8217;s another way to look at LDA in a different domain.&lt;/p&gt;

&lt;p&gt;Suppose you&amp;#8217;ve just moved to a new city. You&amp;#8217;re a hipster and an anime fan, so you want to know where the other hipsters and anime geeks tend to hang out. Of course, as a hipster, you know you can&amp;#8217;t just &lt;em&gt;ask&lt;/em&gt;, so what do you do?&lt;/p&gt;

&lt;p&gt;Here&amp;#8217;s the scenario: you scope out a bunch of different establishments (&lt;strong&gt;documents&lt;/strong&gt;) across town, making note of the people (&lt;strong&gt;words&lt;/strong&gt;) hanging out in each of them (e.g., Alice hangs out at the mall and at the park, Bob hangs out at the movie theater and the park, and so on). Crucially, you don&amp;#8217;t know the typical interest groups (&lt;strong&gt;topics&lt;/strong&gt;) of each establishment, nor do you know the different interests of each person.&lt;/p&gt;

&lt;p&gt;So you pick some number K of categories to learn (i.e., you want to learn the K most important kinds of categories people fall into), and start by making a guess as to why you see people where you do. For example, you initially guess that Alice is at the mall because people with interests in X like to hang out there; when you see her at the park, you guess it&amp;#8217;s because her friends with interests in Y like to hang out there; when you see Bob at the movie theater, you randomly guess it&amp;#8217;s because the Z people in this city really like to watch movies; and so on.&lt;/p&gt;

&lt;p&gt;Of course, your random guesses are very likely to be incorrect (they&amp;#8217;re random guesses, after all!), so you want to improve on them. One way of doing so is to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick a place and a person (e.g., Alice at the mall).&lt;/li&gt;
&lt;li&gt;Why is Alice likely to be at the mall? Probably because other people at the mall with the same interests sent her a message telling her to come.&lt;/li&gt;
&lt;li&gt;In other words, the more people with interests in X there are at the mall and the more strongly Alice is associated with interest X (at all the other places she goes to), the more likely it is that Alice is at the mall because of interest X.&lt;/li&gt;
&lt;li&gt;So make a new guess as to why Alice is at the mall, choosing an interest with some probability according to how likely you think it is.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Go through each place and person over and over again. Your guesses keep getting better and better (after all, if you notice that lots of geeks hang out at the bookstore, and you suspect that Alice is pretty geeky herself, then it&amp;#8217;s a good bet that Alice is at the bookstore because her geek friends told her to go there; and now that you have a better idea of why Alice is probably at the bookstore, you can use this knowledge in turn to improve your guesses as to why everyone else is where they are), and eventually you can stop updating. Then take a snapshot (or multiple snapshots) of your guesses, and use it to get all the information you want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For each category, you can count the people assigned to that category to figure out what people have this particular interest. By looking at the people themselves, you can interpret the category as well (e.g., if category X contains lots of tall people wearing jerseys and carrying around basketballs, you might interpret X as the &amp;#8220;basketball players&amp;#8221; group).&lt;/li&gt;
&lt;li&gt;For each place P and interest category C, you can compute the proportions of people at P because of C (under the current set of assignments), and these give you a representation of P. For example, you might learn that the people who hang out at Barnes &amp;amp; Noble consist of 10% hipsters, 50% anime fans, 10% jocks, and 30% college students.&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;Real-World Example&lt;/h1&gt;

&lt;p&gt;Finally, I applied LDA to a set of Sarah Palin&amp;#8217;s emails a little while ago (see &lt;a href=&quot;http://blog.echen.me/2011/06/27/topic-modeling-the-sarah-palin-emails/&quot;&gt;here&lt;/a&gt; for the blog post, or &lt;a href=&quot;http://sarah-palin.heroku.com/&quot;&gt;here&lt;/a&gt; for an app that allows you to browse through the emails by the LDA-learned categories), so let&amp;#8217;s give a brief recap. Here are some of the topics that the algorithm learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Trig/Family/Inspiration&lt;/strong&gt;: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, &amp;#8230;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Wildlife/BP Corrosion&lt;/strong&gt;: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, denby, fishing, &amp;#8230;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Energy/Fuel/Oil/Mining:&lt;/strong&gt; energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, &amp;#8230;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gas&lt;/strong&gt;: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, &amp;#8230;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Education/Waste&lt;/strong&gt;: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, &amp;#8230;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Presidential Campaign/Elections&lt;/strong&gt;: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, &amp;#8230;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Here&amp;#8217;s an example of an email which fell 99% into the Trig/Family/Inspiration category (particularly representative words are highlighted in blue):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/palin-browser/trig-email.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/palin-browser/trig-email.png&quot; alt=&quot;Trig Email&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here&amp;#8217;s an excerpt from an email which fell 10% into the Presidential Campaign/Elections category (in red) and 90% into the Wildlife/BP Corrosion category (in green):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/palin-browser/wildlife-presidency-email.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/palin-browser/wildlife-presidency-email.png&quot; alt=&quot;Wildlife-Presidency Email&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Tweets vs. Likes: What gets shared on Twitter vs. Facebook?</title>
    <link href="http://blog.echen.me/2011/07/28/tweets-vs-likes-what-gets-shared-on-twitter-vs-facebook/"/>
    <updated>2011-07-28T07:55:29-07:00</updated>
    <id>http://blog.echen.me/2011/07/28/tweets-vs-likes-what-gets-shared-on-twitter-vs-facebook</id>
    <content type="html">&lt;p&gt;It always strikes me as curious that some posts get a lot of love on Twitter, while others get many more shares on Facebook:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/twitter-beats-fb.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/twitter-beats-fb.png&quot; alt=&quot;Twitter Beats FB&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/fb-beats-twitter.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/fb-beats-twitter.png&quot; alt=&quot;FB Beats Twitter&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What accounts for this difference? Some of it is surely site-dependent: maybe one blogger has a Facebook page but not a Twitter account, while another has these roles reversed. But even on sites maintained by a single author, tweets-to-likes ratios can vary widely from post to post.&lt;/p&gt;

&lt;p&gt;So what kinds of articles tend to be more popular on Twitter, and which spread more easily on Facebook? To take a stab at an answer, I scraped data from a couple of websites over the weekend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; Twitter is still for the &lt;em&gt;techies&lt;/em&gt;: articles where tweets greatly outnumber FB likes tend to revolve around software companies and programming. Facebook, on the other hand, appeals to &lt;em&gt;everyone else&lt;/em&gt;: to the masses, yes, but also to non-software technical folks in general.&lt;/p&gt;

&lt;h1&gt;FlowingData&lt;/h1&gt;

&lt;p&gt;The first site I looked at was Nathan Yau&amp;#8217;s awesome &lt;a href=&quot;http://www.flowingdata.com&quot;&gt;FlowingData&lt;/a&gt; website on data visualization. To see which articles are more popular on Facebook and which are more popular on Twitter, let&amp;#8217;s sort all the FlowingData articles by their # tweets / # likes ratio.&lt;/p&gt;

&lt;p&gt;Here are the 10 posts with the lowest tweets-to-likes ratio (i.e., the posts that were especially popular with Facebook users):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/flowingdata-facebook2.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/flowingdata-facebook2-small.png&quot; alt=&quot;FlowingData Facebook&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/01/30/what-your-state-is-the-worst-at-united-states-of-shame/&quot;&gt;What your state is the worst at – United States of shame&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/05/13/plush-statistical-distribution-pillows/&quot;&gt;Plush statistical distribution pillows&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/03/22/are-gas-prices-really-that-high/&quot;&gt;Are gas prices really that high?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/01/21/hey-jude-flowchart/&quot;&gt;Hey Jude flowchart&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/04/28/womens-dress-sizes-demystified/&quot;&gt;Women’s dress sizes demystified&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/02/22/america-is-not-the-best-at-everything/&quot;&gt;America is not the best at everything&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/06/10/what-you-need-to-get-together/&quot;&gt;What you need to get together&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/01/27/dexters-victims-through-season-five/&quot;&gt;Dexter’s victims through season five&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/05/06/correlating-dog/&quot;&gt;Correlating dog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/02/14/valentines-day-importance/&quot;&gt;Valentine’s Day importance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;And here are the 10 posts with the highest tweets-to-likes ratio (i.e., the posts especially popular with Twitter users):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/flowingdata-twitter.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/flowingdata-twitter-small.png&quot; alt=&quot;FlowingData Twitter&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/01/04/delicious-mass-exodus/&quot;&gt;Delicious mass exodus&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/05/25/pew-research-raw-survey-data-now-available/&quot;&gt;Pew Research raw survey data now available&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/02/03/stock-market-predictions-with-twitter/&quot;&gt;Stock market predictions with Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/01/25/growth-and-usage-of-foursquare-in-2010/&quot;&gt;Growth and usage of foursquare in 2010&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/02/17/sunlight-labs-opens-up-real-time-congress-api/&quot;&gt;Sunlight Labs opens up Real Time Congress API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/03/25/open-source-data-science-toolkit/&quot;&gt;Open-source Data Science Toolkit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/01/24/explore-your-linkedin-network-visually-with-inmaps/&quot;&gt;Explore your LinkedIn network visually with InMaps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/05/03/perceived-vs-actual-country-rankings/&quot;&gt;Perceived vs. actual country rankings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/04/20/see-what-you-and-others-tweet-about-with-the-topic-explorer/&quot;&gt;See what you and others tweet about with the Topic Explorer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/04/24/history-of-detainees-at-guantnamo/&quot;&gt;History of detainees at Guantánamo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Notice any differences between the two?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instant gratification infographics, cuteness, comics, and pop culture get liked on Facebook.&lt;/li&gt;
&lt;li&gt;APIs, datasets, visualizations related to techie sites (Delicious, foursquare, Twitter, LinkedIn), and picture-less articles get tweeted instead.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Interestingly, it also looks like the colors in the top 10 Facebook articles tend to the red end of the spectrum, while the colors in the top 10 Twitter articles tend to the blue end of the spectrum. Does this pattern hold if we look at more data? Here&amp;#8217;s a meta-visualization of the FlowingData articles, sorted by articles popular on Facebook in the top left to articles popular on Twitter in the bottom right (see &lt;a href=&quot;http://flowingdata-melted.heroku.com/&quot;&gt;here&lt;/a&gt; for some interactivity and more details):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://flowingdata-melted.heroku.com/&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/flowingdata-metaviz/flowingdata-metaviz.png&quot; alt=&quot;FlowingData MetaViz&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It does indeed look like the images at the top (the articles popular on Facebook) are more pink, while the images at the bottom (the articles popular on Twitter) are more blue (though it would be nice to quantify this in some way)!&lt;/p&gt;

&lt;p&gt;Furthermore, we can easily see from the grid that articles with no visualizations (represented by lorem ipsum text in the grid) cluster at the bottom. Grabbing some actual numbers, we find that 32% of articles with at least one picture have more shares on Facebook than on Twitter, compared to only 4% of articles with no picture at all.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/flowingdata-viz-effect.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/flowingdata-viz-effect.png&quot; alt=&quot;Effect of a visualization&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, let&amp;#8217;s break down the percentage of articles with more Facebook shares by category.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/flowingdata-categories.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/flowingdata-categories.png&quot; alt=&quot;FlowingData Categories&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(I filtered the categories so that each category in the plot above contains at least 5 articles.)&lt;/p&gt;

&lt;p&gt;What do we find?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Articles in the Software, Online Applications, News, and Data sources categories (yawn) get 100% of their shares from Twitter.&lt;/li&gt;
&lt;li&gt;Articles tagged with &lt;a href=&quot;http://flowingdata.com/category/projects/data-underload/&quot;&gt;Data Underload&lt;/a&gt; (which seems to contain short and sweet visualizations of everyday things), &lt;a href=&quot;http://flowingdata.com/category/miscellaneous-data/&quot;&gt;Miscellaneous&lt;/a&gt; (which contains lots of comics or comic-like visualizations), and &lt;a href=&quot;http://flowingdata.com/category/visualization/infographics/&quot;&gt;Infographics&lt;/a&gt; get the most shares on Facebook.&lt;/li&gt;
&lt;li&gt;This category breakdown matches precisely what we saw in the top 10 examples above.&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;New Scientist&lt;/h1&gt;

&lt;p&gt;When looking at FlowingData, we saw that Twitter users are much bigger on sharing technical articles. But is this true for technical articles in general, or only for programming-related posts? (In my experience with Twitter, I haven&amp;#8217;t seen many people from math and the non-computer sciences.)&lt;/p&gt;

&lt;p&gt;To answer, I took articles from the &lt;a href=&quot;http://www.newscientist.com/search?rbsection1=Physics+%26+Math&amp;amp;sortby=rbpubdate&quot;&gt;Physics &amp;amp; Math&lt;/a&gt; and &lt;a href=&quot;http://www.newscientist.com/search?rbsection1=tech&amp;amp;sortby=rbpubdate&quot;&gt;Technology&lt;/a&gt; sections of &lt;a href=&quot;http://www.newscientist.com&quot;&gt;New Scientist&lt;/a&gt;, and&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calculated the percentage of shares each article received on Twitter (i.e., # tweets / (# tweets + # likes)).&lt;/li&gt;
&lt;li&gt;Grouped articles by their number of tweets rounded to the nearest multiple of 25 (bin #1 contains articles close to 25 tweets, bin #2 contains articles close to 50 tweets, etc.).&lt;/li&gt;
&lt;li&gt;Calculated the median percentage of shares on Twitter for each bin.&lt;/li&gt;
&lt;/ul&gt;
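&lt;p&gt;In code, those three steps look roughly like the sketch below. The function name and the input format (a list of tweet and like counts per article) are hypothetical, and skipping articles with no shares at all is my own assumption, since their Twitter percentage is undefined.&lt;/p&gt;

```python
from collections import defaultdict
from statistics import median


def median_twitter_share_by_bin(articles, bin_width=25):
    """Median Twitter share per tweet-count bin.

    articles: iterable of (tweets, likes) count pairs.
    Returns {bin_number: median share}, where bin 1 holds articles with
    roughly 25 tweets, bin 2 roughly 50 tweets, and so on.
    """
    shares_by_bin = defaultdict(list)
    for tweets, likes in articles:
        total = tweets + likes
        if total == 0:
            continue  # no shares at all, so the percentage is undefined
        share = tweets / total
        # Round the tweet count to the nearest multiple of bin_width
        # (halves round up), then express it as a bin number.
        bin_number = (2 * tweets + bin_width) // (2 * bin_width)
        shares_by_bin[bin_number].append(share)
    return {b: median(s) for b, s in sorted(shares_by_bin.items())}
```

&lt;p&gt;Note that articles with only a handful of tweets land in bin 0 under this rounding. Running the function once per section gives the two curves to compare.&lt;/p&gt;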


&lt;p&gt;Here&amp;#8217;s a graph of the result:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/tech_vs_physicsmath.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/tech_vs_physicsmath.png&quot; alt=&quot;Technology vs. Physics &amp;amp; Math&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The technology articles get consistently more shares from Twitter than the physics and math articles do.&lt;/li&gt;
&lt;li&gt;Twitter accounts for the majority of the technology shares.&lt;/li&gt;
&lt;li&gt;Facebook accounts for the majority of the physics and math shares.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So this suggests that Twitter really is for computer technology in particular, not technical matters in general (though it would be nice to look at areas other than physics and math as well).&lt;/p&gt;

&lt;h1&gt;Quora&lt;/h1&gt;

&lt;p&gt;To get some additional evidence on the computer science vs. math/physics divide, I&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scraped about 350 profiles of followers from each of the Computer Science, Software Engineering, Mathematics, and Physics categories on Quora;&lt;/li&gt;
&lt;li&gt;Checked each user to see whether they link to their Facebook and Twitter accounts on their profile.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Here&amp;#8217;s the ratio of the number of people linking to their Facebook account to the number of people linking to their Twitter account, sliced by topic:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/math-physics-vs-cs-software.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/math-physics-vs-cs-software.png&quot; alt=&quot;Math/Physics vs. CS/Software&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/math-physics-vs-cs-software-collapsed.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/math-physics-vs-cs-software-collapsed.png&quot; alt=&quot;Math/Physics vs. CS/Software, Collapsed&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We find exactly what we expect from the New Scientist data: people following the math and physics categories have noticeably larger Facebook / Twitter ratios compared to people following the computer science and software engineering categories (i.e., compared to computer scientists and software engineers, mathematicians and physicists are more likely to be on Facebook than on Twitter). What&#8217;s more, this difference is in fact significant: the graphs display individual 90% confidence intervals (which overlap not at all or only slightly), and we do indeed get significance at the 95% level if we look at the differences between categories.&lt;/p&gt;

&lt;p&gt;This corroborates the New Scientist evidence that Twitter gets the computer technology shares, while Facebook gets the math and physics shares.&lt;/p&gt;

&lt;h1&gt;XKCD&lt;/h1&gt;

&lt;p&gt;Finally, let&amp;#8217;s take a look at which XKCD comics are especially popular on Facebook vs. Twitter.&lt;/p&gt;

&lt;p&gt;Here are the 10 comics with the highest likes-to-tweets ratio (i.e., the comics especially popular on Facebook):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/xkcd-facebook.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/xkcd-facebook-small.png&quot; alt=&quot;XKCD Facebook&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/846/&quot;&gt;Dental Nerve&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/861/&quot;&gt;Wisdom Teeth&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/876/&quot;&gt;Trapped&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/849/&quot;&gt;Complex Conjugate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/854/&quot;&gt;Learning to Cook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/840/&quot;&gt;Serious&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/839/&quot;&gt;Explorers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/815/&quot;&gt;Mu&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/809/&quot;&gt;Los Alamos&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/911/&quot;&gt;Magic School Bus&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Here are the 10 comics with the highest tweets-to-likes ratio (i.e., the comics especially popular on Twitter):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/xkcd-twitter.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/xkcd-twitter-small.png&quot; alt=&quot;XKCD Twitter&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/869/&quot;&gt;Server Attention Span&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/818/&quot;&gt;Illness&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/865/&quot;&gt;Nanobots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/912/&quot;&gt;Manual Override&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/908/&quot;&gt;The Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/810/&quot;&gt;Constructive&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/887/&quot;&gt;Future Timeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/844/&quot;&gt;Good Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/801/&quot;&gt;Golden Hammer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/906/&quot;&gt;Advertising Discovery&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://xkcd.com/802/&quot;&gt;Online Communities 2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Note that the XKCD comics popular on Facebook have more of a layman flavor, while the XKCD comics popular on Twitter are much more programming-related:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Of the XKCD comics popular on Twitter, one&amp;#8217;s about server attention spans, another&amp;#8217;s about IPv6 addresses, a third is about GNU info pages, another deals with cloud computing, a fifth talks about Java, and the last is about a bunch of techie sites. (This is just like what we saw with the FlowingData visualizations.)&lt;/li&gt;
&lt;li&gt;Facebook, on the other hand, gets Ke$ha and Magic School Bus.&lt;/li&gt;
&lt;li&gt;And while both top 10&amp;#8217;s contain a flowchart, the one popular on FB is about &lt;em&gt;cooking&lt;/em&gt;, while the one popular on Twitter is about &lt;em&gt;code&lt;/em&gt;!&lt;/li&gt;
&lt;li&gt;What&amp;#8217;s more, if we look at the few technical-ish comics that are more popular on Facebook (the complex conjugate, mu, and Los Alamos comics), we see that they&amp;#8217;re about physics and math, not programming (which matches our findings from the New Scientist articles).&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;Lesson&lt;/h1&gt;

&lt;p&gt;So why should you care? Here&amp;#8217;s one takeaway:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you&amp;#8217;re blogging about technology, programming, and computer science, Twitter is your friend.&lt;/li&gt;
&lt;li&gt;But if you&amp;#8217;re blogging about anything else, be it math/physics or pop culture, don&amp;#8217;t rely on a Twitter account alone; your shares are more likely to propagate on Facebook, so make sure to have a Facebook page as well.&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;What&amp;#8217;s Next?&lt;/h1&gt;

&lt;p&gt;The three websites I looked at are all fairly tech-oriented, so it would be nice to gather data from other kinds of websites as well.&lt;/p&gt;

&lt;p&gt;And now that we have an idea how Twitter and Facebook compare, the next burning question is surely: &lt;a href=&quot;http://finalbossform.com/post/7214184180/google-is-fast-becoming-the-leading-social&quot;&gt;what do people share on Google+?!&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Addendum&lt;/h1&gt;

&lt;p&gt;Let&amp;#8217;s consider the following thought experiment. Suppose you come across the most unpopular article ever written. What will its FB vs. Twitter shares look like? Although no &lt;em&gt;real&lt;/em&gt; person will ever share this article, I think Twitter has many more spambots (who tweet out any and every link) than FB does, so maybe unpopular articles will have more tweets than likes by default. Conversely, suppose you come across the most popular article ever written, which everybody wants to share. Then since FB has many more users than Twitter does, maybe popular articles will tend to have more likes than tweets anyways.&lt;/p&gt;

&lt;p&gt;Thus, in order to find out which types of articles are &lt;em&gt;especially&lt;/em&gt; popular on FB vs. Twitter, instead of looking at tweets-to-likes ratios directly, we could try to remove this baseline popularity effect. (Taking ratios instead of raw number of tweets or raw number of likes is one kind of normalization; this is another.)&lt;/p&gt;

&lt;p&gt;So does this scenario (or something similar to it) actually play out in practice?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/flowingdata-overall-popularity-vs-fb.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/flowingdata-overall-popularity-vs-fb.png&quot; alt=&quot;Overall Popularity vs. Facebook&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here I&amp;#8217;ve plotted the overall popularity of a post (the total number of shares it received on either Twitter or FB) against the percentage of shares on Facebook alone, and we can see that as a post&amp;#8217;s popularity grows, more and more shares do indeed tend to come from Facebook rather than Twitter.&lt;/p&gt;

&lt;p&gt;Also, see the posts at the lower end of the popularity scale that are only getting shares on Twitter? Let&amp;#8217;s take a look at the five most unpopular of these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/03/31/flowingdata-is-brought-to-you-by-8/&quot;&gt;Flowing Data is brought to you by&amp;#8230; (March 2011 edition)&lt;/a&gt; (11 tweets, 0 likes)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/07/05/flowingdata-is-brought-to-you-by-11/&quot;&gt;Flowing Data is brought to you by&amp;#8230; (July 2011 edition)&lt;/a&gt; (14 tweets, 0 likes)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/06/06/flowingdata-is-brought-to-you-by-10/&quot;&gt;Flowing Data is brought to you by&amp;#8230; (June 2011 edition)&lt;/a&gt; (17 tweets, 0 likes)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/05/09/flowingdata-is-brought-to-you-by-9/&quot;&gt;Flowing Data is brought to you by&amp;#8230; (May 2011 edition)&lt;/a&gt; (18 tweets, 0 likes)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/02/28/flowingdata-is-brought-to-you-by-7/&quot;&gt;Flowing Data is brought to you by&#8230; (February 2011 edition)&lt;/a&gt; (12 tweets, 1 like)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Notice that they&amp;#8217;re all shoutouts to FlowingData&amp;#8217;s sponsors! There&amp;#8217;s pretty much no reason any &lt;em&gt;real&lt;/em&gt; person would share these on Twitter or Facebook, and indeed, checking Twitter to see who actually tweeted out these links, we see that the tweeters are bots:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://twitter.com/#!/myVisualization/status/77685824224894976&quot;&gt;https://twitter.com/#!/myVisualization/status/77685824224894976&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://twitter.com/#!/InfographicTwts/status/67668615142457344&quot;&gt;https://twitter.com/#!/InfographicTwts/status/67668615142457344&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://twitter.com/#!/guysgoogle/status/77644902510493696&quot;&gt;https://twitter.com/#!/guysgoogle/status/77644902510493696&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://twitter.com/#!/WhereIsYourData/status/77631743292735488&quot;&gt;https://twitter.com/#!/WhereIsYourData/status/77631743292735488&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Now let&amp;#8217;s switch to a slightly different view of the above scenario, where I plot number of tweets against number of likes:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/flowingdata-tweets-vs-likes.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/flowingdata-tweets-vs-likes.png&quot; alt=&quot;FlowingData Tweets vs. Likes&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We see that as popularity on Twitter increases, so too does popularity on Facebook &amp;#8211; but at a slightly faster rate. (The form of the blue line plotted is roughly $\log(likes) = -3.87 + 1.70 \log(tweets)$.)&lt;/p&gt;

&lt;p&gt;So instead of looking at the ratios above, to figure out which articles are popular on FB vs. Twitter, we could look at the residuals of the above plot. Posts with large positive residuals would be posts that are especially popular on FB, and posts with negative residuals would be posts that are especially popular on Twitter.&lt;/p&gt;
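
&lt;p&gt;Here&#8217;s a rough sketch of the residual idea in plain Python. It assumes an ordinary least-squares fit on the log-log scale and strictly positive counts (a post with zero likes would need some smoothing first); the helper names and the sample counts are invented for the example.&lt;/p&gt;

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def log_residuals(posts):
    """posts: list of (tweets, likes) pairs, both strictly positive.

    Returns log(likes) minus the fitted value for each post, so a positive
    residual means more likes than the tweet count alone would predict.
    """
    xs = [math.log(tweets) for tweets, _ in posts]
    ys = [math.log(likes) for _, likes in posts]
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical (tweets, likes) counts, for illustration only.
posts = [(10, 5), (20, 30), (40, 40), (80, 300), (160, 500)]
residuals = log_residuals(posts)

# Largest residual first: the top of this list leans Facebook, the bottom leans Twitter.
ranked = sorted(zip(residuals, posts), reverse=True)
```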

&lt;p&gt;In practice, however, there wasn&amp;#8217;t much difference between looking at residuals vs. ratios directly when using the datasets I had, so to keep things simple in the main discussion above, I stuck to ratios alone. Still, it&amp;#8217;s another option which might be useful when looking at different questions or different sources of data, so just for completeness, here&amp;#8217;s what the FlowingData results look like if we use residuals instead.&lt;/p&gt;

&lt;p&gt;The 10 articles with the highest residuals (i.e., the articles most popular on Facebook):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/06/10/what-you-need-to-get-together/&quot;&gt;What you need to get together&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/02/14/valentines-day-importance/&quot;&gt;Valentine’s Day importance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/01/30/what-your-state-is-the-worst-at-united-states-of-shame/&quot;&gt;What your state is the worst at – United States of shame&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/05/13/plush-statistical-distribution-pillows/&quot;&gt;Plush statistical distribution pillows&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/07/01/hitler-learns-topology/&quot;&gt;Hitler learns topology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/01/27/dexters-victims-through-season-five/&quot;&gt;Dexter’s victims through season five&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/07/06/access-to-education-where-you-live/&quot;&gt;Access to education where you live&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/03/09/watching-costco-warehouses-open-nationwide/&quot;&gt;Watching the growth of Costco warehouses&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/03/22/are-gas-prices-really-that-high/&quot;&gt;Are gas prices really that high?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/01/21/flight-safety-esque-beer-pong-guide/&quot;&gt;Flight safety-esque beer pong guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The 10 articles with the lowest residuals (i.e., the articles most popular on Twitter):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/05/25/pew-research-raw-survey-data-now-available/&quot;&gt;Pew Research raw survey data now available&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/01/24/explore-your-linkedin-network-visually-with-inmaps/&quot;&gt;Explore your LinkedIn network visually with InMaps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/02/03/stock-market-predictions-with-twitter/&quot;&gt;Stock market predictions with Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/01/04/delicious-mass-exodus/&quot;&gt;Delicious mass exodus&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/03/25/open-source-data-science-toolkit/&quot;&gt;Open-source Data Science Toolkit&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/04/17/business-intelligence-vs-infotainment/&quot;&gt;Business intelligence vs. infotainment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/04/20/see-what-you-and-others-tweet-about-with-the-topic-explorer/&quot;&gt;See what you and others tweet about with the Topic Explorer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/01/25/growth-and-usage-of-foursquare-in-2010/&quot;&gt;Growth and usage of foursquare in 2010&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/05/10/flash-vs-html5/&quot;&gt;Flash vs. HTML5&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://flowingdata.com/2011/06/09/gender-and-time-comparisons-on-twitter/&quot;&gt;Gender and time comparisons on Twitter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Here&amp;#8217;s a density plot of article residuals, split by whether the article has a visualization or not (residuals of picture-free articles are clearly shifted towards the negative end):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/has-viz-residuals.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/has-viz-residuals.png&quot; alt=&quot;Residuals&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are the mean residuals per category (again, we see that the miscellaneous, data underload, data art, and infographics categories tend to be more popular on Facebook, while the data sources, software, online applications, and news categories tend to be more popular on Twitter):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/category-residuals.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/likes-vs-tweets/category-residuals.png&quot; alt=&quot;Category Residuals&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that&amp;#8217;s it! In the spirit of these findings, I hope this article gets &lt;a href=&quot;http://blog.echen.me/2011/07/28/tweets-vs-likes-what-gets-shared-on-twitter-vs-facebook/?share=facebook&amp;amp;nb=1&quot;&gt;liked&lt;/a&gt; a little and &lt;a href=&quot;https://twitter.com/share?original_referer=http%3A%2F%2Fblog.echen.me%2F2011%2F07%2F28%2Ftweets-vs-likes-what-gets-shared-on-twitter-vs-facebook%2F&amp;amp;source=tweetbutton&amp;amp;text=Tweets%20vs.%20Likes%3A%20What%20gets%20shared%20on%20Twitter%20vs.%20Facebook%3F%3A&amp;amp;url=http%3A%2F%2Fwp.me%2Fpy9AS-6P&quot;&gt;tweeted&lt;/a&gt; lots and lots.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Introduction to Restricted Boltzmann Machines</title>
    <link href="http://blog.echen.me/2011/07/18/introduction-to-restricted-boltzmann-machines/"/>
    <updated>2011-07-18T09:32:52-07:00</updated>
    <id>http://blog.echen.me/2011/07/18/introduction-to-restricted-boltzmann-machines</id>
    <content type="html">&lt;p&gt;Suppose you ask a bunch of users to rate a set of movies on a 0-100 scale. In classical &lt;a href=&quot;http://en.wikipedia.org/wiki/Factor_analysis&quot;&gt;factor analysis&lt;/a&gt;, you could then try to explain each movie and user in terms of a set of latent &lt;em&gt;factors&lt;/em&gt;. For example, movies like Star Wars and Lord of the Rings might have strong associations with a latent science fiction and fantasy factor, and users who like Wall-E and Toy Story might have strong associations with a latent Pixar factor.&lt;/p&gt;




&lt;p&gt;Restricted Boltzmann Machines essentially perform a &lt;em&gt;binary&lt;/em&gt; version of factor analysis. (This is one way of thinking about RBMs; there are, of course, others, and lots of different ways to use RBMs, but I&amp;#8217;ll adopt this approach for this post.) Instead of users rating a set of movies on a continuous scale, they simply tell you whether they like a movie or not, and the RBM will try to discover latent factors that can explain the activation of these movie choices.&lt;/p&gt;




&lt;p&gt;More technically, a Restricted Boltzmann Machine is a &lt;strong&gt;stochastic neural network&lt;/strong&gt; (&lt;em&gt;neural network&lt;/em&gt; meaning we have neuron-like units whose binary activations depend on the neighbors they&amp;#8217;re connected to; &lt;em&gt;stochastic&lt;/em&gt; meaning these activations have a probabilistic element) consisting of:&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;One layer of &lt;strong&gt;visible units&lt;/strong&gt; (users&amp;#8217; movie preferences whose states we know and set);&lt;/li&gt;
&lt;li&gt;One layer of &lt;strong&gt;hidden units&lt;/strong&gt; (the latent factors we try to learn); and&lt;/li&gt;
&lt;li&gt;A bias unit (whose state is always on, and is a way of adjusting for the different inherent popularities of each movie).&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Furthermore, each visible unit is connected to all the hidden units (this connection is undirected, so each hidden unit is also connected to all the visible units), and the bias unit is connected to all the visible units and all the hidden units. To make learning easier, we restrict the network so that no visible unit is connected to any other visible unit and no hidden unit is connected to any other hidden unit.&lt;/p&gt;




&lt;p&gt;For example, suppose we have a set of six movies (Harry Potter, Avatar, LOTR 3, Gladiator, Titanic, and Glitter) and we ask users to tell us which ones they want to watch. If we want to learn two latent units underlying movie preferences &amp;#8211; for example, two natural groups in our set of six movies appear to be SF/fantasy (containing Harry Potter, Avatar, and LOTR 3) and Oscar winners (containing LOTR 3, Gladiator, and Titanic), so we might hope that our latent units will correspond to these categories &amp;#8211; then our RBM would look like the following:&lt;/p&gt;




&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/rbms/rbm-example.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/rbms/rbm-example.png&quot; alt=&quot;RBM Example&quot; /&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;(Note the resemblance to a factor analysis graphical model.)&lt;/p&gt;




&lt;h1&gt;State Activation&lt;/h1&gt;




&lt;p&gt;Restricted Boltzmann Machines, and neural networks in general, work by updating the states of some neurons given the states of others, so let&amp;#8217;s talk about how the states of individual units change. Assuming we know the connection weights in our RBM (we&amp;#8217;ll explain how to learn these below), to update the state of unit $i$:&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;Compute the &lt;strong&gt;activation energy&lt;/strong&gt; $a\_i = &amp;#92;sum\_j w\_{ij} x\_j$ of unit $i$, where the sum runs over all units $j$ that unit $i$ is connected to, $w\_{ij}$ is the weight of the connection between $i$ and $j$, and $x\_j$ is the 0 or 1 state of unit $j$. In other words, all of unit $i$&amp;#8217;s neighbors send it a message, and we compute the sum of all these messages.&lt;/li&gt;
&lt;li&gt;Let $p\_i = &amp;#92;sigma(a\_i)$, where $&amp;#92;sigma(x) = 1/(1 + exp(-x))$ is the logistic function. Note that $p\_i$ is close to 1 for large positive activation energies, and $p\_i$ is close to 0 for negative activation energies.&lt;/li&gt;
&lt;li&gt;We then turn unit $i$ on with probability $p\_i$, and turn it off with probability $1 - p\_i$.&lt;/li&gt;
&lt;li&gt;(In layman&amp;#8217;s terms, units that are positively connected to each other try to get each other to share the same state (i.e., be both on or off), while units that are negatively connected to each other are enemies that prefer to be in different states.)&lt;/li&gt;
&lt;/ul&gt;
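
&lt;p&gt;In code, the update rule above might look something like this. It&#8217;s a minimal sketch, not the implementation linked later in this post: the function names are my own, and each unit is handed the weights and states of its neighbors as two parallel lists.&lt;/p&gt;

```python
import math
import random


def logistic(a):
    """The logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def activation_probability(weights, states):
    """Probability that a unit turns on.

    weights[j] is the weight of the connection to neighbor j and
    states[j] is that neighbor's 0 or 1 state.
    """
    energy = sum(w * x for w, x in zip(weights, states))
    return logistic(energy)


def sample_state(weights, states, rng=random):
    """Turn the unit on with probability p_i and off otherwise."""
    p = activation_probability(weights, states)
    return 1 if p > rng.random() else 0


# A unit with two active neighbors (weights 2.0 and -1.0) and one inactive neighbor.
p_on = activation_probability([2.0, -1.0, 0.5], [1, 1, 0])
new_state = sample_state([2.0, -1.0, 0.5], [1, 1, 0])
```

&lt;p&gt;Here the activation energy is 2.0 - 1.0 = 1.0, so the unit turns on a bit less than three quarters of the time.&lt;/p&gt;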




&lt;p&gt;For example, let&amp;#8217;s suppose our two hidden units really do correspond to SF/fantasy and Oscar winners.&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;If Alice has told us her six binary preferences on our set of movies, we could then ask our RBM which of the hidden units her preferences activate (i.e., ask the RBM to explain her preferences in terms of latent factors). So the six movies send messages to the hidden units, telling them to update themselves. (Note that even if Alice has declared she wants to watch Harry Potter, Avatar, and LOTR 3, this doesn&amp;#8217;t guarantee that the SF/fantasy hidden unit will turn on, but only that it will turn on with high &lt;em&gt;probability&lt;/em&gt;. This makes a bit of sense: in the real world, Alice wanting to watch all three of those movies makes us highly suspect she likes SF/fantasy in general, but there&amp;#8217;s a small chance she wants to watch them for other reasons. Thus, the RBM allows us to &lt;em&gt;generate&lt;/em&gt; models of people in the messy, real world.)&lt;/li&gt;
&lt;li&gt;Conversely, if we know that one person likes SF/fantasy (so that the SF/fantasy unit is on), we can then ask the RBM which of the movie units that hidden unit turns on (i.e., ask the RBM to generate a set of movie recommendations). So the hidden units send messages to the movie units, telling them to update their states. (Again, note that the SF/fantasy unit being on doesn&amp;#8217;t guarantee that we&amp;#8217;ll always recommend all three of Harry Potter, Avatar, and LOTR 3 because, hey, not everyone who likes science fiction liked Avatar.)&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;Learning Weights&lt;/h1&gt;




&lt;p&gt;So how do we learn the connection weights in our network? Suppose we have a bunch of training examples, where each training example is a binary vector with six elements corresponding to a user&amp;#8217;s movie preferences. Then for each epoch, do the following:&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;Take a training example (a set of six movie preferences). Set the states of the visible units to these preferences.&lt;/li&gt;
&lt;li&gt;Next, update the states of the hidden units using the logistic activation rule described above: for the $j$th hidden unit, compute its activation energy $a\_j = &amp;#92;sum\_i w\_{ij} x\_i$, and set $x\_j$ to 1 with probability $&amp;#92;sigma(a\_j)$ and to 0 with probability $1 - &amp;#92;sigma(a\_j)$. Then for each edge $e\_{ij}$, compute $Positive(e\_{ij}) = x\_i \* x\_j$ (i.e., for each pair of units, measure whether they&amp;#8217;re both on).&lt;/li&gt;
&lt;li&gt;Now &lt;strong&gt;reconstruct&lt;/strong&gt; the visible units in a similar manner: for each visible unit, compute its activation energy $a\_i$, and update its state. (Note that this &lt;em&gt;reconstruction&lt;/em&gt; may not match the original preferences.) Then update the hidden units again, and compute $Negative(e\_{ij}) = x\_i \* x\_j$ for each edge.&lt;/li&gt;
&lt;li&gt;Update the weight of each edge $e\_{ij}$ by setting $w\_{ij} = w\_{ij} + L \* (Positive(e\_{ij}) - Negative(e\_{ij}))$, where $L$ is a learning rate.&lt;/li&gt;
&lt;li&gt;Repeat over all training examples.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Continue until the network converges (i.e., the error between the training examples and their reconstructions falls below some threshold) or we reach some maximum number of epochs.&lt;/p&gt;
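
&lt;p&gt;To make the steps concrete, here is a compact sketch of the training loop in plain Python. It follows the recipe above but simplifies in a few places, all of which are my assumptions for the example: the bias unit is stored as index 0 of both layers, weights start as small random numbers, and the loop just runs a fixed number of epochs rather than checking a reconstruction-error threshold. The toy data at the bottom is made up.&lt;/p&gt;

```python
import math
import random


def logistic(a):
    """Logistic function, written so that very negative inputs cannot overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def sample_layer(inputs, weight_lists, rng):
    """Sample a 0/1 state for each receiving unit.

    inputs: states of the sending layer, bias state (always 1) first.
    weight_lists: for each receiving unit, its weights to the sending layer.
    """
    return [1 if logistic(sum(w * x for w, x in zip(ws, inputs))) > rng.random() else 0
            for ws in weight_lists]


def hidden_weights(w):
    """For each hidden unit j, its weights to the bias and visible units."""
    return [[row[j] for row in w] for j in range(1, len(w[0]))]


def train_rbm(examples, num_hidden, learning_rate=0.1, epochs=100, seed=0):
    rng = random.Random(seed)
    num_visible = len(examples[0])
    # w[i][j] links visible unit i and hidden unit j; index 0 is the bias unit.
    w = [[rng.gauss(0.0, 0.1) for _ in range(num_hidden + 1)]
         for _ in range(num_visible + 1)]
    for _ in range(epochs):
        for example in examples:
            # Positive phase: clamp the visible units to the data, then sample hidden units.
            v0 = [1] + list(example)
            h0 = [1] + sample_layer(v0, hidden_weights(w), rng)
            # Reconstruction: sample the visible units from the hidden units, then the hidden units again.
            v1 = [1] + sample_layer(h0, w[1:], rng)
            h1 = [1] + sample_layer(v1, hidden_weights(w), rng)
            # w_ij += L * (Positive(e_ij) - Negative(e_ij))
            for i in range(num_visible + 1):
                for j in range(num_hidden + 1):
                    w[i][j] += learning_rate * (v0[i] * h0[j] - v1[i] * h1[j])
    return w


# Hypothetical toy data: two kinds of users over three items.
toy_examples = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
toy_weights = train_rbm(toy_examples, num_hidden=2, epochs=50)
```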




&lt;p&gt;Why does this update rule make sense? Note that&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;In the first phase, $Positive(e\_{ij})$ measures the association between the $i$th and $j$th unit that we &lt;em&gt;want&lt;/em&gt; the network to learn from our training examples;&lt;/li&gt;
&lt;li&gt;In the &amp;#8220;reconstruction&amp;#8221; phase, where the RBM generates the states of visible units based on its hypotheses about the hidden units alone, $Negative(e\_{ij})$ measures the association that the network &lt;em&gt;itself&lt;/em&gt; generates (or &amp;#8220;daydreams&amp;#8221; about) when no units are fixed to training data.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;So by adding $Positive(e\_{ij}) - Negative(e\_{ij})$ to each edge weight, we&amp;#8217;re helping the network&amp;#8217;s daydreams better match the reality of our training examples.&lt;/p&gt;




&lt;p&gt;(You may hear this update rule called &lt;strong&gt;contrastive divergence&lt;/strong&gt;, which is basically a funky term for &amp;#8220;approximate gradient descent&amp;#8221;.)&lt;/p&gt;




&lt;h1&gt;Examples&lt;/h1&gt;




&lt;p&gt;I wrote &lt;a href=&quot;https://github.com/echen/restricted-boltzmann-machines&quot;&gt;a simple RBM implementation&lt;/a&gt; in Python (the code is heavily commented, so take a look if you&amp;#8217;re still a little fuzzy on how everything works), so let&amp;#8217;s use it to walk through some examples.&lt;/p&gt;




&lt;p&gt;First, I trained the RBM using some fake data.&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;Alice: (Harry Potter = 1, Avatar = 1, LOTR 3 = 1, Gladiator = 0, Titanic = 0, Glitter = 0). Big SF/fantasy fan.&lt;/li&gt;
&lt;li&gt;Bob: (Harry Potter = 1, Avatar = 0, LOTR 3 = 1, Gladiator = 0, Titanic = 0, Glitter = 0). SF/fantasy fan, but doesn&amp;#8217;t like Avatar.&lt;/li&gt;
&lt;li&gt;Carol: (Harry Potter = 1, Avatar = 1, LOTR 3 = 1, Gladiator = 0, Titanic = 0, Glitter = 0). Big SF/fantasy fan.&lt;/li&gt;
&lt;li&gt;David: (Harry Potter = 0, Avatar = 0, LOTR 3 = 1, Gladiator = 1, Titanic = 1, Glitter = 0). Big Oscar winners fan.&lt;/li&gt;
&lt;li&gt;Eric: (Harry Potter = 0, Avatar = 0, LOTR 3 = 1, Gladiator = 1, Titanic = 0, Glitter = 0). Oscar winners fan, except for Titanic.&lt;/li&gt;
&lt;li&gt;Fred: (Harry Potter = 0, Avatar = 0, LOTR 3 = 1, Gladiator = 1, Titanic = 1, Glitter = 0). Big Oscar winners fan.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;The network learned the following weights:&lt;/p&gt;




&lt;pre&gt;&lt;code&gt;                 Bias Unit       Hidden 1        Hidden 2
Bias Unit       -0.08257658     -0.19041546      1.57007782
Harry Potter    -0.82602559     -7.08986885      4.96606654
Avatar          -1.84023877     -5.18354129      2.27197472
LOTR 3           3.92321075      2.51720193      4.11061383
Gladiator        0.10316995      6.74833901     -4.00505343
Titanic         -0.97646029      3.25474524     -5.59606865
Glitter         -4.44685751     -2.81563804     -2.91540988
&lt;/code&gt;&lt;/pre&gt;




&lt;p&gt;Note that the first hidden unit seems to correspond to the Oscar winners, and the second hidden unit seems to correspond to the SF/fantasy movies, just as we were hoping.&lt;/p&gt;




&lt;p&gt;What happens if we give the RBM a new user, George, who has (Harry Potter = 0, Avatar = 0, LOTR 3 = 0, Gladiator = 1, Titanic = 1, Glitter = 0) as his preferences? It turns the Oscar winners unit on (but not the SF/fantasy unit), correctly guessing that George probably likes movies that are Oscar winners.&lt;/p&gt;




&lt;p&gt;What happens if we activate only the SF/fantasy unit, and run the RBM a bunch of different times? In my trials, it turned on Harry Potter, Avatar, and LOTR 3 three times; it turned on Avatar and LOTR 3, but not Harry Potter, once; and it turned on Harry Potter and LOTR 3, but not Avatar, twice. Note that, based on our training examples, these generated preferences do indeed match what we might expect real SF/fantasy fans want to watch.&lt;/p&gt;
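
&lt;p&gt;Both of these experiments can be checked by hand from the weight table. The sketch below assumes the table is read in the natural way (each row holds a unit&#8217;s weights to the bias unit, Hidden 1, and Hidden 2, and the first row is the bias unit itself); the function names are mine, not the ones in the linked implementation.&lt;/p&gt;

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


MOVIES = ["Harry Potter", "Avatar", "LOTR 3", "Gladiator", "Titanic", "Glitter"]

# Learned weights copied from the table in this post.
# Columns: (Bias Unit, Hidden 1, Hidden 2). First row: the bias unit itself.
WEIGHTS = [
    (-0.08257658, -0.19041546, 1.57007782),   # Bias Unit
    (-0.82602559, -7.08986885, 4.96606654),   # Harry Potter
    (-1.84023877, -5.18354129, 2.27197472),   # Avatar
    (3.92321075, 2.51720193, 4.11061383),     # LOTR 3
    (0.10316995, 6.74833901, -4.00505343),    # Gladiator
    (-0.97646029, 3.25474524, -5.59606865),   # Titanic
    (-4.44685751, -2.81563804, -2.91540988),  # Glitter
]


def hidden_probabilities(preferences):
    """Activation probabilities of (Hidden 1, Hidden 2) given six 0/1 preferences."""
    visible = [1] + list(preferences)  # the bias unit is always on
    return [logistic(sum(row[h] * x for row, x in zip(WEIGHTS, visible)))
            for h in (1, 2)]


def movie_probabilities(hidden_states):
    """Activation probability of each movie given the 0/1 states of (Hidden 1, Hidden 2)."""
    hidden = [1] + list(hidden_states)  # the bias unit is always on
    return {movie: logistic(sum(w * x for w, x in zip(row, hidden)))
            for movie, row in zip(MOVIES, WEIGHTS[1:])}


# George likes Gladiator and Titanic only.
george = hidden_probabilities([0, 0, 0, 1, 1, 0])

# Only the SF/fantasy unit (Hidden 2) is on.
sf_fan = movie_probabilities([0, 1])
```

&lt;p&gt;Under that reading, George&#8217;s Oscar winners unit comes out with a probability very close to 1 and his SF/fantasy unit with a probability very close to 0. With only the SF/fantasy unit on, Harry Potter and LOTR 3 both sit above 0.95 while Avatar is only around 0.6, which fits the trials above where Avatar was the movie most often left off.&lt;/p&gt;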




&lt;h1&gt;Modifications&lt;/h1&gt;




&lt;p&gt;I tried to keep the connection-learning algorithm I described above pretty simple, so here are some modifications that often appear in practice:&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;Above, $Negative(e\_{ij})$ was determined by taking the product of the $i$th and $j$th units after reconstructing the visible units &lt;em&gt;once&lt;/em&gt; and then updating the hidden units again. We could also take the product after some larger number of reconstructions (i.e., repeat updating the visible units, then the hidden units, then the visible units again, and so on); this is slower, but describes the network&amp;#8217;s daydreams more accurately.&lt;/li&gt;
&lt;li&gt;Instead of using $Positive(e\_{ij})=x\_i \* x\_j$, where $x\_i$ and $x\_j$ are binary 0 or 1 &lt;em&gt;states&lt;/em&gt;, we could also let $x\_i$ and/or $x\_j$ be activation &lt;em&gt;probabilities&lt;/em&gt;. Similarly for $Negative(e\_{ij})$.&lt;/li&gt;
&lt;li&gt;We could penalize larger edge weights, in order to get a sparser or more regularized model.&lt;/li&gt;
&lt;li&gt;When updating edge weights, we could use a momentum factor: we would add to each edge a weighted sum of the current step as described above (i.e., $L \* (Positive(e\_{ij}) - Negative(e\_{ij}))$) and the step previously taken.&lt;/li&gt;
&lt;li&gt;Instead of updating the weights after every single training example, we could process &lt;em&gt;batches&lt;/em&gt; of examples, and only update the network&#8217;s weights after passing through all the examples in the batch. This can speed up the learning by taking advantage of fast matrix-multiplication algorithms.&lt;/li&gt;
&lt;/ul&gt;
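
&lt;p&gt;As a small illustration of the weight-penalty and momentum ideas, here is what a single edge update might look like. The penalty here is a plain L2 penalty, which is only one common choice, and the parameter names and default values are placeholders I picked for the example.&lt;/p&gt;

```python
def update_weight(weight, positive, negative, previous_step,
                  learning_rate=0.1, momentum=0.5, weight_penalty=0.001):
    """Return (new_weight, step) for one edge.

    positive and negative are the Positive(e_ij) and Negative(e_ij) statistics;
    they can be products of 0/1 states or of activation probabilities.
    previous_step is the step taken on this edge at the last update.
    """
    # The L2 penalty pulls large weights back toward zero.
    gradient = positive - negative - weight_penalty * weight
    # Momentum: blend the current step with the step previously taken.
    step = momentum * previous_step + learning_rate * gradient
    return weight + step, step


new_weight, step = update_weight(1.0, positive=1.0, negative=0.0,
                                 previous_step=0.2, weight_penalty=0.0)
```

&lt;p&gt;With the penalty switched off as in this call, the step is half of the previous step plus the usual $L \* (Positive(e\_{ij}) - Negative(e\_{ij}))$ term.&lt;/p&gt;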




&lt;h1&gt;Further&lt;/h1&gt;




&lt;p&gt;If you&amp;#8217;re interested in learning more about Restricted Boltzmann Machines, here are some good links.&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf&quot;&gt;A Practical Guide to Training Restricted Boltzmann Machines&lt;/a&gt;, by Geoffrey Hinton.&lt;/li&gt;
&lt;li&gt;A talk by Andrew Ng on &lt;a href=&quot;http://www.youtube.com/watch?v=ZmNOAtZIgIk&quot;&gt;Unsupervised Feature Learning and Deep Learning&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.machinelearning.org/proceedings/icml2007/papers/407.pdf&quot;&gt;Restricted Boltzmann Machines for Collaborative Filtering&lt;/a&gt;. I found this paper hard to read, but it&amp;#8217;s an interesting application to the Netflix Prize.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://arxiv.org/abs/0908.4425&quot;&gt;Geometry of the Restricted Boltzmann Machine&lt;/a&gt;. A very readable introduction to RBMs, &amp;#8220;starting with the observation that its Zariski closure is a Hadamard power of the first secant variety of the Segre variety of projective lines&amp;#8221;. (I kid, I kid.)&lt;/li&gt;
&lt;/ul&gt;

</content>
  </entry>
  
  <entry>
    <title>Topic Modeling the Sarah Palin Emails</title>
    <link href="http://blog.echen.me/2011/06/27/topic-modeling-the-sarah-palin-emails/"/>
    <updated>2011-06-27T17:19:42-07:00</updated>
    <id>http://blog.echen.me/2011/06/27/topic-modeling-the-sarah-palin-emails</id>
    <content type="html">&lt;h1&gt;LDA-based Email Browser&lt;/h1&gt;

&lt;p&gt;Earlier this month, several thousand emails from Sarah Palin&amp;#8217;s time as governor of Alaska were &lt;a href=&quot;http://sunlightlabs.com/blog/2011/sarahs-inbox/&quot;&gt;released&lt;/a&gt;. The emails weren&amp;#8217;t organized in any fashion, though, so to make them easier to browse, I&amp;#8217;ve been working on some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups.&lt;/p&gt;

&lt;p&gt;I threw up &lt;a href=&quot;http://sarah-palin.heroku.com/&quot;&gt;a simple demo app&lt;/a&gt; to view the organized documents &lt;a href=&quot;http://sarah-palin.heroku.com/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;What is Latent Dirichlet Allocation?&lt;/h1&gt;

&lt;p&gt;Briefly, given a set of documents, LDA tries to learn the latent topics underlying the set. It represents each document as a mixture of topics (generated from a Dirichlet distribution), each of which emits words with a certain probability.&lt;/p&gt;

&lt;p&gt;For example, given the sentence &amp;#8220;I listened to Justin Bieber and Lady Gaga on the radio while driving around in my car&amp;#8221;, an LDA model might represent this sentence as 75% about music (a topic which, say, emits the words &lt;em&gt;Bieber&lt;/em&gt; with 10% probability, &lt;em&gt;Gaga&lt;/em&gt; with 5% probability, &lt;em&gt;radio&lt;/em&gt; with 1% probability, and so on) and 25% about cars (which might emit &lt;em&gt;driving&lt;/em&gt; with 15% probability and &lt;em&gt;cars&lt;/em&gt; with 10% probability).&lt;/p&gt;
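&lt;p&gt;To make the mixture idea concrete, here is a small sketch that computes how likely a word is under the example document, using the illustrative percentages from the sentence above. The dictionaries and the &lt;code&gt;word_probability&lt;/code&gt; helper are hypothetical names for this example; a real LDA model would learn these numbers from data.&lt;/p&gt;

```python
# Illustrative numbers taken from the example sentence, not learned values.
doc_topics = {"music": 0.75, "cars": 0.25}
topic_words = {
    "music": {"Bieber": 0.10, "Gaga": 0.05, "radio": 0.01},
    "cars": {"driving": 0.15, "cars": 0.10},
}

def word_probability(word, doc_topics, topic_words):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(weight * topic_words[topic].get(word, 0.0)
               for topic, weight in doc_topics.items())

# "Bieber" can only come from the music topic: 0.75 * 0.10.
bieber = word_probability("Bieber", doc_topics, topic_words)
# "driving" can only come from the cars topic: 0.25 * 0.15.
driving = word_probability("driving", doc_topics, topic_words)
```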

&lt;p&gt;If you&amp;#8217;re familiar with latent semantic analysis, you can think of LDA as a generative version. (For a more in-depth explanation, I wrote an introduction to LDA &lt;a href=&quot;http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/&quot;&gt;here&lt;/a&gt;.)&lt;/p&gt;

&lt;h1&gt;Sarah Palin Email Topics&lt;/h1&gt;

&lt;p&gt;Here&amp;#8217;s a sample of the topics learnt by the model, as well as the top words for each topic. (Names, of course, are based on my own interpretation.)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://sarah-palin.heroku.com/topics/24&quot;&gt;&lt;strong&gt;Wildlife/BP Corrosion&lt;/strong&gt;&lt;/a&gt;: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, denby, fishing, …&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://sarah-palin.heroku.com/topics/0&quot;&gt;&lt;strong&gt;Energy/Fuel/Oil/Mining&lt;/strong&gt;&lt;/a&gt;: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, …&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://sarah-palin.heroku.com/topics/19&quot;&gt;&lt;strong&gt;Trig/Family/Inspiration&lt;/strong&gt;&lt;/a&gt;: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, …&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://sarah-palin.heroku.com/topics/6&quot;&gt;&lt;strong&gt;Gas&lt;/strong&gt;&lt;/a&gt;: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, …&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://sarah-palin.heroku.com/topics/12&quot;&gt;&lt;strong&gt;Education/Waste&lt;/strong&gt;&lt;/a&gt;: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, …&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://sarah-palin.heroku.com/topics/15&quot;&gt;&lt;strong&gt;Presidential Campaign/Elections&lt;/strong&gt;&lt;/a&gt;: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, …&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Here&amp;#8217;s a sample email from the wildlife topic:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://sarah-palin.heroku.com/emails/6719&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/palin-browser/wildlife-email.png&quot; alt=&quot;Wildlife Email&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also thought the classification for &lt;a href=&quot;http://sarah-palin.heroku.com/emails/12900&quot;&gt;this email&lt;/a&gt; was really neat: the LDA model labeled it as 10% in the &lt;a href=&quot;http://sarah-palin.heroku.com/topics/15&quot;&gt;Presidential Campaign/Elections&lt;/a&gt; topic and 90% in the &lt;a href=&quot;http://sarah-palin.heroku.com/topics/24&quot;&gt;Wildlife&lt;/a&gt; topic, and it&amp;#8217;s precisely a wildlife-based protest against Palin as a choice for VP:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://sarah-palin.heroku.com/emails/12900&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/palin-browser/wildlife-vp.png&quot; alt=&quot;Wildlife-VP Protest&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Future Analysis&lt;/h1&gt;

&lt;p&gt;In a future post, I&amp;#8217;ll perhaps see if we can glean any interesting patterns from the email topics. For example, for a quick graph now, if we look at the percentage of emails in the &lt;a href=&quot;http://sarah-palin.heroku.com/topics/19&quot;&gt;Trig/Family/Inspiration topic&lt;/a&gt; across time, we see that there&amp;#8217;s a spike in April 2008 &amp;#8211; exactly (and unsurprisingly) the month in which Trig was born.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/palin-browser/trig-topic.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/palin-browser/trig-topic.png&quot; alt=&quot;Trig&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Filtering for English Tweets: Unsupervised Language Detection on Twitter</title>
    <link href="http://blog.echen.me/2011/05/01/unsupervised-language-detection-algorithms/"/>
    <updated>2011-05-01T16:28:07-07:00</updated>
    <id>http://blog.echen.me/2011/05/01/unsupervised-language-detection-algorithms</id>
    <content type="html">&lt;p&gt;(See a demo &lt;a href=&quot;http://babel-fett.heroku.com/&quot;&gt;here&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;While working on a Twitter sentiment analysis project, I ran into the problem of needing to filter out all non-English tweets. (Asking the Twitter API for English-only tweets doesn&amp;#8217;t seem to work, as it nonetheless returns tweets in Spanish, Portuguese, Dutch, Russian, and a couple other languages.)&lt;/p&gt;

&lt;p&gt;Since I didn&amp;#8217;t have any labeled data, I thought it would be fun to build an &lt;strong&gt;unsupervised&lt;/strong&gt; language classifier. In particular, using an EM algorithm to build a naive Bayes model of English vs. non-English n-gram probabilities turned out to work quite well, so here&amp;#8217;s a description.&lt;/p&gt;

&lt;h1&gt;EM Algorithm&lt;/h1&gt;

&lt;p&gt;Let&amp;#8217;s recall the naive Bayes algorithm: given a tweet (a set of &lt;em&gt;character&lt;/em&gt; n-grams), we estimate its language to be the language $L$ that maximizes&lt;/p&gt;

&lt;p&gt;$$P(language = L | ngrams) \propto P(ngrams | language = L) P(language = L)$$&lt;/p&gt;

&lt;p&gt;Thus, we need to estimate $P(ngram | language = L)$ and $P(language = L)$.&lt;/p&gt;

&lt;p&gt;This would be easy &lt;strong&gt;if we knew the language of each tweet&lt;/strong&gt;, since we could estimate&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$P(xyz| language = English)$ as #(number of times &amp;#8220;xyz&amp;#8221; is a trigram in the English tweets) / #(total trigrams in the English tweets)&lt;/li&gt;
&lt;li&gt;$P(language = English)$ as the proportion of English tweets.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Or, it would also be easy &lt;strong&gt;if we knew the n-gram probabilities for each language&lt;/strong&gt;, since we could use Bayes&amp;#8217; theorem to compute the language &lt;em&gt;probabilities&lt;/em&gt; for each tweet, and then take a weighted variant of the previous paragraph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem is that we know neither of these.&lt;/strong&gt; So what the EM algorithm says is that we can simply &lt;strong&gt;guess&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pretend we know the language of each tweet (by randomly assigning them at the beginning).&lt;/li&gt;
&lt;li&gt;Using this guess, we can compute the n-gram probabilities for each language.&lt;/li&gt;
&lt;li&gt;Using the n-gram probabilities for each language, we can recompute the language probabilities of each tweet.&lt;/li&gt;
&lt;li&gt;Using these recomputed language probabilities, we can recompute the n-gram probabilities.&lt;/li&gt;
&lt;li&gt;And so on, recomputing the language probabilities and n-gram probabilities over and over. While our guesses will be off in the beginning, the probabilities will eventually converge to (locally) maximize the likelihood. (In my tests, my language detector would sometimes correctly converge to an English detector, and sometimes it would converge to an English-and-Dutch detector.)&lt;/li&gt;
&lt;/ul&gt;
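&lt;p&gt;The loop above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the Ruby implementation mentioned later in the post: the function names, the add-one smoothing, and the fixed iteration count are all choices made for this example.&lt;/p&gt;

```python
import math
import random

def char_ngrams(text, n=3):
    """Return the list of character n-grams in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def m_step(tweets, responsibilities, n=3, smoothing=1.0):
    """Re-estimate P(language) and P(ngram | language) from soft assignments."""
    num_langs = len(responsibilities[0])
    counts = [dict() for _ in range(num_langs)]
    totals = [0.0] * num_langs
    vocab = set()
    for tweet, resp in zip(tweets, responsibilities):
        for gram in char_ngrams(tweet, n):
            vocab.add(gram)
            for lang in range(num_langs):
                counts[lang][gram] = counts[lang].get(gram, 0.0) + resp[lang]
                totals[lang] += resp[lang]
    # Smoothed priors, so that no language ever gets probability zero.
    priors = [(sum(r[lang] for r in responsibilities) + smoothing)
              / (len(tweets) + smoothing * num_langs)
              for lang in range(num_langs)]

    def prob(lang, gram):
        # Add-one style smoothing (an assumed choice) keeps unseen n-grams nonzero.
        return ((counts[lang].get(gram, 0.0) + smoothing)
                / (totals[lang] + smoothing * (len(vocab) + 1)))

    return priors, prob

def e_step(tweets, priors, prob, n=3):
    """Recompute P(language | tweet) for every tweet via Bayes' theorem."""
    responsibilities = []
    for tweet in tweets:
        log_scores = [math.log(prior)
                      + sum(math.log(prob(lang, g)) for g in char_ngrams(tweet, n))
                      for lang, prior in enumerate(priors)]
        top = max(log_scores)  # subtract the max for numerical stability
        weights = [math.exp(score - top) for score in log_scores]
        total = sum(weights)
        responsibilities.append([w / total for w in weights])
    return responsibilities

def run_em(tweets, num_langs=2, n=3, iterations=20, seed=0):
    """Start from random hard assignments, then alternate M and E steps."""
    rng = random.Random(seed)
    responsibilities = []
    for _ in tweets:
        guess = rng.randrange(num_langs)
        responsibilities.append([1.0 if lang == guess else 0.0
                                 for lang in range(num_langs)])
    for _ in range(iterations):
        priors, prob = m_step(tweets, responsibilities, n)
        responsibilities = e_step(tweets, priors, prob, n)
    return responsibilities
```

&lt;p&gt;Calling &lt;code&gt;run_em(tweets)&lt;/code&gt; on a list of strings returns one list of language probabilities per tweet. Which column ends up meaning &amp;#8220;English&amp;#8221; depends on the random start, and as noted above the result is only a local optimum.&lt;/p&gt;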


&lt;h2&gt;EM Analogy for the Layman&lt;/h2&gt;

&lt;p&gt;Why does this work? Suppose you suddenly move to New York, and you want a way to differentiate between tourists and New Yorkers based on their activities. Initially, you don&amp;#8217;t know who&amp;#8217;s a tourist and who&amp;#8217;s a New Yorker, and you don&amp;#8217;t know which are touristy activities and which are not. So you randomly place people into two groups A and B. (You randomly assign all tweets to a language.)&lt;/p&gt;

&lt;p&gt;Now, given all the people in group A, you notice that a large number of them visit the Statue of Liberty; similarly, you notice that a large number of people in group B walk really quickly. (You notice that one set of words often has the n-gram &amp;#8220;ing&amp;#8221;, and that another set of words often has the n-gram &amp;#8220;ias&amp;#8221;; that is, you fix the language probabilities for each tweet, and recompute the n-gram probabilities for each language.)&lt;/p&gt;

&lt;p&gt;So you start to put people visiting the Statue of Liberty in group A, and you start to put fast walkers in group B. (You fix the n-gram probabilities for each language, and recompute the language probabilities for each tweet.)&lt;/p&gt;

&lt;p&gt;With your new A and B groups, you notice more differentiating factors: group A people tend to carry along cameras, and group B people tend to be more finance-savvy.&lt;/p&gt;

&lt;p&gt;So you start to put camera-carrying folks in group A, and finance-savvy folks in group B.&lt;/p&gt;

&lt;p&gt;And so on. Eventually, you settle on two groups of people and differentiating activities: people who walk slowly and visit the Statue of Liberty, and busy-looking people who walk fast and don&amp;#8217;t visit. Assuming there are more native New Yorkers than tourists, you can then guess that the natives are the larger group.&lt;/p&gt;

&lt;h1&gt;Results&lt;/h1&gt;

&lt;p&gt;I wrote some Ruby code to implement the above algorithm, and trained it on half a million tweets, using English and &amp;#8220;not English&amp;#8221; as my two languages. The results looked surprisingly good from just eyeballing:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://img.skitch.com/20110303-qfrnb8gstgheh4xech4iutfskd.jpg&quot;&gt;&lt;img src=&quot;https://img.skitch.com/20110303-qfrnb8gstgheh4xech4iutfskd.jpg&quot; alt=&quot;Example Results&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But in order to get some hard metrics and to tune parameters (e.g., n-gram size), I needed a labeled dataset. So I pulled a set of English-language and Spanish-language documents from Project Gutenberg, and split them to form training and test sets (the training set consisted of 2000 lines of English and 1000 lines of Spanish, and the test set of 1000 lines of English and 1000 lines of Spanish).&lt;/p&gt;

&lt;p&gt;Trained on bigrams, the detector resulted in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;991 true positives (English lines correctly classified as English)&lt;/li&gt;
&lt;li&gt;9 false negatives (English lines incorrectly classified as Spanish)&lt;/li&gt;
&lt;li&gt;11 false positives (Spanish lines incorrectly classified as English)&lt;/li&gt;
&lt;li&gt;989 true negatives (Spanish lines correctly classified as Spanish)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;for a precision of 0.989 and a recall of 0.991.&lt;/p&gt;

&lt;p&gt;Trained on trigrams, the detector resulted in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;992 true positives&lt;/li&gt;
&lt;li&gt;8 false negatives&lt;/li&gt;
&lt;li&gt;10 false positives&lt;/li&gt;
&lt;li&gt;990 true negatives&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;for a precision of 0.990 and a recall of 0.992.&lt;/p&gt;
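&lt;p&gt;The precision and recall figures follow directly from the counts listed above, treating English as the positive class. Here is a short check; the helper name &lt;code&gt;precision_recall&lt;/code&gt; is just for this example.&lt;/p&gt;

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return round(precision, 3), round(recall, 3)

bigram_scores = precision_recall(tp=991, fp=11, fn=9)
trigram_scores = precision_recall(tp=992, fp=10, fn=8)
print(bigram_scores, trigram_scores)
```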

&lt;p&gt;Also, when I looked at the sentences the detector was making errors on, I saw that they almost always consisted of only one or two words (e.g., the incorrectly classified sentences were lines like &amp;#8220;inmortal&amp;#8221;, &amp;#8220;autumn&amp;#8221;, and &amp;#8220;salir&amp;#8221;). So the detector pretty much never made a mistake on a normal sentence!&lt;/p&gt;

&lt;h1&gt;Code/Demo&lt;/h1&gt;

&lt;p&gt;I put the code on &lt;a href=&quot;https://github.com/echen/unsupervised-language-identification&quot;&gt;my GitHub account&lt;/a&gt;, and a quick &lt;a href=&quot;http://babel-fett.heroku.com/&quot;&gt;demo app&lt;/a&gt;, trained on trigrams from tweets with lang=&amp;quot;en&amp;quot; according to the Twitter API, is &lt;a href=&quot;http://babel-fett.heroku.com/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>Choosing a Machine Learning Classifier</title>
    <link href="http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier/"/>
    <updated>2011-04-27T18:43:15-07:00</updated>
    <id>http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier</id>
    <content type="html">&lt;p&gt;How do you know what machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you&amp;#8217;re simply looking for a &amp;#8220;good enough&amp;#8221; algorithm for your problem, or a place to start, here are some general guidelines I&amp;#8217;ve found to work well over the years.&lt;/p&gt;

&lt;h1&gt;How large is your training set?&lt;/h1&gt;

&lt;p&gt;If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren&amp;#8217;t powerful enough to provide accurate models.&lt;/p&gt;

&lt;p&gt;You can also think of this as a generative model vs. discriminative model distinction.&lt;/p&gt;

&lt;h1&gt;Advantages of some particular algorithms&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Advantages of Naive Bayes:&lt;/strong&gt; Super simple, you&amp;#8217;re just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge more quickly than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn&amp;#8217;t hold, a NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well. Its main disadvantage is that it can&amp;#8217;t learn interactions between features (e.g., it can&amp;#8217;t learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they&amp;#8217;re together).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages of Logistic Regression:&lt;/strong&gt; Lots of ways to regularize your model, and you don&amp;#8217;t have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you&amp;#8217;re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages of Decision Trees:&lt;/strong&gt; Easy to interpret and explain (for some people &amp;#8211; I&amp;#8217;m not sure I fall into this camp). They easily handle feature interactions and they&amp;#8217;re non-parametric, so you don&amp;#8217;t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don&amp;#8217;t support online learning, so you have to rebuild your tree when new examples come in. Another disadvantage is that they easily overfit, but that&amp;#8217;s where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they&amp;#8217;re fast and scalable, and you don&amp;#8217;t have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages of SVMs:&lt;/strong&gt; High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn&amp;#8217;t linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive, hard to interpret, and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.&lt;/p&gt;

&lt;h1&gt;But&amp;#8230;&lt;/h1&gt;

&lt;p&gt;Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, then whichever classification algorithm you use might not matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).&lt;/p&gt;

&lt;p&gt;And to reiterate what I said above, if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize (and Middle Earth), just use an ensemble method to choose them all.&lt;/p&gt;
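&lt;p&gt;Here is a minimal sketch of what selecting a classifier by cross-validation might look like, using only toy pieces: a majority-class baseline (high bias) and a one-feature nearest-neighbor rule (low bias). All names, the round-robin fold split, and the tiny dataset are assumptions made for illustration; in practice you would plug in real classifiers and real data.&lt;/p&gt;

```python
def k_fold_indices(n, k):
    """Split range(n) into k folds of near-equal size (round-robin)."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_val_accuracy(train_fn, xs, ys, k=4):
    """Average held-out accuracy of a classifier over k folds."""
    accuracies = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = train_fn(train_x, train_y)
        correct = sum(predict(xs[i]) == ys[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)

def train_majority(train_x, train_y):
    """High-bias baseline: always predict the most common training label."""
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority

def train_nearest_neighbor(train_x, train_y):
    """Low-bias 1-NN on a single numeric feature."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return predict

# Toy data: class A at both ends of the feature, class B in the middle.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
ys = ["A", "A", "A", "A", "B", "B", "B", "B", "A", "A", "A", "A"]

candidates = {"majority": train_majority, "1-NN": train_nearest_neighbor}
scores = {name: cross_val_accuracy(fn, xs, ys, k=4)
          for name, fn in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```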
</content>
  </entry>
  
  <entry>
    <title>Kickstarter Data Analysis: Success and Pricing</title>
    <link href="http://blog.echen.me/2011/04/25/kickstarter-data-analysis-success-and-pricing/"/>
    <updated>2011-04-25T21:19:40-07:00</updated>
    <id>http://blog.echen.me/2011/04/25/kickstarter-data-analysis-success-and-pricing</id>
    <content type="html">&lt;p&gt;&lt;a href=&quot;http://www.kickstarter.com/&quot;&gt;Kickstarter&lt;/a&gt; is an online crowdfunding platform for launching creative projects. When starting a new project, project owners specify a deadline and the minimum amount of money they need to raise. They receive the money (less a transaction fee) only if they reach or exceed that minimum; otherwise, no money changes hands.&lt;/p&gt;

&lt;p&gt;What&amp;#8217;s particularly fun about Kickstarter is that in contrast to &lt;a href=&quot;http://www.kiva.org/&quot;&gt;that other microfinance site&lt;/a&gt;, Kickstarter projects don&amp;#8217;t ask for loans; instead, patrons receive pre-specified rewards unique to each project. For example, someone donating money to help an artist record an album might receive a digital copy of the album if they donate 20 dollars, or a digital copy plus a signed physical CD if they donate 50 dollars.&lt;/p&gt;

&lt;p&gt;There are &lt;a href=&quot;http://www.kickstarter.com/discover/hall-of-fame?ref=sidebar&quot;&gt;a bunch&lt;/a&gt; of &lt;a href=&quot;http://www.kickstarter.com/projects/1104350651/tiktok-lunatik-multi-touch-watch-kits&quot;&gt;neat&lt;/a&gt; &lt;a href=&quot;http://www.kickstarter.com/projects/2024077040/neil-gaimans-the-price&quot;&gt;projects&lt;/a&gt;, and I&amp;#8217;m tempted to put one of my own on there soon, so I thought it would be fun to gather some data from the site and see what makes a project successful.&lt;/p&gt;

&lt;h1&gt;Categories&lt;/h1&gt;

&lt;p&gt;I started by scraping the categories section.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/successful-projects-by-category.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/successful-projects-by-category.png&quot; alt=&quot;Successful projects by category&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In true indie fashion, the artsy categories tend to dominate. (I&amp;#8217;m surprised/disappointed how little love the Technology category gets.)&lt;/p&gt;

&lt;h1&gt;Ending Soon&lt;/h1&gt;

&lt;p&gt;The categories section really only provides a history of &lt;em&gt;successful&lt;/em&gt; projects, though, so to get some data on unsuccessful projects as well, I took a look at the &lt;a href=&quot;http://www.kickstarter.com/discover/ending-soon?ref=sidebar&quot;&gt;Ending Soon&lt;/a&gt; section of projects whose deadlines are about to pass.&lt;/p&gt;

&lt;p&gt;It looks like about 50% of all Kickstarter projects get successfully funded by the deadline:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/ending-soon-success.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/ending-soon-success.png&quot; alt=&quot;Successful projects as deadline approaches&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interestingly, most of the final funding seems to happen in the final few days: with just 5 days left, only about 20% of all projects have been funded. (In other words, with just 5 days left, 60% of the projects that will eventually be successful are still unfunded.) So the approaching deadline seems to really spur people to donate. I wonder if it&amp;#8217;s because of increased publicity in the final few days (the project owners begging everyone for help!) or if it&amp;#8217;s simply procrastination in action (perhaps people want to wait to see if their donation is really necessary)?&lt;/p&gt;
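&lt;p&gt;The 60% figure is just the gap between the two approximate percentages, taken as a share of the projects that eventually succeed. A quick check, with variable names chosen for this example:&lt;/p&gt;

```python
funded_by_deadline = 50        # approx. percent of all projects funded by the deadline
funded_with_5_days_left = 20   # approx. percent of all projects funded with 5 days left

# Share of eventually-successful projects that are still unfunded with 5 days left.
still_unfunded_share = (funded_by_deadline - funded_with_5_days_left) / funded_by_deadline
print(f"{still_unfunded_share:.0%}")
```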

&lt;p&gt;Lesson: if you&amp;#8217;re still not fully funded with only a couple days remaining, don&amp;#8217;t despair.&lt;/p&gt;

&lt;h1&gt;Success vs. Failure&lt;/h1&gt;

&lt;p&gt;What factors lead a project to succeed? Are there any quantitative differences between projects that eventually get funded and those that don&amp;#8217;t?&lt;/p&gt;

&lt;p&gt;Two simple (if kind of obvious) things I noticed are that unsuccessful projects tend to require a larger amount of money:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/successful-vs-unsuccessful-goal.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/successful-vs-unsuccessful-goal.png&quot; alt=&quot;Unsuccessful projects tend to ask for more money&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and unsuccessful projects also tend to raise less money in absolute terms (i.e., it&amp;#8217;s not just that they ask for too much money to reach their goal &amp;#8211; they&amp;#8217;re simply not receiving enough money as well):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/successful-vs-unsuccessful-amount-pledged.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/successful-vs-unsuccessful-amount-pledged.png&quot; alt=&quot;Unsuccessful projects received less money&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not terribly surprising, but it&amp;#8217;s good to confirm (and I&amp;#8217;m still working on finding other predictors).&lt;/p&gt;

&lt;h1&gt;Pledge Rewards&lt;/h1&gt;

&lt;p&gt;There&amp;#8217;s a lot of interesting work in behavioral economics on pricing and choice &amp;#8211; for example, the &lt;a href=&quot;http://youarenotsosmart.com/2010/07/27/anchoring-effect/&quot;&gt;anchoring effect&lt;/a&gt; suggests that when building a menu, you should &lt;a href=&quot;http://www.neurosciencemarketing.com/blog/articles/neuro-menus-and-restaurant-psychology.htm&quot;&gt;include an expensive item&lt;/a&gt; to make other menu items look reasonably priced in comparison, and the &lt;a href=&quot;http://en.wikipedia.org/wiki/The_Paradox_of_Choice:_Why_More_Is_Less&quot;&gt;paradox of choice&lt;/a&gt; suggests that too many choices lead to a decision freeze &amp;#8211; so one aspect of the Kickstarter data I was especially interested in was how pricing of rewards affects donations. For example, does pricing the lowest reward at 25 dollars lead to more money donated (people don&amp;#8217;t lowball at 5 dollars instead) or less money donated (25 dollars is more money than most people are willing to give)? And what happens if a new reward at 5 dollars is added &amp;#8211; again, does it lead to more money (now people can donate something they can afford) or less money (the people that would have paid 25 dollars switch to a 5 dollar donation)?&lt;/p&gt;

&lt;p&gt;First, here&amp;#8217;s a look at the total number of pledges at each price. (More accurately, it&amp;#8217;s the number of claimed rewards at each price.) [Update: the original version of this graph was wrong, but I&amp;#8217;ve since fixed it.]&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/pledge%20amounts.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/pledge%20amounts.png&quot; alt=&quot;Pledge Amounts&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Surprisingly, 5 dollar and 1 dollar donations are actually not the most common contribution.&lt;/p&gt;

&lt;p&gt;To investigate pricing effects, I started by looking at all (successful) projects that had a reward priced at 1 dollar, and compared the number of donations at 1 dollar with the number of donations at the next lowest reward.&lt;/p&gt;

&lt;p&gt;Up to about 15-20 dollars, there&amp;#8217;s a steady increase in the proportion of people who choose the second reward over the first reward, but after that, the proportion decreases.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/anchoring.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/anchoring.png&quot; alt=&quot;Anchoring&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/anchoring-abline-b.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/kickstarter/anchoring-abline-b.png&quot; alt=&quot;Anchoring with Regression Lines&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So this perhaps suggests that if you&amp;#8217;re going to price your lowest reward at 1 dollar, your next reward should cost roughly 20 dollars (or slightly more, to maximize your total revenue). Pricing above 20 dollars is a little too expensive for the folks who want to support you, but aren&amp;#8217;t rich enough to throw gads of money; maybe rewards below 20 dollars aren&amp;#8217;t good enough to merit the higher donation.&lt;/p&gt;

&lt;p&gt;Next, I&amp;#8217;m planning on digging a little deeper into pricing effects and what makes a project successful, so I&amp;#8217;ll hopefully have some more Kickstarter analysis in a future post. In the meantime, in case anyone else wants to take a look, I put the data onto &lt;a href=&quot;https://github.com/echen/kickstarter-data-analysis&quot;&gt;my GitHub account&lt;/a&gt;.&lt;/p&gt;
</content>
  </entry>
  
  <entry>
    <title>A Mathematical Introduction to Least Angle Regression</title>
    <link href="http://blog.echen.me/2011/04/21/a-mathematical-introduction-to-least-angle-regression/"/>
    <updated>2011-04-21T00:16:36-07:00</updated>
    <id>http://blog.echen.me/2011/04/21/a-mathematical-introduction-to-least-angle-regression</id>
    <content type="html">&lt;p&gt;(For a layman&amp;#8217;s introduction, see &lt;a href=&quot;http://blog.echen.me/2011/03/14/least-angle-regression-for-the-hungry-layman/&quot;&gt;here&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;Least Angle Regression (aka LARS) is a &lt;strong&gt;model selection method&lt;/strong&gt; for linear regression (when you&amp;#8217;re worried about overfitting or want your model to be easily interpretable). To motivate it, let&amp;#8217;s consider some other model selection methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Forward selection&lt;/strong&gt; starts with no variables in the model, and at each step it adds to the model the variable with the most explanatory power, stopping if the explanatory power falls below some threshold. This is a fast and simple method, but it can also be too greedy: we fully add variables at each step, so correlated predictors don&amp;#8217;t get much of a chance to be included in the model. (For example, suppose we want to build a model for the deliciousness of a PB&amp;amp;J sandwich, and two of our variables are the amount of peanut butter and the amount of jelly. We&amp;#8217;d like both variables to appear in our model, but since amount of peanut butter is (let&amp;#8217;s assume) strongly correlated with the amount of jelly, once we fully add peanut butter to our model, jelly doesn&amp;#8217;t add much explanatory power anymore, and so it&amp;#8217;s unlikely to be added.)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Forward stagewise regression&lt;/strong&gt; tries to remedy the greediness of forward selection by only partially adding variables. Whereas forward selection finds the variable with the most explanatory power and goes all out in adding it to the model, forward stagewise finds the variable with the most explanatory power and updates its weight by only epsilon in the correct direction. (So we might first increase the weight of peanut butter a little bit, then increase the weight of peanut butter again, then increase the weight of jelly, then increase the weight of bread, and then increase the weight of peanut butter once more.) The problem now is that we have to make a ton of updates, so forward stagewise can be very inefficient.&lt;/li&gt;
&lt;/ul&gt;
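
&lt;p&gt;To make the forward stagewise update concrete, here is a minimal Python sketch. The function name, the fixed number of steps (used here instead of an explanatory-power threshold), and the toy data are all assumptions for illustration; it also assumes the predictors are standardized and the response is centered.&lt;/p&gt;

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward_stagewise(columns, y, eps=0.01, n_steps=400):
    """columns: list of standardized predictor columns; y: centered response."""
    weights = [0.0] * len(columns)
    residual = list(y)
    for _ in range(n_steps):
        # Inner product of each predictor with the current residual
        corr = [dot(col, residual) for col in columns]
        # Pick the predictor with the most explanatory power right now
        j = max(range(len(columns)), key=lambda k: abs(corr[k]))
        # Nudge its weight by eps in the direction of that correlation
        step = math.copysign(eps, corr[j])
        weights[j] += step
        residual = [r - step * x for r, x in zip(residual, columns[j])]
    return weights

# Hypothetical toy data: y = 2 * x1 + 1 * x2, with uncorrelated predictors
x1 = [1.0, -1.0, 1.0, -1.0]
x2 = [1.0, 1.0, -1.0, -1.0]
y = [2 * a + b for a, b in zip(x1, x2)]
weights = forward_stagewise([x1, x2], y)
print([round(w, 2) for w in weights])
```

&lt;p&gt;Even on this tiny example the loop needs hundreds of updates to get close to the least-squares weights, which is the inefficiency described above.&lt;/p&gt;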


&lt;p&gt;LARS, then, is essentially forward stagewise made fast. Instead of making tiny hops in the direction of one variable at a time, LARS makes optimally-sized leaps in optimal directions. These directions are chosen to make equal angles (equal correlations) with each of the variables currently in our model. (We like peanut butter best, so we start eating it first; as we eat more, we get a little sick of it, so jelly starts looking equally appetizing, and we start eating peanut butter and jelly simultaneously; later, we add bread to the mix, etc.)&lt;/p&gt;

&lt;p&gt;In more detail, LARS works as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assume for simplicity that we&amp;#8217;ve standardized our explanatory variables to have zero mean and unit variance, and that our response variable also has zero mean.&lt;/li&gt;
&lt;li&gt;Start with no variables in our model.&lt;/li&gt;
&lt;li&gt;Find the variable $ x_1 $ most correlated with the residual. (Note that the variable most correlated with the residual is equivalently the one that makes the least angle with the residual, whence the name.)&lt;/li&gt;
&lt;li&gt;Move in the direction of this variable until some other variable $ x_2 $ is just as correlated with the residual.&lt;/li&gt;
&lt;li&gt;At this point, start moving in a direction such that the residual stays equally correlated with $ x_1 $ and $ x_2 $ (i.e., so that the residual makes equal angles with both variables), and keep moving until some variable $ x_3 $ becomes equally correlated with our residual.&lt;/li&gt;
&lt;li&gt;And so on, stopping when we&amp;#8217;ve decided our model is big enough.&lt;/li&gt;
&lt;/ul&gt;
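
&lt;p&gt;The steps above can be sketched in plain Python roughly as follows. This is a simplified sketch under stated assumptions, not the reference implementation from the paper: the names &lt;code&gt;lars&lt;/code&gt;, &lt;code&gt;solve&lt;/code&gt; and &lt;code&gt;dot&lt;/code&gt; and the toy data are made up for illustration, the active columns are assumed to be linearly independent, and inner products with the residual stand in for correlations (equivalent up to scaling once the variables are standardized).&lt;/p&gt;

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(G, b):
    """Solve G w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(G)]
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[pivot] = M[pivot], M[i]
        for r in range(n):
            if r != i:
                f = M[r][i] / M[i][i]
                M[r] = [a - f * c for a, c in zip(M[r], M[i])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lars(columns, y, n_vars):
    """Return (active set, fitted vector) after each step of plain LARS."""
    n, p = len(y), len(columns)
    mu = [0.0] * n
    active, path = [], []
    for _ in range(n_vars):
        residual = [yi - mi for yi, mi in zip(y, mu)]
        corr = [dot(col, residual) for col in columns]
        C = max(abs(c) for c in corr)
        if not active:
            # First step: the variable most correlated with the residual
            active.append(max(range(p), key=lambda j: abs(corr[j])))
        # Flip signs so every active column has correlation +C
        signed = [[math.copysign(1.0, corr[j]) * v for v in columns[j]]
                  for j in active]
        G = [[dot(a, b) for b in signed] for a in signed]
        w = solve(G, [1.0] * len(active))
        A = 1.0 / math.sqrt(sum(w))
        # Unit vector making equal angles with all signed active columns
        u = [A * sum(wk * col[i] for wk, col in zip(w, signed))
             for i in range(n)]
        # Default step: go all the way to the least-squares fit on the
        # active set; stop earlier if an inactive variable ties first
        gamma, entering = C / A, None
        for j in range(p):
            if j in active:
                continue
            a_j = dot(columns[j], u)
            for num, den in ((C - corr[j], A - a_j), (C + corr[j], A + a_j)):
                if den > 1e-12 and gamma > num / den > 1e-12:
                    gamma, entering = num / den, j
        mu = [m + gamma * ui for m, ui in zip(mu, u)]
        path.append((list(active), list(mu)))
        if entering is None:
            break
        active.append(entering)
    return path

# Hypothetical toy data: two standardized predictors, y = 2 * x1 + 1 * x2
x1 = [1.0, -1.0, 1.0, -1.0]
x2 = [1.0, 1.0, -1.0, -1.0]
y = [2 * a + b for a, b in zip(x1, x2)]
for active, mu in lars([x1, x2], y, n_vars=2):
    print(active, [round(m, 3) for m in mu])
```

&lt;p&gt;Each pass through the loop is one of the &amp;#8220;leaps&amp;#8221;: the step length is chosen so that we stop exactly when the next variable ties with the ones already in the model, rather than creeping there in epsilon-sized hops.&lt;/p&gt;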


&lt;p&gt;For example, consider the following image (slightly simplified from the &lt;a href=&quot;http://www.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf&quot;&gt;original LARS paper&lt;/a&gt;; $x_1, x_2$ are our variables, and $y$ is our response):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://dl.dropbox.com/u/10506/blog/lars/lars-example.png&quot;&gt;&lt;img src=&quot;http://dl.dropbox.com/u/10506/blog/lars/lars-example.png&quot; alt=&quot;LARS Example&quot; /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our model starts at $ \hat{\mu}_0 $.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The residual (the green line) makes a smaller angle with $ x_1 $ than with $ x_2 $, so we start moving in the direction of $ x_1 $.&lt;/li&gt;
&lt;li&gt;At $ \hat{\mu}_1 $, the residual now makes equal angles with $ x_1 $ and $ x_2 $, and so we start moving in a new direction that preserves this equiangularity/equicorrelation.&lt;/li&gt;
&lt;li&gt;If there were more variables, we&amp;#8217;d change directions again once a new variable made equal angles with our residual, and so on.&lt;/li&gt;
&lt;/ul&gt;
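
&lt;p&gt;As a rough worked version of the picture (assuming, as the image suggests, that $ x_1 $ and $ x_2 $ have unit length and are both positively correlated with the residual), the first step moves along $ x_1 $ by some amount $ \gamma_1 $, and the second direction $ u_2 $ is the unit vector bisecting $ x_1 $ and $ x_2 $. The symbols $ \gamma_1 $ and $ u_2 $ are labels introduced here for illustration.&lt;/p&gt;

&lt;p&gt;$$ \hat{\mu}_1 = \hat{\mu}_0 + \gamma_1 x_1, \qquad u_2 = \frac{x_1 + x_2}{\| x_1 + x_2 \|}, \qquad x_1^T u_2 = x_2^T u_2 = \frac{1 + x_1^T x_2}{\| x_1 + x_2 \|} $$&lt;/p&gt;

&lt;p&gt;Because $ u_2 $ has the same inner product with both variables, moving along it shrinks both correlations with the residual at the same rate, which is why they stay tied.&lt;/p&gt;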


&lt;p&gt;So when should you use LARS, as opposed to some other regularization method like lasso? There&amp;#8217;s not really a clear-cut answer, but LARS tends to give results very similar to those of both lasso and forward stagewise (in fact, slight modifications to LARS give you lasso and forward stagewise), so I tend to just use lasso when I do these kinds of things, since the justifications for lasso make a little more sense to me. In fact, I don&amp;#8217;t usually even think of LARS as a model selection method in its own right, but rather as a way to efficiently implement lasso (especially if you want to compute the full regularization path).&lt;/p&gt;
</content>
  </entry>
  
</feed>
