Stuff Harvard People Like

What types of students go to which schools? There are, of course, the classic stereotypes:

• MIT has the hacker engineers.
• Stanford has the laid-back, social folks.
• Harvard has the prestigious leaders of the world.
• Berkeley has the activist hippies.
• Caltech has the hardcore science nerds.

But how well do these perceptions match reality? What are students at Stanford, Harvard, MIT, Caltech, and Berkeley really interested in? Following the path of my previous data-driven post on differences between Silicon Valley and NYC, I scraped the Quora profiles of a couple hundred followers of each school to find out.

Topics

So let’s look at what kinds of topics followers of each school are interested in*. (Skip past the lists for a discussion.)

MIT

Topics are followed by p(school = MIT|topic).

• MIT Media Lab 0.893
• Ksplice 0.69
• Lisp (programming language) 0.677
• Nokia 0.659
• Public Speaking 0.65
• Data Storage 0.65
• Google Voice 0.609
• Hacking 0.602
• Startups in Europe 0.597
• Startup Names 0.572
• Mechanical Engineering 0.563
• Engineering 0.563
• Distributed Databases 0.544
• StackOverflow 0.536
• Boston 0.513
• Learning 0.507
• Open Source 0.498
• Cambridge 0.496
• Public Relations 0.493
• Visualization 0.492
• Semantic Web 0.486
• Andreessen-Horowitz 0.483
• Nature 0.475
• Cryptography 0.474
• Startups in Boston 0.452
• Adobe Photoshop 0.451
• Computer Security 0.447
• Sachin Tendulkar 0.443
• Hacker News 0.442
• Games 0.429
• Android Applications 0.428
• Best Engineers and Programmers 0.427
• College Admissions & Getting Into College 0.422
• Co-Founders 0.419
• Big Data 0.41
• System Administration 0.4
• Biotechnology 0.398
• Higher Education 0.394
• NoSQL 0.387
• User Experience 0.386
• Career Advice 0.377
• Artificial Intelligence 0.375
• Scalability 0.37
• Taylor Swift 0.368
• Google Search 0.368
• Functional Programming 0.365
• Bing 0.363
• Bioinformatics 0.361
• How I Met Your Mother (TV series) 0.361
• Operating Systems 0.356
• Compilers 0.355
• Google Chrome 0.354
• Management & Organizational Leadership 0.35
• Literary Fiction 0.35
• Intelligence 0.348
• Fight Club (1999 movie) 0.344
• Hip Hop Music 0.34
• UX Design 0.337
• Web Application Frameworks 0.336
• Startups in New York City 0.333
• Book Recommendations 0.33
• Engineering Recruiting 0.33
• Search Engines 0.329
• Social Search 0.329
• Data Science 0.328
• History 0.328
• Interaction Design 0.326
• Classification (machine learning) 0.322
• Startup Incubators and Seed Programs 0.321
• Graphic Design 0.321
• Product Design (software) 0.319
• The College Experience 0.319
• Writing 0.319
• MapReduce 0.318
• Database Systems 0.315
• User Interfaces 0.314
• Literature 0.314
• C (programming language) 0.314
• Television 0.314
• Reading 0.313
• Usability 0.312
• Books 0.312
• Computers 0.311
• Stealth Startups 0.311
• Daft Punk 0.31
• Healthy Eating 0.309
• Innovation 0.309
• Skiing 0.305
• JavaScript 0.304
• Rock Music 0.304
• Mozilla Firefox 0.304
• Self-Improvement 0.303
• McKinsey & Company 0.302
• AngelList 0.301
• Data Visualization 0.301
• Cassandra (database) 0.301

Stanford

Topics are followed by p(school = Stanford|topic).

• Stanford Computer Science 0.951
• Stanford Graduate School of Business 0.939
• Stanford 0.896
• Stanford Football 0.896
• Stanford Cardinal 0.896
• Social Dance 0.847
• Stanford University Courses 0.847
• Romance 0.769
• Instagram 0.745
• College Football 0.665
• Mobile Location Applications 0.634
• Online Communities 0.621
• Interpersonal Relationships 0.585
• Food & Restaurants in Palo Alto 0.572
• Your 20s 0.566
• Men’s Fashion 0.548
• Flipboard 0.537
• Inception (2010 movie) 0.535
• Tumblr 0.531
• People Skills 0.522
• Exercise 0.52
• Joel Spolsky 0.516
• Valuations 0.515
• The Social Network (2010 movie) 0.513
• LeBron James 0.506
• Northern California 0.506
• Evernote 0.5
• Quora Community 0.5
• Blogging 0.49
• Downtown Palo Alto 0.487
• The College Experience 0.485
• Consumer Internet 0.477
• Restaurants in San Francisco 0.477
• Chad Hurley 0.47
• Meditation 0.468
• Yishan Wong 0.466
• Arrested Development (TV series) 0.463
• fbFund 0.457
• Best Engineers at X Company 0.451
• Language 0.45
• Words 0.448
• Happiness 0.447
• Path (company) 0.446
• Color Labs (startup) 0.446
• Palo Alto 0.445
• Woot.com 0.442
• Beer 0.442
• PayPal 0.441
• Women in Startups 0.438
• Techmeme 0.433
• Women in Engineering 0.428
• The Mission (San Francisco neighborhood) 0.427
• iPhone Applications 0.416
• Asana 0.413
• Monetization 0.412
• Repetitive Strain Injury (RSI) 0.4
• IDEO 0.398
• Spotify 0.397
• San Francisco Giants 0.396
• Fortune Magazine 0.389
• Love 0.387
• Human-Computer Interaction 0.382
• Hip Hop Music 0.378
• Self-Improvement 0.378
• Food in San Francisco 0.375
• Quora (company) 0.374
• Quora Infrastructure 0.373
• iPhone 0.371
• Square (company) 0.369
• Social Psychology 0.369
• Network Effects 0.366
• Chris Sacca 0.365
• Walt Mossberg 0.364
• Salesforce.com 0.362
• Sex 0.361
• Etiquette 0.361
• David Pogue 0.361
• Gowalla 0.36
• iOS Development 0.354
• Palantir Technologies 0.353
• Mobile Computing 0.347
• Sports 0.346
• Video Games 0.345
• Burning Man 0.345
• Engineering Management 0.343
• Cognitive Science 0.342
• Dating & Relationships 0.341
• Fred Wilson (venture investor) 0.337
• Taiwan 0.333
• Natural Language Processing 0.33
• Eric Schmidt 0.329
• Social Advice 0.329
• Engineering Recruiting 0.328
• Job Interviews 0.325
• Mobile Phones 0.324
• Twitter Inc. (company) 0.321
• Engineering in Silicon Valley 0.321
• San Francisco Bay Area 0.321
• Google Analytics 0.32
• Fashion 0.315
• Interaction Design 0.314
• Open Graph 0.313
• Drugs & Pharmaceuticals 0.312
• Electronic Music 0.312
• Facebook Inc. (company) 0.309
• Fitness 0.309
• YouTube 0.308
• TED Talks 0.308
• Freakonomics (2005 Book) 0.307
• Jack Dorsey 0.306
• Nutrition 0.305
• Puzzles 0.305
• Silicon Valley Mergers & Acquisitions 0.304
• Viral Growth & Analytics 0.304
• Amazon Web Services 0.304
• StumbleUpon 0.303
• Exceptional Comment Threads 0.303

Harvard

Topics are followed by p(school = Harvard|topic).

• Harvard Business School 0.968
• Harvard Business Review 0.922
• Harvard Square 0.912
• Harvard Law School 0.912
• Jimmy Fallon 0.899
• Boston Red Sox 0.658
• Klout 0.644
• Oprah Winfrey 0.596
• Ivanka Trump 0.587
• Dalai Lama 0.569
• Food in New York City 0.565
• U2 0.562
• TwitPic 0.534
• 37signals 0.522
• David Lynch (director) 0.512
• Al Gore 0.508
• TechStars 0.49
• Baseball 0.487
• Private Equity 0.471
• Classical Music 0.46
• Startups in New York City 0.458
• HootSuite 0.449
• Kiva 0.442
• Ultimate Frisbee 0.441
• Huffington Post 0.436
• New York City 0.433
• Charlie Cheever 0.433
• The New York Times 0.431
• Technology Journalism 0.431
• McKinsey & Company 0.427
• TweetDeck 0.422
• How Does X Work? 0.417
• Ashton Kutcher 0.414
• Coldplay 0.402
• Conan O’Brien 0.397
• Fast Company 0.397
• WikiLeaks 0.394
• Michael Jackson 0.389
• Guy Kawasaki 0.389
• Journalism 0.384
• Wall Street Journal 0.384
• Cambridge 0.371
• Seattle 0.37
• Cities & Metro Areas 0.357
• Boston 0.353
• Tim Ferriss (author) 0.35
• The New Yorker 0.343
• Law 0.34
• Mashable 0.338
• Politics 0.335
• The Economist 0.334
• Barack Obama 0.333
• Skiing 0.329
• McKinsey Quarterly 0.325
• Wired (magazine) 0.316
• Bill Gates 0.31
• Mad Men (TV series) 0.308
• India 0.306
• TED Talks 0.306
• Netflix 0.304
• Wine 0.303
• Angel Investors 0.302
• Facebook Ads 0.301

UC Berkeley

Topics are followed by p(school = Berkeley|topic).

• Berkeley 0.978
• California Golden Bears 0.91
• Internships 0.717
• Web Marketing 0.484
• Google Social Strategy 0.453
• Southwest Airlines 0.451
• WordPress 0.429
• Stock Market 0.429
• BMW (automobile) 0.428
• Web Applications 0.423
• Flickr 0.422
• Snowboarding 0.42
• Electronic Music 0.404
• MySQL 0.401
• Internet Advertising 0.399
• Search Engine Optimization (SEO) 0.398
• Yelp 0.396
• Groupon 0.393
• In-N-Out Burger 0.391
• The Matrix (1999 movie) 0.389
• Trading (finance) 0.385
• jQuery 0.381
• Hedge Funds 0.378
• Social Media Marketing 0.377
• San Francisco 0.376
• Stealth Startups 0.362
• Yahoo! 0.36
• Cascading Style Sheets 0.359
• Angel Investors 0.355
• UX Design 0.35
• StarCraft 0.348
• Los Angeles Lakers 0.347
• Mountain View 0.345
• How I Met Your Mother (TV series) 0.338
• Google+ 0.337
• Ruby on Rails 0.333
• Reading 0.333
• Social Media 0.326
• China 0.322
• Palantir Technologies 0.319
• Facebook Platform 0.315
• Basketball 0.315
• Education 0.314
• Business Development 0.312
• Online & Mobile Payments 0.305
• Restaurants in San Francisco 0.302
• Technology Companies 0.302
• Seth Godin 0.3

Caltech

Topics are followed by p(school = Caltech|topic).

• Pasadena 0.969
• Chess 0.748
• Table Tennis 0.671
• UCLA 0.67
• MacBook Pro 0.618
• Physics 0.618
• Haskell 0.582
• Los Angeles 0.58
• Electrical Engineering 0.567
• Star Trek (movie) 0.561
• Disruptive Technology 0.545
• Science 0.53
• Biology 0.526
• Quantum Mechanics 0.521
• LaTeX 0.514
• Mathematics 0.488
• xkcd 0.488
• Genetics & Heredity 0.487
• Chemistry 0.47
• Medicine & Healthcare 0.448
• Poker 0.445
• C++ (programming language) 0.442
• Data Structures 0.434
• Emacs 0.428
• MongoDB 0.423
• Neuroscience 0.404
• Science Fiction 0.4
• Mac OS X 0.394
• Board Games 0.387
• Computers 0.386
• Research 0.385
• Finance 0.385
• The Future 0.379
• Linux 0.378
• The Colbert Report 0.376
• The Beatles 0.374
• The Onion 0.365
• Ruby 0.363
• Cars & Automobiles 0.361
• Quantitative Finance 0.359
• Academia 0.359
• Law 0.355
• Cooking 0.354
• Psychology 0.349
• Eminem 0.347
• Football (Soccer) 0.346
• Computer Programming 0.343
• Algorithms 0.343
• Evolutionary Biology 0.337
• Behavioral Economics 0.335
• California 0.329
• Machine Learning 0.326
• Futurama 0.324
• Social Advice 0.324
• StarCraft II 0.319
• Job Interview Questions 0.318
• Game Theory 0.316
• This American Life 0.315
• Economics 0.314
• Vim 0.31
• Graduate School 0.309
• Git (revision control) 0.306
• Computer Science 0.303

What do we see?

• First, in a nice validation of this approach, we find that each school is interested in exactly the locations we’d expect: Caltech is interested in Pasadena and Los Angeles; MIT and Harvard are both interested in Boston and Cambridge (Harvard is interested in New York City as well); Stanford is interested in Palo Alto, Northern California, and San Francisco Bay Area; and Berkeley is interested in Berkeley, San Francisco, and Mountain View.
• More interestingly, let’s look at where each school likes to eat. Stereotypically, we expect Harvard, Stanford, and Berkeley students to be more outgoing and social, and MIT and Caltech students to be more introverted. This is indeed what we find:
• Harvard follows Food in New York City; Stanford follows Food & Restaurants in Palo Alto, Restaurants in San Francisco, and Food in San Francisco; and Berkeley follows Restaurants in San Francisco and In-N-Out Burger. In other words, Harvard, Stanford, and Berkeley love eating out.
• Caltech, on the other hand, loves Cooking, and MIT loves Healthy Eating – both signs, perhaps, of a preference for eating in.
• And what does each university use to quench its thirst? Harvard students like to drink wine (classy!), while Stanford students prefer beer (the social drink of choice).
• What about sports teams? MIT and Caltech couldn’t care less, though Harvard follows the Boston Red Sox, Stanford follows the San Francisco Giants (as well as their own Stanford Football and Stanford Cardinal), and Berkeley follows the Los Angeles Lakers (and the California Golden Bears).
• For sports themselves, MIT students like skiing; Stanford students like general exercise, fitness, and sports; Harvard students like baseball, ultimate frisbee, and skiing; and Berkeley students like snowboarding. Caltech, in a league of its own, enjoys table tennis and chess.
• What does each school think of social? Caltech students look for Social Advice. Berkeley students are interested in Social Media and Social Media Marketing. MIT, on the more technical side, wants Social Search. Stanford students, predictably, love the whole spectrum of social offerings, from Social Dance and The Social Network, to Social Psychology and Social Advice. (Interestingly, Caltech and Stanford are both interested in Social Advice, though I wonder if it’s for slightly different reasons.)
• What’s each school’s relationship with computers? Caltech students are interested in Computer Science, MIT hackers are interested in Computer Security, and Stanford students are interested in Human-Computer Interaction.
• Digging into the MIT vs. Caltech divide a little, we see that Caltech students really are more interested in the pure sciences (Physics, Science, Biology, Quantum Mechanics, Mathematics, Chemistry, etc.), while MIT students are more on the applied and engineering sides (Mechanical Engineering, Engineering, Distributed Databases, Cryptography, Computer Security, Biotechnology, Operating Systems, Compilers, etc.).
• Regarding programming languages, Caltech students love Haskell (hardcore purity!), while MIT students love Lisp.
• What does each school like to read, both offline and online? Caltech loves science fiction, xkcd, and The Onion; MIT likes Hacker News; Harvard loves journals, newspapers, and magazines (Huffington Post, the New York Times, Fortune, Wall Street Journal, the New Yorker, the Economist, and so on); and Stanford likes TechMeme.
• What movies and television shows does each school like to watch? Caltech likes Star Trek, the Colbert Report, and Futurama. MIT likes Fight Club (I don’t know what this has to do with MIT, though I will note that on my first day as a freshman in a new dorm, Fight Club was precisely the movie we all went to a lecture hall to see). Stanford likes The Social Network and Inception. Harvard, rather fittingly, likes Mad Men and TED Talks.
• Let’s look at the startups each school follows. MIT, of course, likes Ksplice. Berkeley likes Yelp and Groupon. Stanford likes just about every startup under the sun (Instagram, Flipboard, Tumblr, Path, Color Labs, etc.). And Harvard, that bastion of hard-won influence and prestige? To the surprise of precisely no one, Harvard enjoys Klout.

Let’s end with a summarized view of each school:

• Caltech is very much into the sciences (Physics, Biology, Quantum Mechanics, Mathematics, etc.), as well as many pretty nerdy topics (Star Trek, Science Fiction, xkcd, Futurama, StarCraft II, etc.).
• MIT is dominated by everything engineering and tech.
• Stanford loves relationships (interpersonal relationships, people skills, love, network effects, sex, etiquette, dating and relationships, romance), health and appearance (fashion, fitness, nutrition, happiness), and startups (Instagram, Flipboard, Path, Color Labs, etc.).
• Berkeley, sadly, is perhaps too large and diverse for an overall characterization.
• Harvard students are fascinated by famous figures (Jimmy Fallon, Oprah Winfrey, Ivanka Trump, Dalai Lama, David Lynch, Al Gore, Bill Gates, Barack Obama), and by prestigious newspapers, journals, and magazines (Fortune, the New York Times, the Wall Street Journal, the Economist, and so on). Other very fitting interests include Kiva, classical music, and Coldplay.

*I pulled about 400 followers from each school and added a couple of filters to try to ensure that followers were actual attendees of the schools, rather than general people simply interested in them. Topics are sorted using a naive Bayes score and filtered to have at least 5 counts. Also, a word of warning: my dataset was fairly small, and users on Quora are almost certainly not representative of their schools as a whole (though I tried to be rigorous with what I had).
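The footnote doesn’t spell out the exact scoring formula, but with roughly equal numbers of followers sampled per school, a naive Bayes score p(school|topic) reduces to a simple ratio of counts. Here’s a minimal sketch; the `follows` data and `min_count=2` threshold are made up for illustration (the post used a threshold of 5):

```python
from collections import Counter

# Hypothetical (school, topic) follow pairs; the real data came from
# scraped Quora profiles.
follows = [
    ("MIT", "Lisp"), ("MIT", "Lisp"), ("MIT", "Lisp"),
    ("Caltech", "Lisp"),
    ("MIT", "Physics"), ("Caltech", "Physics"), ("Caltech", "Physics"),
]

pair_counts = Counter(follows)
topic_counts = Counter(topic for _, topic in follows)

def school_given_topic(school, topic, min_count=2):
    """Naive-Bayes-style score p(school | topic), assuming a uniform
    prior over schools (i.e., similar sample sizes per school).
    Topics with fewer than min_count follows are filtered out."""
    if topic_counts[topic] < min_count:
        return None
    return pair_counts[(school, topic)] / topic_counts[topic]

print(school_given_topic("MIT", "Lisp"))  # 3/4 = 0.75
```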

Information Transmission in a Social Network: Dissecting the Spread of a Quora Post

tl;dr See this movie visualization for a case study on how a post propagates through Quora.

How does information spread through a network? Much of Quora’s appeal, after all, lies in its social graph – and when you’ve got a network of users, all broadcasting their activities to their neighbors, information can cascade in multiple ways. How do these social designs affect which users see what?

Think, for example, of what happens when your kid learns a new slang word at school. He doesn’t confine his use of the word to McKinley Elementary’s particular boundaries, between 9am and 3pm – he introduces it to his friends from other schools at soccer practice as well. A couple months later, he even says it at home for the first time; you like the word so much, you then start using it at work. Eventually, Justin Bieber uses the word in a song, at which point the word’s popularity really starts to explode.

So how does information propagate through a social network? What types of people does an answer on Quora reach, and how does it reach them? (Do users discover new answers individually, or are hubs of connectors more key?) How does the activity of a post on Quora rise and fall? (Submissions on other sites have limited lifetimes, fading into obscurity soon after an initial spike; how does that change when users are connected and every upvote can revive a post for someone else’s eyes?)

(I looked at Quora since I already had some data from there available, but I hope the lessons are fairly applicable to other social networks like Facebook, Twitter, and LinkedIn as well.)

To give an initial answer to some of these questions, I dug into one of my more popular posts, on a layman’s introduction to random forests.

Users, Topics

Before looking deeper into the voting dynamics of the post, let’s first get some background on what kinds of users the answer reached.

Here’s a graph of the topics that question upvoters follow. (Each node is a topic, and every time upvoter X follows both topics A and B, I add an edge between A and B.)
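The construction of that co-occurrence graph can be sketched in a few lines; the `upvoter_topics` data below is hypothetical, standing in for the real scraped follow lists:

```python
from collections import Counter
from itertools import combinations

# Hypothetical upvoter -> followed-topics mapping.
upvoter_topics = {
    "alice": ["Machine Learning", "Data Mining", "R"],
    "bob":   ["Machine Learning", "Startups"],
    "carol": ["Startups", "Angel Investing"],
}

# Every time an upvoter follows both topics A and B, add (or
# strengthen) the undirected edge A-B.
edges = Counter()
for topics in upvoter_topics.values():
    for a, b in combinations(sorted(topics), 2):
        edges[(a, b)] += 1

print(edges[("Data Mining", "Machine Learning")])  # 1
```

Edge weights then reflect how often two topics share a follower, which is what lets a layout algorithm pull the machine learning, startup, and general-intellectual topics into separate clusters.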

We can see from the graph that upvoters tend to be interested in three kinds of topics:

• Machine learning and other technical matters (the green cluster): Classification, Data Mining, Big Data, Information Retrieval, Analytics, Probability, Support Vector Machines, R, Data Science, …
• Startups/Silicon Valley (the red cluster): Facebook, Lean Startups, Investing, Seed Funding, Angel Investing, Technology Trends, Product Managment, Silicon Valley Mergers and Acquisitions, Asana, Social Games, Quora, Mark Zuckerberg, User Experience, Founders and Entrepreneurs, …
• General Intellectual Topics (the purple cluster): TED, Science, Book Recommendations, Philosophy, Politics, Self-Improvement, Travel, Life Hacks, …

Also, here’s the network of the upvoters themselves (there’s an edge between users A and B if A follows B):

We can see three main clusters of users:

• A large group in green centered around a lot of power users and Quora employees.
• A machine learning group of folks in orange centered around people like Oliver Grisel, Christian Langreiter, and Joseph Turian.
• A group of people following me, in purple.
• Plus some smaller clusters in blue and yellow. (There were also a bunch of isolated users, connected to no one, that I filtered out of the picture.)

Digging into how these topic and user graphs are related:

• The orange cluster of users is more heavily into machine learning: 79% of users in that cluster follow more green topics (machine learning and technical topics) than red and purple topics (startups and general intellectual matters).
• The green cluster of users is reversed: 77% of users follow more of the red and purple clusters of topics (on startups and general intellectual matters) than machine learning and technical topics.

More interestingly, though, we can ask: how do the connections between upvoters relate to the way the post spread?

Social Voting Dynamics

So let’s take a look. Here’s a visualization I made of upvotes on my answer across time (click here for a larger view).

To represent the social dynamics of these upvotes, I drew an edge from user A to user B if user A transmitted the post to user B through an upvote. (Specifically, I drew an edge from Alice to Bob if Bob follows Alice and Bob’s upvote appeared within five days of Alice’s upvote; this is meant to simulate the idea that Alice was the key intermediary between my post and Bob.)

Also,

• Green nodes are users with at least one upvote edge.
• Blue nodes are users who follow at least one of the topics the post is categorized under (i.e., users who probably discovered the answer by themselves).
• Red nodes are users with no connections and who do not follow any of the post’s topics (i.e., users whose path to the post remains mysterious).
• Users increase in size when they produce more connections.
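The edge rule above (draw Alice → Bob when Bob follows Alice and Bob’s upvote lands within five days of Alice’s) can be sketched like this; the users, dates, and follow relationships are hypothetical:

```python
from datetime import date, timedelta

# Hypothetical upvote dates and follow relationships.
upvotes = {  # user -> day of upvote
    "alice": date(2011, 2, 14),
    "bob":   date(2011, 2, 16),
    "carol": date(2011, 4, 30),
}
follows = {  # user -> set of users they follow
    "bob":   {"alice"},
    "carol": {"alice", "bob"},
}

# Draw an edge A -> B if B follows A and B's upvote came within five
# days after A's (A as the presumed intermediary to the post).
WINDOW = timedelta(days=5)
edges = [
    (a, b)
    for b, followees in follows.items()
    for a in followees
    if a in upvotes and timedelta(0) <= upvotes[b] - upvotes[a] <= WINDOW
]

print(edges)  # [('alice', 'bob')]
```

Here carol follows both earlier upvoters, but her upvote falls outside the five-day window, so she gets no incoming edge and would be colored blue or red by the rules above.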

Here’s a play-by-play of the video:

• On Feb 14 (the day I wrote the answer), there’s a flurry of activity.
• A couple of days later, Tracy Chou gives an upvote, leading to another spike in activity.
• Then all’s quiet until… bam! Alex Kamil leads to a surge of upvotes, and his upvote finds Ludi Rehak, who starts a small surge of her own. They’re quickly followed by Christian Langreiter, who starts a small revolution among a bunch of machine learning folks a couple days later.
• Then all is pretty calm again, until a couple months later when… bam! Aditya Sengupta brings in a smashing of his own followers, and his upvote makes its way to Marc Bodnick, who sets off a veritable storm of activity.

(Already we can see some relationships between the graph of user connections and the way the post propagated. Many of the users from the orange cluster, for example, come from Alex Kamil and Christian Langreiter’s upvotes, and many of the users from the green cluster come from Aditya Sengupta and Marc Bodnick’s upvotes. What’s interesting, though, is why the cluster of green users didn’t appear all at once, like the orange cluster did. People like Kah Seng Tay, Tracy Chou, Venkatesh Rao, and Chad Little upvoted the answer pretty early on, but it wasn’t until Aditya Sengupta’s upvote a couple months later that people like Marc Bodnick, Edmond Lau, and many of the other green users (who do indeed follow that first set of folks) discovered the answer. Did the post simply get lost in users’ feeds the first time around? Was the post perhaps ignored until it received enough upvotes to be considered worth reading? Are some users’ upvotes just trusted more than others’?)

For another view of the upvote dynamics, here’s a static visualization, where we can again easily see the clusters of activity:

Fin

There are still many questions it would be interesting to look at; for example,

• What differentiates users who sparked spikes of activity from users who didn’t? I don’t believe it’s simply number of followers, as many well-connected upvoters did not lead to cascades of shares. Does authority matter?
• How far can a post reach? Clearly, the post reached people more than one degree of separation away from me (where one degree of separation is a follower); what does the distribution of degrees look like? Is there any relationship between degree of separation and time of upvote?
• What can we say about the people who started following me after reading my answer? Are they fewer degrees of separation away? Are they more interested in machine learning? Have they upvoted any of my answers before? (Perhaps there’s a certain “threshold” of interestingness people need to overflow before they’re considered acceptable followees.)

But to summarize a bit what we’ve seen so far, here are some statistics on the role the social graph played in spreading the post:

• There are 5 clusters of activity after the initial post, sparked both by power users and less-connected folks. In an interesting cascade of information, some of these sparks led to further spikes in activity as well (as when Aditya Sengupta’s upvote found its way to Marc Bodnick, who set off even more activity).
• 35% of users made their way to my answer because of someone else’s upvote.
• Through these connections, the post reached a fair variety of users: 32% of upvoters don’t even follow any of the post’s topics.
• 77% of upvotes came more than two weeks after my answer appeared.
• If we look only at the upvoters who follow at least one of the post’s topics, 33% didn’t see my answer until someone else showed it to them. In other words, a full one-third of people who presumably would have been interested in my post anyway only found it because of their social network.

So it looks like the social graph played quite a large part in the post’s propagation, and I’ll end with a big shoutout to Stormy Shippy, who provided an awesome set of scripts I used to collect a lot of this data.

Introduction

Suppose you have the following set of sentences:

• I like to eat broccoli and bananas.
• I ate a banana and spinach smoothie for breakfast.
• Chinchillas and kittens are cute.
• My sister adopted a kitten yesterday.
• Look at this cute hamster munching on a piece of broccoli.

What is latent Dirichlet allocation? It’s a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like

• Sentences 1 and 2: 100% Topic A
• Sentences 3 and 4: 100% Topic B
• Sentence 5: 60% Topic A, 40% Topic B
• Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)
• Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret topic B to be about cute animals)

The question, of course, is: how does LDA perform this discovery?

LDA Model

In more detail, LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you

• Decide on the number of words N the document will have (say, according to a Poisson distribution).
• Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.
• Generate each word w_i in the document by:
  • First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).
  • Using the topic to generate the word itself (according to the topic’s multinomial distribution). For example, if we selected the food topic, we might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.

Example

Let’s make an example. According to the above process, when generating some particular document D, you might

• Pick 5 to be the number of words in D.
• Decide that D will be 1/2 about food and 1/2 about cute animals.
• Pick the first word to come from the food topic, which then gives you the word “broccoli”.
• Pick the second word to come from the cute animals topic, which gives you “panda”.
• Pick the third word to come from the cute animals topic, giving you “adorable”.
• Pick the fourth word to come from the food topic, giving you “cherries”.
• Pick the fifth word to come from the food topic, giving you “eating”.

So the document generated under the LDA model will be “broccoli panda adorable cherries eating” (note that LDA is a bag-of-words model).
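The generative process above can be sketched as a tiny sampler. The topics and word probabilities below are the hypothetical food/cute-animals ones from earlier; the exact words drawn depend on the random seed:

```python
import random

random.seed(0)

# Hypothetical topics, each a multinomial distribution over words.
topics = {
    "food":    {"broccoli": 0.3, "bananas": 0.15, "cherries": 0.3, "eating": 0.25},
    "animals": {"panda": 0.4, "adorable": 0.3, "kittens": 0.3},
}

def generate_document(topic_mixture, n_words):
    """Generate a bag-of-words document: for each word, pick a topic
    from the document's topic mixture, then pick a word from that
    topic's word distribution."""
    words = []
    for _ in range(n_words):
        topic = random.choices(list(topic_mixture), weights=topic_mixture.values())[0]
        word_dist = topics[topic]
        words.append(random.choices(list(word_dist), weights=word_dist.values())[0])
    return words

# A 5-word document that is 1/2 about food and 1/2 about cute animals.
print(generate_document({"food": 0.5, "animals": 0.5}, 5))
```

(The Poisson draw for the document length is elided; we just pass 5 directly, as in the example.)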

Learning

So now suppose you have a set of documents. You’ve chosen some fixed number of K topics to discover, and want to use LDA to learn the topic representation of each document and the words associated to each topic. How do you do this? One way (known as collapsed Gibbs sampling) is the following:

• Go through each document, and randomly assign each word in the document to one of the K topics.
• Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
• So to improve on them, for each document d…
  • Go through each word w in d…
    • And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where we choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability). (Also, I’m glossing over a couple of things here, in particular the use of priors/pseudocounts in these probabilities.)
    • In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
• After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
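The steps above can be sketched as a minimal, unoptimized collapsed Gibbs sampler on the five example sentences. `ALPHA` and `BETA` are the priors/pseudocounts glossed over above; their values here are arbitrary:

```python
import random
from collections import defaultdict

random.seed(0)

documents = [
    "i like to eat broccoli and bananas".split(),
    "i ate a banana and spinach smoothie for breakfast".split(),
    "chinchillas and kittens are cute".split(),
    "my sister adopted a kitten yesterday".split(),
    "look at this cute hamster munching on a piece of broccoli".split(),
]
K = 2                   # number of topics to discover
ALPHA, BETA = 0.1, 0.1  # Dirichlet priors (pseudocounts)
vocab = {w for doc in documents for w in doc}

# Step 1: randomly assign each word to a topic, and tally the counts.
assignments = [[random.randrange(K) for _ in doc] for doc in documents]
doc_topic = [[0] * K for _ in documents]           # words in doc d with topic t
topic_word = [defaultdict(int) for _ in range(K)]  # assignments of word w to topic t
topic_total = [0] * K
for d, doc in enumerate(documents):
    for i, w in enumerate(doc):
        t = assignments[d][i]
        doc_topic[d][t] += 1
        topic_word[t][w] += 1
        topic_total[t] += 1

# Step 2: repeatedly resample each word's topic, holding all the
# other assignments fixed.
for _ in range(200):
    for d, doc in enumerate(documents):
        for i, w in enumerate(doc):
            t = assignments[d][i]
            # Remove the current assignment from the counts...
            doc_topic[d][t] -= 1
            topic_word[t][w] -= 1
            topic_total[t] -= 1
            # ...compute p(topic t | doc d) * p(word w | topic t),
            # smoothed by the pseudocounts, for each topic...
            weights = [
                (doc_topic[d][t2] + ALPHA)
                * (topic_word[t2][w] + BETA) / (topic_total[t2] + BETA * len(vocab))
                for t2 in range(K)
            ]
            # ...and sample a fresh topic with those weights.
            t = random.choices(range(K), weights=weights)[0]
            assignments[d][i] = t
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

# Topic mixture of each document = proportion of its words per topic.
for d, counts in enumerate(doc_topic):
    total = sum(counts)
    print(d, [round(c / total, 2) for c in counts])
```

With only five tiny documents the learned topics are noisy, but the same loop scales directly to real corpora.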

Layman’s Explanation

In case the discussion above was a little eye-glazing, here’s another way to look at LDA in a different domain.

Suppose you’ve just moved to a new city. You’re a hipster and an anime fan, so you want to know where the other hipsters and anime geeks tend to hang out. Of course, as a hipster, you know you can’t just ask, so what do you do?

Here’s the scenario: you scope out a bunch of different establishments (documents) across town, making note of the people (words) hanging out in each of them (e.g., Alice hangs out at the mall and at the park, Bob hangs out at the movie theater and the park, and so on). Crucially, you don’t know the typical interest groups (topics) of each establishment, nor do you know the different interests of each person.

So you pick some number K of categories to learn (i.e., you want to learn the K most important kinds of categories people fall into), and start by making a guess as to why you see people where you do. For example, you initially guess that Alice is at the mall because people with interests in X like to hang out there; when you see her at the park, you guess it’s because her friends with interests in Y like to hang out there; when you see Bob at the movie theater, you randomly guess it’s because the Z people in this city really like to watch movies; and so on.

Of course, your random guesses are very likely to be incorrect (they’re random guesses, after all!), so you want to improve on them. One way of doing so is to:

• Pick a place and a person (e.g., Alice at the mall).
• Why is Alice likely to be at the mall? Probably because other people at the mall with the same interests sent her a message telling her to come.
• In other words, the more people with interests in X there are at the mall and the stronger Alice is associated with interest X (at all the other places she goes to), the more likely it is that Alice is at the mall because of interest X.
• So make a new guess as to why Alice is at the mall, choosing an interest with some probability according to how likely you think it is.

Go through each place and person over and over again. Your guesses keep getting better and better (after all, if you notice that lots of geeks hang out at the bookstore, and you suspect that Alice is pretty geeky herself, then it’s a good bet that Alice is at the bookstore because her geek friends told her to go there; and now that you have a better idea of why Alice is probably at the bookstore, you can use this knowledge in turn to improve your guesses as to why everyone else is where they are), and eventually you can stop updating. Then take a snapshot (or multiple snapshots) of your guesses, and use it to get all the information you want:

• For each category, you can count the people assigned to that category to figure out which people have this particular interest. By looking at the people themselves, you can interpret the category as well (e.g., if category X contains lots of tall people wearing jerseys and carrying around basketballs, you might interpret X as the “basketball players” group).
• For each place P and interest category C, you can compute the proportions of people at P because of C (under the current set of assignments), and these give you a representation of P. For example, you might learn that the people who hang out at Barnes & Noble consist of 10% hipsters, 50% anime fans, 10% jocks, and 30% college students.
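The guess-and-update procedure above is essentially collapsed Gibbs sampling, with people-and-places standing in for words-and-documents. Here’s a minimal sketch in Python; the corpus, the number of topics K, and the smoothing parameters alpha and beta below are all made up for illustration, not tuned values:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=200, alpha=0.1, beta=0.01):
    """Collapsed Gibbs sampling for LDA. docs is a list of token lists."""
    vocab = {w for doc in docs for w in doc}
    V = len(vocab)
    # Counts: topics per document, words per topic.
    doc_topic = [defaultdict(int) for _ in docs]
    topic_word = [defaultdict(int) for _ in range(K)]
    topic_total = [0] * K
    # Start with a random guess for every word's topic.
    z = [[random.randrange(K) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d][t] += 1; topic_word[t][w] += 1; topic_total[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]  # remove the current guess...
                doc_topic[d][t] -= 1; topic_word[t][w] -= 1; topic_total[t] -= 1
                # ...and re-guess: p(topic) is proportional to
                # (how much doc d likes the topic) * (how much the topic likes word w)
                weights = [(doc_topic[d][k] + alpha) *
                           (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                           for k in range(K)]
                t = random.choices(range(K), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1; topic_word[t][w] += 1; topic_total[t] += 1
    return z, doc_topic, topic_word
```

The final `doc_topic` counts give the per-place topic proportions, and the `topic_word` counts give the people (words) assigned to each interest group, just as described above.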

Real-World Example

Finally, I applied LDA to a set of Sarah Palin’s emails a little while ago (see here for the blog post, or here for an app that allows you to browse through the emails by the LDA-learned categories), so let’s give a brief recap. Here are some of the topics that the algorithm learned:

• Trig/Family/Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, …
• Wildlife/BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, denby, fishing, …
• Energy/Fuel/Oil/Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, …
• Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, …
• Education/Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, …
• Presidential Campaign/Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, …

Here’s an example of an email which fell 99% into the Trig/Family/Inspiration category (particularly representative words are highlighted in blue):

And here’s an excerpt from an email which fell 10% into the Presidential Campaign/Election category (in red) and 90% into the Wildlife/BP Corrosion category (in green):

Tweets vs. Likes: What gets shared on Twitter vs. Facebook?

It always strikes me as curious that some posts get a lot of love on Twitter, while others get many more shares on Facebook:

What accounts for this difference? Some of it is surely site-dependent: maybe one blogger has a Facebook page but not a Twitter account, while another has a Twitter account but no Facebook page. But even on sites maintained by a single author, tweets-to-likes ratios can vary widely from post to post.

So what kinds of articles tend to be more popular on Twitter, and which spread more easily on Facebook? To take a stab at an answer, I scraped data from a couple of websites over the weekend.

tl;dr Twitter is still for the techies: articles whose tweets greatly outnumber their Facebook likes tend to revolve around software companies and programming. Facebook, on the other hand, appeals to everyone else: the masses, and non-software technical folks in general.

FlowingData

The first site I looked at was Nathan Yau’s awesome FlowingData website on data visualization. To see which articles are more popular on Facebook and which are more popular on Twitter, let’s sort all the FlowingData articles by their # tweets / # likes ratio.

Here are the 10 posts with the lowest tweets-to-likes ratio (i.e., the posts that were especially popular with Facebook users):

And here are the 10 posts with the highest tweets-to-likes ratio (i.e., the posts especially popular with Twitter users):

Notice any differences between the two?

• Instant gratification infographics, cuteness, comics, and pop culture get liked on Facebook.
• APIs, datasets, visualizations related to techie sites (Delicious, foursquare, Twitter, LinkedIn), and picture-less articles get tweeted instead.

Interestingly, it also looks like the colors in the top 10 Facebook articles tend to the red end of the spectrum, while the colors in the top 10 Twitter articles tend to the blue end of the spectrum. Does this pattern hold if we look at more data? Here’s a meta-visualization of the FlowingData articles, sorted by articles popular on Facebook in the top left to articles popular on Twitter in the bottom right (see here for some interactivity and more details):

It does indeed look like the images at the top (the articles popular on Facebook) are more pink, while the images at the bottom (the articles popular on Twitter) are more blue (though it would be nice to quantify this in some way)!

Furthermore, we can easily see from the grid that articles with no visualizations (represented by lorem ipsum text in the grid) cluster at the bottom. Grabbing some actual numbers, we find that 32% of articles with at least one picture have more shares on Facebook than on Twitter, compared to only 4% of articles with no picture at all.

Finally, let’s break down the percentage of articles with more Facebook shares by category.

(I filtered the categories so that each category in the plot above contains at least 5 articles.)

What do we find?

• Articles in the Software, Online Applications, News, and Data sources categories (yawn) get 100% of their shares from Twitter.
• Articles tagged with Data Underload (which seems to contain short and sweet visualizations of everyday things), Miscellaneous (which contains lots of comics or comic-like visualizations), and Infographics get the most shares on Facebook.
• This category breakdown matches precisely what we saw in the top 10 examples above.

New Scientist

When looking at FlowingData, we saw that Twitter users are much bigger on sharing technical articles. But is this true for technical articles in general, or only for programming-related posts? (In my experience with Twitter, I haven’t seen many people from math and the non-computer sciences.)

To answer, I took articles from the Physics & Math and Technology sections of New Scientist, and

• Calculated the percentage of shares each article received on Twitter (i.e., # tweets / (# tweets + # likes)).
• Grouped articles by their number of tweets rounded to the nearest multiple of 25 (bin #1 contains articles close to 25 tweets, bin #2 contains articles close to 50 tweets, etc.).
• Calculated the median percentage of shares on Twitter for each bin.
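In code, that binning might look something like the following sketch (not the script I actually used; the article tuples in the usage line are placeholders):

```python
from statistics import median

def twitter_share_by_bin(articles, bin_size=25):
    """Bin articles by tweet count and take the median Twitter share per bin.
    Each article is a (tweets, likes) pair."""
    bins = {}
    for tweets, likes in articles:
        if tweets + likes == 0:
            continue  # no shares at all: skip
        b = round(tweets / bin_size)  # bin by tweets rounded to the nearest 25
        share = tweets / (tweets + likes)  # percentage of shares on Twitter
        bins.setdefault(b, []).append(share)
    return {b: median(shares) for b, shares in sorted(bins.items())}

# Hypothetical articles: (tweets, likes)
shares = twitter_share_by_bin([(20, 20), (30, 10), (55, 45)])
```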

Here’s a graph of the result:

Notice that:

• The technology articles get consistently more shares from Twitter than the physics and math articles do.
• Twitter accounts for the majority of the technology shares.
• Facebook accounts for the majority of the physics and math shares.

So this suggests that Twitter really is for computer technology in particular, not technical matters in general (though it would be nice to look at areas other than physics and math as well).

Quora

To get some additional evidence on the computer science vs. math/physics divide, I

• Scraped about 350 profiles of followers from each of the Computer Science, Software Engineering, Mathematics, and Physics categories on Quora;
• Checked each user to see whether they link to their Facebook and Twitter accounts on their profile.

Here’s the ratio of the number of people linking to their Facebook account to the number of people linking to their Twitter account, sliced by topic:

We find exactly what we expect from the New Scientist data: people following the math and physics categories have noticeably smaller Twitter / Facebook ratios than people following the computer science and software engineering categories (i.e., compared to computer scientists and software engineers, mathematicians and physicists are more likely to be on Facebook than on Twitter). What’s more, this difference is in fact significant: the graphs display individual 90% confidence intervals (which overlap only slightly, if at all), and we do indeed get significance at the 95% level if we look at the differences between categories.
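For reference, a confidence interval on a linking proportion like this can be computed with the standard normal approximation. A quick sketch (the counts in the usage line are made up, not my actual data):

```python
import math

def proportion_ci(successes, n, z=1.645):
    """Normal-approximation confidence interval for a proportion.
    z = 1.645 gives a 90% interval, z = 1.96 a 95% one."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Hypothetical example: 120 of 350 scraped Mathematics followers link to Twitter.
low, high = proportion_ci(120, 350)
```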

This corroborates the New Scientist evidence that Twitter gets the computer technology shares, while Facebook gets the math and physics shares.

XKCD

Finally, let’s take a look at which XKCD comics are especially popular on Facebook vs. Twitter.

Here are the 10 comics with the highest likes-to-tweets ratio (i.e., the comics especially popular on Facebook):

Here are the 10 comics with the highest tweets-to-likes ratio (i.e., the comics especially popular on Twitter):

Note that the XKCD comics popular on Facebook have more of a layman flavor, while the XKCD comics popular on Twitter are much more programming-related:

• Of the XKCD comics popular on Twitter, one’s about server attention spans, another’s about IPv6 addresses, a third is about GNU info pages, another deals with cloud computing, a fifth talks about Java, and the last is about a bunch of techie sites. (This is just like what we saw with the FlowingData visualizations.)
• Facebook, on the other hand, gets Ke$ha and Magic School Bus.
• And while both top 10’s contain a flowchart, the one popular on FB is about cooking, while the one popular on Twitter is about code!
• What’s more, if we look at the few technical-ish comics that are more popular on Facebook (the complex conjugate, mu, and Los Alamos comics), we see that they’re about physics and math, not programming (which matches our findings from the New Scientist articles).

Lesson

So why should you care? Here’s one takeaway:

• If you’re blogging about technology, programming, and computer science, Twitter is your friend.
• But if you’re blogging about anything else, be it math/physics or pop culture, don’t rely on a Twitter account alone; your shares are more likely to propagate on Facebook, so make sure to have a Facebook page as well.

What’s Next?

The three websites I looked at are all fairly tech-oriented, so it would be nice to gather data from other kinds of websites as well. And now that we have an idea how Twitter and Facebook compare, the next burning question is surely: what do people share on Google+?!

Addendum

Let’s consider the following thought experiment. Suppose you come across the most unpopular article ever written. What will its FB vs. Twitter shares look like? Although no real person will ever share this article, I think Twitter has many more spambots (who tweet out any and every link) than FB does, so maybe unpopular articles will have more tweets than likes by default.

Conversely, suppose you come across the most popular article ever written, which everybody wants to share. Then since FB has many more users than Twitter does, maybe popular articles will tend to have more likes than tweets anyways.

Thus, in order to find out which types of articles are especially popular on FB vs. Twitter, instead of looking at tweets-to-likes ratios directly, we could try to remove this baseline popularity effect.
(Taking ratios instead of raw number of tweets or raw number of likes is one kind of normalization; this is another.)

So does this scenario (or something similar to it) actually play out in practice? Here I’ve plotted the overall popularity of a post (the total number of shares it received on either Twitter or FB) against the percentage of shares on Facebook alone, and we can see that as a post’s popularity grows, more and more shares do indeed tend to come from Facebook rather than Twitter.

Also, see the posts at the lower end of the popularity scale that are only getting shares on Twitter? Let’s take a look at the five most unpopular of these:

Notice that they’re all shoutouts to FlowingData’s sponsors! There’s pretty much no reason any real person would share these on Twitter or Facebook, and indeed, checking Twitter to see who actually tweeted out these links, we see that the tweeters are bots:

Now let’s switch to a slightly different view of the above scenario, where I plot number of tweets against number of likes:

We see that as popularity on Twitter increases, so too does popularity on Facebook – but at a slightly faster rate. (The form of the blue line plotted is roughly $\log(likes) = -3.87 + 1.70 \log(tweets)$.)

So instead of looking at the ratios above, to figure out which articles are popular on FB vs. Twitter, we could look at the residuals of the above plot. Posts with large positive residuals would be posts that are especially popular on FB, and posts with negative residuals would be posts that are especially popular on Twitter.

In practice, however, there wasn’t much difference between looking at residuals vs. ratios directly when using the datasets I had, so to keep things simple in the main discussion above, I stuck to ratios alone. Still, it’s another option which might be useful when looking at different questions or different sources of data, so just for completeness, here’s what the FlowingData results look like if we use residuals instead.
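Concretely, the residual approach can be sketched as an ordinary least-squares fit in log space; the list of (tweets, likes) pairs below is made up and just stands in for the real data:

```python
import math

def loglog_residuals(posts):
    """Least-squares fit of log(likes) = a + b * log(tweets), returning each
    post's residual. posts: list of (tweets, likes) pairs, both positive."""
    xs = [math.log(t) for t, _ in posts]
    ys = [math.log(l) for _, l in posts]
    n = len(posts)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
         sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    # Positive residual: more likes than the trend line predicts (FB-leaning post);
    # negative residual: Twitter-leaning post.
    return [y - (a + b * x) for x, y in zip(xs, ys)]
```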
The 10 articles with the highest residuals (i.e., the articles most popular on Facebook):

The 10 articles with the lowest residuals (i.e., the articles most popular on Twitter):

Here’s a density plot of article residuals, split by whether the article has a visualization or not (residuals of picture-free articles are clearly shifted towards the negative end):

Here are the mean residuals per category (again, we see that the miscellaneous, data underload, data art, and infographics categories tend to be more popular on Facebook, while the data sources, software, online applications, and news categories tend to be more popular on Twitter):

And that’s it! In the spirit of these findings, I hope this article gets liked a little and tweeted lots and lots.

Introduction to Restricted Boltzmann Machines

Suppose you ask a bunch of users to rate a set of movies on a 0-100 scale. In classical factor analysis, you could then try to explain each movie and user in terms of a set of latent factors. For example, movies like Star Wars and Lord of the Rings might have strong associations with a latent science fiction and fantasy factor, and users who like Wall-E and Toy Story might have strong associations with a latent Pixar factor.

Restricted Boltzmann Machines essentially perform a binary version of factor analysis. (This is one way of thinking about RBMs; there are, of course, others, and lots of different ways to use RBMs, but I’ll adopt this approach for this post.) Instead of users rating a set of movies on a continuous scale, they simply tell you whether they like a movie or not, and the RBM will try to discover latent factors that can explain the activation of these movie choices.
More technically, a Restricted Boltzmann Machine is a stochastic neural network (neural network meaning we have neuron-like units whose binary activations depend on the neighbors they’re connected to; stochastic meaning these activations have a probabilistic element) consisting of:

• One layer of visible units (users’ movie preferences whose states we know and set);
• One layer of hidden units (the latent factors we try to learn); and
• A bias unit (whose state is always on, and is a way of adjusting for the different inherent popularities of each movie).

Furthermore, each visible unit is connected to all the hidden units (this connection is undirected, so each hidden unit is also connected to all the visible units), and the bias unit is connected to all the visible units and all the hidden units. To make learning easier, we restrict the network so that no visible unit is connected to any other visible unit and no hidden unit is connected to any other hidden unit.

For example, suppose we have a set of six movies (Harry Potter, Avatar, LOTR 3, Gladiator, Titanic, and Glitter) and we ask users to tell us which ones they want to watch. If we want to learn two latent units underlying movie preferences – for example, two natural groups in our set of six movies appear to be SF/fantasy (containing Harry Potter, Avatar, and LOTR 3) and Oscar winners (containing LOTR 3, Gladiator, and Titanic), so we might hope that our latent units will correspond to these categories – then our RBM would look like the following:

(Note the resemblance to a factor analysis graphical model.)

State Activation

Restricted Boltzmann Machines, and neural networks in general, work by updating the states of some neurons given the states of others, so let’s talk about how the states of individual units change.
Assuming we know the connection weights in our RBM (we’ll explain how to learn these below), to update the state of unit $i$:

• Compute the activation energy $a_i = \sum_j w_{ij} x_j$ of unit $i$, where the sum runs over all units $j$ that unit $i$ is connected to, $w_{ij}$ is the weight of the connection between $i$ and $j$, and $x_j$ is the 0 or 1 state of unit $j$. In other words, all of unit $i$’s neighbors send it a message, and we compute the sum of all these messages.
• Let $p_i = \sigma(a_i)$, where $\sigma(x) = 1/(1 + \exp(-x))$ is the logistic function. Note that $p_i$ is close to 1 for large positive activation energies, and $p_i$ is close to 0 for negative activation energies.
• We then turn unit $i$ on with probability $p_i$, and turn it off with probability $1 - p_i$.
• (In layman’s terms, units that are positively connected to each other try to get each other to share the same state (i.e., be both on or off), while units that are negatively connected to each other are enemies that prefer to be in different states.)

For example, let’s suppose our two hidden units really do correspond to SF/fantasy and Oscar winners.

• If Alice has told us her six binary preferences on our set of movies, we could then ask our RBM which of the hidden units her preferences activate (i.e., ask the RBM to explain her preferences in terms of latent factors). So the six movies send messages to the hidden units, telling them to update themselves. (Note that even if Alice has declared she wants to watch Harry Potter, Avatar, and LOTR 3, this doesn’t guarantee that the SF/fantasy hidden unit will turn on, but only that it will turn on with high probability. This makes a bit of sense: in the real world, Alice wanting to watch all three of those movies makes us highly suspect she likes SF/fantasy in general, but there’s a small chance she wants to watch them for other reasons. Thus, the RBM allows us to generate models of people in the messy, real world.)
• Conversely, if we know that one person likes SF/fantasy (so that the SF/fantasy unit is on), we can then ask the RBM which of the movie units that hidden unit turns on (i.e., ask the RBM to generate a set of movie recommendations). So the hidden units send messages to the movie units, telling them to update their states. (Again, note that the SF/fantasy unit being on doesn’t guarantee that we’ll always recommend all three of Harry Potter, Avatar, and LOTR 3 because, hey, not everyone who likes science fiction liked Avatar.)

Learning Weights

So how do we learn the connection weights in our network? Suppose we have a bunch of training examples, where each training example is a binary vector with six elements corresponding to a user’s movie preferences. Then for each epoch, do the following:

• Take a training example (a set of six movie preferences). Set the states of the visible units to these preferences.
• Next, update the states of the hidden units using the logistic activation rule described above: for the $j$th hidden unit, compute its activation energy $a_j = \sum_i w_{ij} x_i$, and set $x_j$ to 1 with probability $\sigma(a_j)$ and to 0 with probability $1 - \sigma(a_j)$. Then for each edge $e_{ij}$, compute $Positive(e_{ij}) = x_i x_j$ (i.e., for each pair of units, measure whether they’re both on).
• Now reconstruct the visible units in a similar manner: for each visible unit, compute its activation energy $a_i$, and update its state. (Note that this reconstruction may not match the original preferences.) Then update the hidden units again, and compute $Negative(e_{ij}) = x_i x_j$ for each edge.
• Update the weight of each edge $e_{ij}$ by setting $w_{ij} = w_{ij} + L (Positive(e_{ij}) - Negative(e_{ij}))$, where $L$ is a learning rate.
• Repeat over all training examples.
Continue until the network converges (i.e., the error between the training examples and their reconstructions falls below some threshold) or we reach some maximum number of epochs.

Why does this update rule make sense? Note that:

• In the first phase, $Positive(e_{ij})$ measures the association between the $i$th and $j$th unit that we want the network to learn from our training examples;
• In the “reconstruction” phase, where the RBM generates the states of visible units based on its hypotheses about the hidden units alone, $Negative(e_{ij})$ measures the association that the network itself generates (or “daydreams” about) when no units are fixed to training data.

So by adding $Positive(e_{ij}) - Negative(e_{ij})$ to each edge weight, we’re helping the network’s daydreams better match the reality of our training examples. (You may hear this update rule called contrastive divergence, which is basically a funky term for “approximate gradient descent”.)

Examples

I wrote a simple RBM implementation in Python (the code is heavily commented, so take a look if you’re still a little fuzzy on how everything works), so let’s use it to walk through some examples.

First, I trained the RBM using some fake data.

• Alice: (Harry Potter = 1, Avatar = 1, LOTR 3 = 1, Gladiator = 0, Titanic = 0, Glitter = 0). Big SF/fantasy fan.
• Bob: (Harry Potter = 1, Avatar = 0, LOTR 3 = 1, Gladiator = 0, Titanic = 0, Glitter = 0). SF/fantasy fan, but doesn’t like Avatar.
• Carol: (Harry Potter = 1, Avatar = 1, LOTR 3 = 1, Gladiator = 0, Titanic = 0, Glitter = 0). Big SF/fantasy fan.
• David: (Harry Potter = 0, Avatar = 0, LOTR 3 = 1, Gladiator = 1, Titanic = 1, Glitter = 0). Big Oscar winners fan.
• Eric: (Harry Potter = 0, Avatar = 0, LOTR 3 = 1, Gladiator = 1, Titanic = 0, Glitter = 0). Oscar winners fan, except for Titanic.
• Fred: (Harry Potter = 0, Avatar = 0, LOTR 3 = 1, Gladiator = 1, Titanic = 1, Glitter = 0). Big Oscar winners fan.
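A heavily condensed sketch of this CD-1 training loop (not the full implementation mentioned above; the learning rate and epoch count are arbitrary choices) might look like:

```python
import math
import random

def train_rbm(data, num_hidden, epochs=1000, lr=0.1):
    """Train an RBM with one step of contrastive divergence (CD-1).
    data: list of binary visible vectors. Returns a weight matrix whose
    0th row and 0th column belong to the always-on bias unit."""
    num_visible = len(data[0])
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    # weights[i][j]: connection between visible unit i and hidden unit j.
    w = [[random.gauss(0, 0.1) for _ in range(num_hidden + 1)]
         for _ in range(num_visible + 1)]

    for _ in range(epochs):
        for example in data:
            v = [1] + list(example)  # visible states, bias unit prepended
            # Positive phase: activate the hidden units from the data.
            h = [1] + [1 if random.random() < sigmoid(
                    sum(v[i] * w[i][j] for i in range(num_visible + 1)))
                    else 0 for j in range(1, num_hidden + 1)]
            pos = [[v[i] * h[j] for j in range(num_hidden + 1)]
                   for i in range(num_visible + 1)]
            # Negative ("daydream") phase: reconstruct the visible units,
            # then update the hidden units once more.
            v2 = [1] + [1 if random.random() < sigmoid(
                    sum(h[j] * w[i][j] for j in range(num_hidden + 1)))
                    else 0 for i in range(1, num_visible + 1)]
            h2 = [1] + [1 if random.random() < sigmoid(
                    sum(v2[i] * w[i][j] for i in range(num_visible + 1)))
                    else 0 for j in range(1, num_hidden + 1)]
            # Nudge each weight toward the data and away from the daydream.
            for i in range(num_visible + 1):
                for j in range(num_hidden + 1):
                    w[i][j] += lr * (pos[i][j] - v2[i] * h2[j])
    return w
```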
The network learned the following weights:

                  Bias Unit     Hidden 1      Hidden 2
  Bias Unit      -0.08257658   -0.19041546    1.57007782
  Harry Potter   -0.82602559   -7.08986885    4.96606654
  Avatar         -1.84023877   -5.18354129    2.27197472
  LOTR 3          3.92321075    2.51720193    4.11061383
  Gladiator       0.10316995    6.74833901   -4.00505343
  Titanic        -0.97646029    3.25474524   -5.59606865
  Glitter        -4.44685751   -2.81563804   -2.91540988

Note that the first hidden unit seems to correspond to the Oscar winners, and the second hidden unit seems to correspond to the SF/fantasy movies, just as we were hoping.

What happens if we give the RBM a new user, George, who has (Harry Potter = 0, Avatar = 0, LOTR 3 = 0, Gladiator = 1, Titanic = 1, Glitter = 0) as his preferences? It turns the Oscar winners unit on (but not the SF/fantasy unit), correctly guessing that George probably likes movies that are Oscar winners.

What happens if we activate only the SF/fantasy unit, and run the RBM a bunch of different times? In my trials, it turned on Harry Potter, Avatar, and LOTR 3 three times; it turned on Avatar and LOTR 3, but not Harry Potter, once; and it turned on Harry Potter and LOTR 3, but not Avatar, twice. Note that, based on our training examples, these generated preferences do indeed match what we might expect real SF/fantasy fans want to watch.

Modifications

I tried to keep the connection-learning algorithm I described above pretty simple, so here are some modifications that often appear in practice:

• Above, $Negative(e_{ij})$ was determined by taking the product of the $i$th and $j$th units after reconstructing the visible units once and then updating the hidden units again. We could also take the product after some larger number of reconstructions (i.e., repeat updating the visible units, then the hidden units, then the visible units again, and so on); this is slower, but describes the network’s daydreams more accurately.
• Instead of using $Positive(e_{ij}) = x_i x_j$, where $x_i$ and $x_j$ are binary 0 or 1 states, we could also let $x_i$ and/or $x_j$ be activation probabilities. Similarly for $Negative(e_{ij})$.
• We could penalize larger edge weights, in order to get a sparser or more regularized model.
• When updating edge weights, we could use a momentum factor: we would add to each edge a weighted sum of the current step as described above (i.e., $L (Positive(e_{ij}) - Negative(e_{ij}))$) and the step previously taken.
• Instead of using only one training example in each epoch, we could use batches of examples in each epoch, and only update the network’s weights after passing through all the examples in the batch. This can speed up the learning by taking advantage of fast matrix-multiplication algorithms.

Further

If you’re interested in learning more about Restricted Boltzmann Machines, here are some good links.

Topic Modeling the Sarah Palin Emails

LDA-based Email Browser

Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I’ve been working on some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups.

I threw up a simple demo app to view the organized documents here.

What is Latent Dirichlet Allocation?

Briefly, given a set of documents, LDA tries to learn the latent topics underlying the set. It represents each document as a mixture of topics (generated from a Dirichlet distribution), each of which emits words with a certain probability.
For example, given the sentence “I listened to Justin Bieber and Lady Gaga on the radio while driving around in my car”, an LDA model might represent this sentence as 75% about music (a topic which, say, emits the words Bieber with 10% probability, Gaga with 5% probability, radio with 1% probability, and so on) and 25% about cars (which might emit driving with 15% probability and cars with 10% probability).

If you’re familiar with latent semantic analysis, you can think of LDA as a generative version. (For a more in-depth explanation, I wrote an introduction to LDA here.)

Sarah Palin Email Topics

Here’s a sample of the topics learnt by the model, as well as the top words for each topic. (Names, of course, are based on my own interpretation.)

• Wildlife/BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, denby, fishing, …
• Energy/Fuel/Oil/Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, …
• Trig/Family/Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, …
• Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, …
• Education/Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, …
• Presidential Campaign/Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely,
wasilla, work, keep, make, add, family, republican, support, doing, p.o, …

Here’s a sample email from the wildlife topic:

I also thought the classification for this email was really neat: the LDA model labeled it as 10% in the Presidential Campaign/Elections topic and 90% in the Wildlife topic, and it’s precisely a wildlife-based protest against Palin as a choice for VP:

Future Analysis

In a future post, I’ll perhaps see if we can glean any interesting patterns from the email topics. For example, for a quick graph now, if we look at the percentage of emails in the Trig/Family/Inspiration topic across time, we see that there’s a spike in April 2008 – exactly (and unsurprisingly) the month in which Trig was born.

Filtering for English Tweets: Unsupervised Language Detection on Twitter

(See a demo here.)

While working on a Twitter sentiment analysis project, I ran into the problem of needing to filter out all non-English tweets. (Asking the Twitter API for English-only tweets doesn’t seem to work, as it nonetheless returns tweets in Spanish, Portuguese, Dutch, Russian, and a couple other languages.)

Since I didn’t have any labeled data, I thought it would be fun to build an unsupervised language classifier. In particular, using an EM algorithm to build a naive Bayes model of English vs. non-English n-gram probabilities turned out to work quite well, so here’s a description.

EM Algorithm

Let’s recall the naive Bayes algorithm: given a tweet (a set of character n-grams), we estimate its language to be the language $L$ that maximizes

$$P(language = L | ngrams) \propto P(ngrams | language = L) P(language = L)$$

Thus, we need to estimate $P(ngram | language = L)$ and $P(language = L)$. This would be easy if we knew the language of each tweet, since we could estimate:

• $P(xyz | language = English)$ as #(number of times “xyz” is a trigram in the English tweets) / #(total trigrams in the English tweets)
• $P(language = English)$ as the proportion of English tweets.
Or, it would also be easy if we knew the n-gram probabilities for each language, since we could use Bayes’ theorem to compute the language probabilities for each tweet, and then take a weighted variant of the previous paragraph.

The problem is that we know neither of these. So what the EM algorithm says is that we can simply guess:

• Pretend we know the language of each tweet (by randomly assigning them at the beginning).
• Using this guess, we can compute the n-gram probabilities for each language.
• Using the n-gram probabilities for each language, we can recompute the language probabilities of each tweet.
• Using these recomputed language probabilities, we can recompute the n-gram probabilities.
• And so on, recomputing the language probabilities and n-gram probabilities over and over.

While our guesses will be off in the beginning, the probabilities will eventually converge to a (local) maximum of the likelihood. (In my tests, my language detector would sometimes correctly converge to an English detector, and sometimes it would converge to an English-and-Dutch detector.)

EM Analogy for the Layman

Why does this work? Suppose you suddenly move to New York, and you want a way to differentiate between tourists and New Yorkers based on their activities. Initially, you don’t know who’s a tourist and who’s a New Yorker, and you don’t know which are touristy activities and which are not. So you randomly place people into two groups A and B. (You randomly assign all tweets to a language.)

Now, given all the people in group A, you notice that a large number of them visit the Statue of Liberty; similarly, you notice that a large number of people in group B walk really quickly. (You notice that one set of words often has the n-gram “ing”, and that another set of words often has the n-gram “ias”; that is, you fix the language probabilities for each tweet, and recompute the n-gram probabilities for each language.)
So you start to put people visiting the Statue of Liberty in group A, and you start to put fast walkers in group B. (You fix the n-gram probabilities for each language, and recompute the language probabilities for each tweet.)

With your new A and B groups, you notice more differentiating factors: group A people tend to carry cameras, and group B people tend to be more finance-savvy. So you start to put camera-carrying folks in group A, and finance-savvy folks in group B. And so on. Eventually, you settle on two groups of people with differentiating activities: people who walk slowly and visit the Statue of Liberty, and busy-looking people who walk fast and don’t visit. Assuming there are more native New Yorkers than tourists, you can then guess that the natives are the larger group.

Results

I wrote some Ruby code to implement the above algorithm, and trained it on half a million tweets, using English and “not English” as my two languages. The results looked surprisingly good from just eyeballing.

But in order to get some hard metrics and to tune parameters (e.g., n-gram size), I needed a labeled dataset. So I pulled a set of English-language and Spanish-language documents from Project Gutenberg, and split them to form training and test sets (the training set consisted of 2000 lines of English and 1000 lines of Spanish; the test set, of 1000 lines of English and 1000 lines of Spanish).

Trained on bigrams, the detector resulted in:

• 991 true positives (English lines correctly classified as English)
• 9 false negatives (English lines incorrectly classified as Spanish)
• 11 false positives (Spanish lines incorrectly classified as English)
• 989 true negatives (Spanish lines correctly classified as Spanish)

for a precision of 0.989 and a recall of 0.991.

Trained on trigrams, the detector resulted in:

• 992 true positives
• 8 false negatives
• 10 false positives
• 990 true negatives

for a precision of 0.990 and a recall of 0.992.
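For concreteness, the EM loop described above can be sketched in pure Python (again an illustrative toy, not the Ruby code from the post; it uses trigrams, add-one smoothing, and two anonymous classes, and which class ends up “English” depends on the random initialization):

```python
import math
import random
from collections import defaultdict

def trigrams(text):
    return [text[i:i + 3] for i in range(len(text) - 2)]

def em_language_model(tweets, n_iter=20, seed=0):
    """Two-class naive Bayes mixture over character trigrams, fit by EM.
    Returns p(class 0 | tweet) for each tweet."""
    rng = random.Random(seed)
    grams = [trigrams(t) for t in tweets]
    vocab = {g for gs in grams for g in gs}
    # Start by randomly (softly) assigning each tweet to a language.
    resp = []
    for _ in tweets:
        r = rng.random()
        resp.append([r, 1 - r])
    for _ in range(n_iter):
        # M-step: recompute class priors and n-gram counts from assignments.
        prior = [0.0, 0.0]
        counts = [defaultdict(float), defaultdict(float)]
        for gs, r in zip(grams, resp):
            for k in (0, 1):
                prior[k] += r[k]
                for g in gs:
                    counts[k][g] += r[k]
        totals = [sum(c.values()) for c in counts]
        # E-step: recompute each tweet's language probabilities
        # (with add-one smoothing to avoid zero counts).
        for i, gs in enumerate(grams):
            logp = []
            for k in (0, 1):
                lp = math.log((prior[k] + 1) / (len(tweets) + 2))
                for g in gs:
                    lp += math.log((counts[k][g] + 1) / (totals[k] + len(vocab)))
                logp.append(lp)
            m = max(logp)
            w = [math.exp(v - m) for v in logp]
            resp[i] = [w[0] / (w[0] + w[1]), w[1] / (w[0] + w[1])]
    return [r[0] for r in resp]
```

With only a handful of tweets the clusters are noisy; in practice you’d run this over many thousands of tweets and identify the larger cluster by eyeballing it.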
Also, when I looked at the sentences the detector made errors on, I saw that they almost always consisted of only one or two words (e.g., the incorrectly classified lines were things like “inmortal”, “autumn”, and “salir”). So the detector pretty much never made a mistake on a normal sentence!

Code/Demo

I put the code on my Github account, and a quick demo app, trained on trigrams from tweets with lang=”en” according to the Twitter API, is here.

Choosing a Machine Learning Classifier

How do you know which machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out several different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you’re simply looking for a “good enough” algorithm for your problem, or a place to start, here are some general guidelines I’ve found to work well over the years.

How large is your training set?

If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren’t powerful enough to provide accurate models. You can also think of this as a generative model vs. discriminative model distinction.

Advantages of some particular algorithms

Advantages of Naive Bayes: Super simple, you’re just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge more quickly than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn’t hold, an NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well.
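To see how simple “a bunch of counts” really is, here’s a minimal multinomial Naive Bayes sketch in Python (the spam/ham example data is invented for illustration):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model: just a bunch of counts.
    `docs` is a list of (label, tokens) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(model, tokens):
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label in class_counts:
        total = sum(word_counts[label].values())
        lp = math.log(class_counts[label] / n_docs)
        for t in tokens:
            # Add-one smoothing for unseen words.
            lp += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [("spam", "buy cheap pills".split()),
        ("spam", "cheap deals now".split()),
        ("ham", "lunch meeting tomorrow".split()),
        ("ham", "notes from the meeting".split())]
model = train_nb(docs)
# classify(model, "cheap pills now".split())  → "spam"
```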
Its main disadvantage is that it can’t learn interactions between features (e.g., it can’t learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they’re together).

Advantages of Logistic Regression: Lots of ways to regularize your model, and you don’t have to worry as much about your features being correlated, the way you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

Advantages of Decision Trees: Easy to interpret and explain (for some people – I’m not sure I fall into this camp). They easily handle feature interactions, and they’re non-parametric, so you don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don’t support online learning, so you have to rebuild your tree when new examples come in. Another disadvantage is that they easily overfit, but that’s where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of classification problems (usually slightly ahead of SVMs, I believe); they’re fast and scalable, and you don’t have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.
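The “easily update your model” point about logistic regression is easy to see in code. Here’s a sketch of a single online gradient-descent step (the feature names and data are made up for illustration):

```python
import math

def sgd_step(w, x, y, lr=0.1):
    """One online gradient-descent update for logistic regression.
    w: weight dict (modified in place), x: feature dict, y: 0 or 1."""
    z = sum(w.get(f, 0.0) * v for f, v in x.items())
    p = 1.0 / (1.0 + math.exp(-z))          # current P(y = 1 | x)
    for f, v in x.items():
        w[f] = w.get(f, 0.0) + lr * (y - p) * v
    return p

# New examples can be folded in one at a time -- no retraining from scratch.
w = {}
for _ in range(500):
    sgd_step(w, {"bias": 1.0, "x1": 1.0}, 1)   # positive examples
    sgd_step(w, {"bias": 1.0}, 0)              # negative examples (x1 = 0)
```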
Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn’t linearly separable in the base feature space. Especially popular in text classification problems, where very high-dimensional spaces are the norm. They’re memory-intensive, hard to interpret, and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.

But…

Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, then whichever classification algorithm you use might not matter much in terms of classification performance (so choose your algorithm based on speed or ease of use instead). And to reiterate what I said above, if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize (and Middle Earth), just use an ensemble method to choose them all.

Kickstarter Data Analysis: Success and Pricing

Kickstarter is an online crowdfunding platform for launching creative projects. When starting a new project, project owners specify a deadline and the minimum amount of money they need to raise. They receive the money (less a transaction fee) only if they reach or exceed that minimum; otherwise, no money changes hands.

What’s particularly fun about Kickstarter is that, in contrast to that other microfinance site, Kickstarter projects don’t ask for loans; instead, patrons receive pre-specified rewards unique to each project. For example, someone donating money to help an artist record an album might receive a digital copy of the album if they donate 20 dollars, or a digital copy plus a signed physical CD if they donate 50 dollars.
There are a bunch of neat projects, and I’m tempted to put one of my own up there soon, so I thought it would be fun to gather some data from the site and see what makes a project successful.

Categories

I started by scraping the categories section. In true indie fashion, the artsy categories tend to dominate. (I’m surprised/disappointed by how little love the Technology category gets.)

Ending Soon

The categories section really only provides a history of successful projects, though, so to get some data on unsuccessful projects as well, I took a look at the Ending Soon section of projects whose deadlines are about to pass. It looks like about 50% of all Kickstarter projects get successfully funded by the deadline.

Interestingly, most of the funding seems to happen in the final few days: with just 5 days left, only about 20% of all projects have been funded. (In other words, with just 5 days left, 60% of the projects that will eventually be successful are still unfunded.) So the approaching deadline seems to really spur people to donate. I wonder if it’s because of increased publicity in the final few days (the project owners begging everyone for help!), or if it’s simply procrastination in action (perhaps people want to wait and see if their donation is really necessary)?

Lesson: if you’re still not fully funded with only a couple of days remaining, don’t despair.

Success vs. Failure

What factors lead a project to succeed? Are there any quantitative differences between projects that eventually get funded and those that don’t? Two simple (if kind of obvious) things I noticed are that unsuccessful projects tend to require a larger amount of money, and that they also tend to raise less money in absolute terms (i.e., it’s not just that they ask for too much money to reach their goal – they’re simply not receiving enough money as well). Not terribly surprising, but it’s good to confirm (and I’m still working on finding other predictors).
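The 60% figure follows directly from the two observations above; a quick back-of-the-envelope check, using the approximate percentages read off the graphs:

```python
funded_by_deadline = 0.50  # ~50% of all projects are eventually funded
funded_5_days_out = 0.20   # ~20% of all projects are funded with 5 days left

# Of the projects that will eventually succeed, the fraction still
# unfunded at the 5-day mark:
still_unfunded = (funded_by_deadline - funded_5_days_out) / funded_by_deadline
# → 0.6
```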
Pledge Rewards

There’s a lot of interesting work in behavioral economics on pricing and choice – for example, the anchoring effect suggests that when building a menu, you should include an expensive item to make the other menu items look reasonably priced in comparison, and the paradox of choice suggests that too many choices lead to a decision freeze – so one aspect of the Kickstarter data I was especially interested in was how the pricing of rewards affects donations. For example, does pricing the lowest reward at 25 dollars lead to more money donated (people don’t lowball at 5 dollars instead) or less money donated (25 dollars is more money than most people are willing to give)? And what happens if a new reward at 5 dollars is added – again, does it lead to more money (now people can donate something they can afford) or less money (the people who would have paid 25 dollars switch to a 5 dollar donation)?

First, here’s a look at the total number of pledges at each price. (More accurately, it’s the number of claimed rewards at each price.) [Update: the original version of this graph was wrong, but I’ve since fixed it.] Surprisingly, 5 dollar and 1 dollar donations are actually not the most common contributions.

To investigate pricing effects, I started by looking at all (successful) projects that had a reward priced at 1 dollar, and compared the number of donations at 1 dollar with the number of donations at the next-lowest reward. Up to about 15-20 dollars, there’s a steady increase in the proportion of people who choose the second reward over the first, but after that, the proportion decreases. So this perhaps suggests that if you’re going to price your lowest reward at 1 dollar, your next reward should cost roughly 20 dollars (or slightly more, to maximize your total revenue).
Pricing above 20 dollars is a little too expensive for the folks who want to support you but aren’t rich enough to throw gobs of money around; maybe rewards below 20 dollars aren’t good enough to merit the higher donation.

Next, I’m planning on digging a little deeper into pricing effects and what makes a project successful, so I’ll hopefully have some more Kickstarter analysis in a future post. In the meantime, in case anyone else wants to take a look, I put the data onto my Github account.

A Mathematical Introduction to Least Angle Regression

(For a layman’s introduction, see here.)

Least Angle Regression (aka LARS) is a model selection method for linear regression (useful when you’re worried about overfitting or want your model to be easily interpretable). To motivate it, let’s consider some other model selection methods:

• Forward selection starts with no variables in the model, and at each step it adds to the model the variable with the most explanatory power, stopping when the explanatory power falls below some threshold. This is a fast and simple method, but it can also be too greedy: we fully add variables at each step, so correlated predictors don’t get much of a chance to be included in the model. (For example, suppose we want to build a model for the deliciousness of a PB&J sandwich, and two of our variables are the amount of peanut butter and the amount of jelly. We’d like both variables to appear in our model, but since the amount of peanut butter is (let’s assume) strongly correlated with the amount of jelly, once we fully add peanut butter to our model, jelly doesn’t add much explanatory power anymore, and so it’s unlikely to be added.)
• Forward stagewise regression tries to remedy the greediness of forward selection by only partially adding variables.
Whereas forward selection finds the variable with the most explanatory power and goes all out in adding it to the model, forward stagewise finds the variable with the most explanatory power and updates its weight by only epsilon in the correct direction. (So we might first increase the weight of peanut butter a little bit, then increase the weight of peanut butter again, then increase the weight of jelly, then increase the weight of bread, and then increase the weight of peanut butter once more.) The problem now is that we have to make a ton of updates, so forward stagewise can be very inefficient.

LARS, then, is essentially forward stagewise made fast. Instead of making tiny hops in the direction of one variable at a time, LARS makes optimally-sized leaps in optimal directions. These directions are chosen to make equal angles (equal correlations) with each of the variables currently in our model. (We like peanut butter best, so we start eating it first; as we eat more, we get a little sick of it, so jelly starts looking equally appetizing, and we start eating peanut butter and jelly simultaneously; later, we add bread to the mix, etc.)

In more detail, LARS works as follows:

• Assume for simplicity that we’ve standardized our explanatory variables to have zero mean and unit variance, and that our response variable also has zero mean.
• Start with no variables in your model.
• Find the variable $x_1$ most correlated with the residual. (Note that the variable most correlated with the residual is equivalently the one that makes the least angle with the residual, whence the name.)
• Move in the direction of this variable until some other variable $x_2$ is just as correlated.
• At this point, start moving in a direction such that the residual stays equally correlated with $x_1$ and $x_2$ (i.e., so that the residual makes equal angles with both variables), and keep moving until some variable $x_3$ becomes equally correlated with our residual.
• And so on, stopping when we’ve decided our model is big enough.

For example, consider the following image (slightly simplified from the original LARS paper; $x_1, x_2$ are our variables, and $y$ is our response):

Our model starts at $\hat{\mu_0}$.

• The residual (the green line) makes a smaller angle with $x_1$ than with $x_2$, so we start moving in the direction of $x_1$. At $\hat{\mu_1}$, the residual now makes equal angles with $x_1, x_2$, and so we start moving in a new direction that preserves this equiangularity/equicorrelation.
• If there were more variables, we’d change directions again once a new variable made equal angles with our residual, and so on.
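Forward stagewise itself, the slow epsilon-hop algorithm that LARS accelerates, is simple enough to sketch in pure Python (an illustrative toy; the data below is made up, with orthogonal, roughly standardized columns so the answer is easy to check):

```python
def forward_stagewise(X, y, eps=0.01, n_steps=2000):
    """Forward stagewise linear regression: repeatedly nudge the weight of
    the variable most correlated with the residual by eps.
    Assumes the columns of X are standardized and y is centered."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    for _ in range(n_steps):
        # Inner product of each column with the current residual.
        corr = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda k: abs(corr[k]))
        delta = eps if corr[j] > 0 else -eps
        beta[j] += delta
        for i in range(n):
            resid[i] -= delta * X[i][j]
    return beta

# Toy data whose least-squares solution is beta = [2, 0.5]:
X = [[1, 0], [-1, 0], [0, 1], [0, -1]]
y = [2, -2, 0.5, -0.5]
beta = forward_stagewise(X, y)
```

LARS recovers essentially the same path without the thousands of tiny updates, by jumping straight to the point where the next variable becomes equally correlated with the residual.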

So when should you use LARS, as opposed to some other regularization method like lasso? There’s not really a clear-cut answer, but LARS tends to give very similar results to both lasso and forward stagewise (in fact, slight modifications to LARS give you lasso and forward stagewise), so I tend to just use lasso when I do these kinds of things, since the justifications for lasso make a little more sense to me. In fact, I don’t usually even think of LARS as a model selection method in its own right, but rather as a way to efficiently implement lasso (especially if you want to compute the full regularization path).