# Hacker News Analysis

I was playing around with the Hacker News database Ronnie Roller made (thanks!), so I thought I’d post some of the things I looked at.

# Activity on the Site

My first question was how activity on the site has increased over time. I looked at number of posts, points on posts, comments on posts, and number of users.

## Posts

This looks like a strong linear fit, with an increase of 292 posts every month.

## Points

A quadratic regression was also a better fit for points by month:

## Users

And again for the number of distinct users with a submission:

My next question was how points and comments related. Intuitively, posts with more points should have more comments, but it’s nice to check (maybe really good posts are kind of boring, so don’t lead to much discussion).

First, I plotted the points and comments of each individual post:

As expected, there’s an overall positive correlation between points and comments. Interestingly, there are quite a few high-points posts with no comments.

The plot’s quite noisy, though, so let’s try cleaning it up a bit, by taking the median number of comments per points level (and removing posts at the higher end, where we have little data):

We see that posts with more points do tend to have more comments. Also, variance in number of comments is indicated by size and color, so (unsurprisingly) posts with more points have larger variance in their number of comments.

# Quality of Posts

Another question was whether the quality of posts has degraded over time.

First, I computed a normalized “score” for each post, where a post’s score is defined as the number of points divided by the number of distinct users who made a submission in the same month. (The denominator is a rough proxy for the number of active users, and the goal of the score is to provide a way to compare posts across time.)

While the median score has declined over time (as perhaps should be expected, since only a fixed number of items can reach the front page):

the absolute number of quality posts, defined as posts with a score greater than the (admittedly arbitrarily chosen) threshold 0.01, has increased (until possibly a dip starting in 2010):

(Of course, without some further analysis, it’s not clear how well this score measures quality of posts, so take these numbers with a grain of salt.)

# Company Trends

Also, I wanted to see how certain topics have trended over time, so I looked at how mentions of some of the big-name companies (Google, Facebook, Microsoft, Yahoo, Twitter, Apple) have changed. For each company, I plotted the percentage of posts with the company’s name in the title, and also made a smoothed plot comparing all six at the end. Note that Microsoft and Yahoo seem to be trending slightly downward, and Apple seems to be trending upward.

Founder at Surge AI. We're a team of engineers and researchers from Google, Facebook, Harvard, and MIT building a modern data labeling platform and workforce for NLP.

Need obsessively high-quality human-powered data fast? Reach out! We help top AI companies like OpenAI, Amazon, and Airbnb create powreful human-labeled datasets to train and measure their AI.