
Hacker News Dataset October 2016

Our latest project on Sizzle is a visualization of the Top 10k Posts of All Time on Hacker News. To create the visualization, we first needed to collect the data. I noticed that there was an old copy of the Hacker News dataset available on BigQuery, but I needed an up-to-date copy, so I looked into the Hacker News Firebase API.

The API lets you retrieve each item by ID. You can start by retrieving the current max ID, then walk backwards from there. (Items may be stories, comments, etc.; the same API endpoints serve all item types.)

There is no rate limit, so I created the following script, which generates a text file with 10MM lines containing all of the URIs to retrieve. (We will then feed this file into wget using xargs.) Note: 10MM items was ~5 years' worth of data.

Script to create the 10MM line file of URIs to retrieve: https://gist.github.com/aaronhoffman/1f753c660d7364bb594a36af350b227c

That script takes about 10 minutes to produce a file t