Saturday, January 14, 2017

Census.gov QuickFacts Data Set

To create a visualization for https://www.sizzleanalytics.com/ I had to compile census information from the census.gov QuickFacts tool. As far as I could tell, they don't provide an easy way to download all the data automatically, so I manually downloaded each state's data, then used a simple node script to merge the files together.
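
For reference, here's a minimal sketch of the kind of merge script I mean (not the exact one I used; the ./states directory and file names are assumptions). It keeps the header row from the first CSV and appends the data rows from the rest:

// merge.js - combine per-state QuickFacts CSV exports into one file
const fs = require('fs');
const path = require('path');

const dir = './states';
const files = fs.readdirSync(dir).filter(f => f.endsWith('.csv'));

const merged = [];
files.forEach((file, i) => {
  const lines = fs.readFileSync(path.join(dir, file), 'utf8').trim().split('\n');
  // Keep the header row from the first file only.
  merged.push(...(i === 0 ? lines : lines.slice(1)));
});

fs.writeFileSync('quickfacts-merged.csv', merged.join('\n'));
console.log('merged ' + files.length + ' files');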

Data set available here:
https://data.world/aaronhoffman/census-gov-state-quickfacts


Hope this helps,
Aaron


Thursday, January 12, 2017

Azure Functions and AWS Lambdas

I'm working on a new talk about Azure Functions and AWS Lambdas. This post compiles links to more information for people who attended the talk.


Presentation Slides: https://drive.google.com/open?id=0BwgLvVq0rcS7cmxrQUhCWUxTNTQ

Azure Function Intro Source Code: https://github.com/Stonefinch/AzureFunctionIntro

Azure Intro Web App Source Code: https://github.com/Stonefinch/AzureIntro

Azure Function Tools for Visual Studio: https://blogs.msdn.microsoft.com/webdev/2016/12/01/visual-studio-tools-for-azure-functions/

AWS Lambda C# in Visual Studio: https://aws.amazon.com/blogs/developer/using-the-aws-lambda-project-in-visual-studio/


I'll add more links and update the slides as I continue to build out the talk.


Hope this helps,
Aaron


Monday, October 24, 2016

Hacker News Dataset October 2016

Our latest project on Sizzle is a visualization of the Top 10k Posts of All Time on Hacker News.



To create the visualization, we first needed to collect the data.

I noticed there was an old copy of the Hacker News dataset available on BigQuery, but I needed an up-to-date copy, so I looked into the Hacker News Firebase API.

The API allows you to retrieve each item by ID: you can start by fetching the current max item ID, then walk backwards from there. (Items may be stories, comments, etc.; the same endpoint serves all item types.)
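
Here's a minimal sketch of that walk in node (the getJson helper is my own; the endpoints are the documented Firebase ones):

const https = require('https');

// Fetch a URL and parse the response body as JSON.
function getJson(url) {
  return new Promise((resolve, reject) => {
    https.get(url, res => {
      let body = '';
      res.on('data', chunk => (body += chunk));
      res.on('end', () => resolve(JSON.parse(body)));
    }).on('error', reject);
  });
}

async function main() {
  const maxId = await getJson('https://hacker-news.firebaseio.com/v0/maxitem.json');
  // Walk backwards from the max ID. The same endpoint serves
  // stories, comments, polls, etc.
  for (let id = maxId; id > maxId - 5; id--) {
    const item = await getJson('https://hacker-news.firebaseio.com/v0/item/' + id + '.json');
    console.log(id, item && item.type);
  }
}

main();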

There is no rate limit, so I created a script that generates a text file with 10MM lines containing all of the URIs to retrieve. (We'll then feed this file to wget using xargs.)

Note: 10MM items was roughly five years' worth of data.

Script to create the 10MM line file of URIs to retrieve:
https://gist.github.com/aaronhoffman/1f753c660d7364bb594a36af350b227c

That script takes about 10 minutes to produce a file that's around 560MB in size.
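
If you'd rather not click through, here's a sketch of the same idea in node (the max ID below is a stale placeholder; fetch the live value from /v0/maxitem.json first):

const fs = require('fs');

const maxId = 12700000; // placeholder - use the live /v0/maxitem.json value
const count = 10000000;
const min = maxId - count;
const out = fs.createWriteStream('hn-uri.txt');

let id = maxId;

// Respect the stream's backpressure so 10MM lines don't pile up in memory.
function writeBatch() {
  while (id > min) {
    const ok = out.write('https://hacker-news.firebaseio.com/v0/item/' + id + '.json\n');
    id--;
    if (!ok) {
      out.once('drain', writeBatch);
      return;
    }
  }
  out.end();
}

writeBatch();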

After the file is generated, you can feed it to wget using xargs to retrieve all the URIs.

ex:
cat hn-uri.txt | xargs -P 100 -n 100 wget --quiet

wget will save the result of each GET request to a separate file named {id}.json, taken from the last segment of each URI.

Caution: That command took just over 30 hours to complete on my MacBook. (It also killed Finder a couple of times, and I had to disable Spotlight indexing on the folder I was saving all the .json files to.)


I found it difficult to work with 10MM files in a single directory on my Mac, so I'll try to save you the trouble.

Here are a couple copies of the dataset I retrieved:

1. Here is a zip of the directory containing 10MM json files. (4GB)

2. Here is a SQL Server backup of a database containing a single table, with one record per json file. (2GB)
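
If you do pull the raw files yourself, one thing that helps is folding them into a single newline-delimited JSON file before doing anything else. A minimal sketch (the directory and file names are assumptions):

const fs = require('fs');

const dir = './hn-items';
const fd = fs.openSync('hn-items.ndjson', 'w');

// Note: listing and reading 10MM files synchronously is slow,
// but it only has to happen once.
for (const file of fs.readdirSync(dir)) {
  if (!file.endsWith('.json')) continue;
  const item = fs.readFileSync(dir + '/' + file, 'utf8').trim();
  // Skip deleted/missing items, which come back as the literal "null".
  if (item && item !== 'null') fs.writeSync(fd, item + '\n');
}
fs.closeSync(fd);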


Hope this helps!
Aaron