Google BigQuery and the Github Data Challenge

David Smith

GitHub has made data on its code repositories (developer updates, forks, and so on) from the public GitHub timeline available for analysis, and is offering prizes for the most interesting visualizations of the data. It sounds like a great challenge for R programmers! R is currently the 26th most popular language on GitHub (up from #29 in December), and it would be interesting, for example, to visualize the usage of R compared to other languages. The deadline for contest submissions is May 21.
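As an illustration of the kind of query a contest entry might start from, here's a minimal Python sketch that builds a BigQuery-style SQL string counting timeline events by repository language. The table and column names (`githubarchive:github.timeline`, `repository_language`) are assumptions about how the GitHub dataset is exposed, not details confirmed in this post.

```python
# Minimal sketch: compose a BigQuery-style SQL query that counts
# public-timeline events by repository language. The table name
# [githubarchive:github.timeline] and the repository_language column
# are assumptions about how the GitHub dataset is exposed.

def language_popularity_query(limit=30):
    """Return a SQL string ranking languages by event count."""
    return (
        "SELECT repository_language, COUNT(*) AS events "
        "FROM [githubarchive:github.timeline] "
        "WHERE repository_language IS NOT NULL "
        "GROUP BY repository_language "
        "ORDER BY events DESC "
        f"LIMIT {limit}"
    )

print(language_popularity_query(30))
```

The resulting string could be pasted into the BigQuery console (or sent via a client library), and the counts then pulled into R for visualization; that's one way to see where R sits relative to the 25 languages above it.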

Interestingly, GitHub has made this data available on the Google BigQuery service, which is available to the public today. BigQuery was free to use while it was in beta, but Google now charges for storage of the data: $0.12 per gigabyte per month, up to $240/month (the standard service is limited to 2TB of storage, although there is a Premier offering that supports larger data sizes ... at a price to be negotiated). While members of the public can run SQL-like queries on the GitHub data for free, Google charges subscribers to the service 3.5 cents per GB processed in a query. This is measured by the source data accessed (columns a query doesn't reference aren't counted); the size of the result set doesn't matter.
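A back-of-the-envelope sketch of those prices, assuming only the rates quoted above ($0.12 per GB per month of storage, 3.5 cents per GB of data processed) and the 2TB cap:

```python
# Back-of-the-envelope BigQuery costs, using only the rates quoted
# in the post: $0.12 per GB per month of storage (2 TB cap on the
# standard service) and $0.035 per GB of source data a query reads.

STORAGE_USD_PER_GB_MONTH = 0.12
QUERY_USD_PER_GB = 0.035
STORAGE_CAP_GB = 2000  # the 2 TB limit on the standard offering

def monthly_storage_cost(gb_stored):
    """Monthly storage bill for gb_stored gigabytes."""
    if gb_stored > STORAGE_CAP_GB:
        raise ValueError("standard BigQuery tops out at 2 TB")
    return gb_stored * STORAGE_USD_PER_GB_MONTH

def query_cost(gb_processed):
    """Cost of one query, billed on source data read, not result size."""
    return gb_processed * QUERY_USD_PER_GB

print(round(monthly_storage_cost(2000), 2))  # 240.0: the quoted $240/month cap
print(round(query_cost(100), 2))             # 3.5: a query scanning 100 GB
```

Storing the full 2TB thus hits the quoted $240/month maximum, and a query that scans 100GB of source columns costs $3.50 regardless of how small its result set is.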

With analysis limited to simple queries ("select" statements, by-row aggregations, and the like), it's hard to see how this will have a big impact on companies needing to do even moderately advanced analytics on their data. It may prove useful as a store for relatively static data (like the GitHub example above), but given that it takes 20 minutes to transfer 200GB of data into BigQuery, it doesn't seem well suited to frequently changing data. And unlike Amazon S3, Google charges for data transfer in and out of the cloud, whereas with Amazon you can at least transfer data into AWS for free. With Amazon AWS you also have a limitless range of AMIs (machine images) at your disposal, and can use advanced analytic tools like Revolution R Enterprise for predictive modeling and the like. If you're already storing your data in the cloud (as more companies are doing; Netflix is a great example), this makes Amazon the more compelling cloud platform for Big Data analytics. But if Google opens up more flexible data analysis options for BigQuery (or even connects it to the Google Prediction API), that might change.