
Big Data Generalized Linear Models with Revolution R Enterprise

By David Smith

R's glm function for generalized linear modeling is very powerful and flexible: it supports all of the standard model types (binomial/logistic, Gamma, Poisson, etc.), and in fact you can fit any distribution in the exponential family (with the family argument). But if you want to use it on a data set with millions of rows, and especially with more than a couple of dozen variables (or even just a few categorical variables with many levels), it becomes a big computational task: fitting time grows quickly as the data gets larger, and the fit can even exhaust the available memory.
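To make the role of the family argument concrete, here is a minimal sketch using base R only. The data are simulated purely for illustration (the variable names x, counts and success are made up), so the numbers mean nothing beyond showing that the same glm call fits different model types when the family changes.

```r
# Simulated data, for illustration only
set.seed(42)
n <- 500
x <- rnorm(n)

# Poisson response: counts whose log-mean is linear in x
counts <- rpois(n, lambda = exp(0.3 + 0.5 * x))
fit_pois <- glm(counts ~ x, family = poisson(link = "log"))

# Binomial/logistic response: 0/1 outcomes whose log-odds are linear in x
success <- rbinom(n, size = 1, prob = plogis(-0.2 + 1.0 * x))
fit_logit <- glm(success ~ x, family = binomial(link = "logit"))

# Estimated coefficients should land near the values used to simulate
print(coef(fit_pois))
print(coef(fit_logit))
```

Only the family argument (and the response) changes between the two fits; the formula interface and the returned object are the same.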

The rxGlm function included in the RevoScaleR package in Revolution R Enterprise 6 has the same capabilities as R's glm, but is designed to work with big data, and to speed up the computation using the power of multiple processors and nodes in a distributed grid. In the analysis of census data in the video below, fitting a Tweedie model on 5 million observations and 265 variables takes around 25 seconds on a laptop. A similar analysis, using 14 million observations on a 5-node Windows HPC Server cluster, takes just 20 seconds.
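As a rough sketch of what such a call might look like, the snippet below fits a Tweedie model with rxGlm on a small simulated data frame. It is not the census analysis from the video: the data, the variable names (age, sex, income) and the choice of var.power = 1.5 are all assumptions made for illustration. It also assumes that rxGlm takes a formula, a data set and a family much as glm does, and that the package's rxTweedie family accepts var.power and link.power arguments; check the RevoScaleR documentation for your version. The fit is skipped if RevoScaleR is not installed.

```r
# Simulated stand-in data (hypothetical; NOT the census data from the video)
set.seed(1)
n <- 1000
df <- data.frame(
  age = runif(n, 18, 80),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
mu <- exp(0.5 + 0.01 * df$age)
df$income <- rgamma(n, shape = 2, rate = 2 / mu)  # positive, continuous response

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  # Assumed usage: Tweedie family with a log link (link.power = 0)
  fit <- RevoScaleR::rxGlm(
    income ~ age + sex,
    data   = df,
    family = RevoScaleR::rxTweedie(var.power = 1.5, link.power = 0)
  )
  print(summary(fit))
} else {
  message("RevoScaleR is not installed; skipping the rxGlm fit.")
  fit <- NULL
}
```

A data frame is used here only to keep the sketch self-contained.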

 

This demonstration was part of last week's webinar on Revolution R Enterprise 6. If you're not familiar with Revolution R Enterprise, the first 10 minutes are an overview of the differences from open-source R, and the remaining 20 minutes describe the new features in version 6. Follow the link below to check out the replay.

Revolution Analytics webinars: 100% R and More: Plus What's New in Revolution R Enterprise 6.0