Skip to Content

Extracting Time Series from Large Data Sets

RevoJoe's picture

Introduction

Analyzing time series data of all sorts is a fundamental business analytics task to which the R language is beautifully suited. In addition to the time series functions built into base stats library there are dozens of R packages devoted to time series http://crantastic.org/task_views/TimeSeries. Some packages help with basic tasks such as creating date data types, others offer specialized functions for financial applications. When working with R the difficult part isn’t finding the right analytical tool; often, it’s getting the time series data to begin with. This is especially true when the time series need to be extracted from time stamped data embedded in very large data sets: data sets that are too large to be read into memory. In this example, we are going to use “data step” functions in Revolution Analytics’ RevoScaleR package to access a large data file, manipulate it, sort it, extract the data we need and aggregate records with monthly time stamps to form multiple, monthly time series. Then we will use ordinary R time series functions to do some basic analysis.
To illustrate this process we will use the airlines “edge” flight data set (77,242 KB) from infochimps.com http://www.infochimps.com/datasets/us-domestic-flights-from-1990-to-2009 which contains 3.5 million monthly domestic flight records from 1990 to 2009.

Short name 	Type 	Description
Origin 	        String 	Three letter airport code of the origin airport
Destination 	String 	Three letter airport code of the destination airport
Passengers 	Integer No. of passengers from origin to destination
Flights         Integer No. of flights between origin and destination
Fly Date        Integer The date (yyyymm) of flight

We are going to be building time series of the flights originating from particular airports, but process we will use is just as applicable to internet transaction data, web logs, product sales data and any other business data set where records are time stamped but not stored as time series. The information in multiple records must be aggregated to form the time series.

Getting Started

After downloading the data from infochimps at the link given above, we use the RevoScaleR function to read the text file into the special xdf binary format used by RevoScaleR functions:
[r]
###############################################################################################
# READ IN PRIMARY DATA FILE
FlightsText <- "C:/Users/Joseph/Documents/DATA/US Flights/flights_with_colnames.csv"
#
# Note we make month a character field so we can parse it into month and year later
rxTextToXdf(inFile = FlightsText, outFile = "Flights", overwrite = TRUE,
colClasses=c(origin_airport = "factor", destin_airport = "factor", month="character"))
#
rxGetInfoXdf("Flights",getVarInfo=TRUE)
rxGetInfoXdf(file="Flights",numRows=10,startRow=1)
[/r]

The final two lines of code above use different forms of the rxGetInfoXdf function to look at the data. The output of the first:

Number of observations: 3606803 
Number of variables: 5 
Number of blocks: 8 
Variable information: 
Var 1: origin_airport
       683 factor levels: MHK EUG MFR SEA PDX ... CRE BOK BIH MQJ LCI
Var 2: destin_airport
       708 factor levels: AMW RDM EKO WDG END ... COS HII PHD TBN OH1
Var 3: passengers, Type: integer, Low/High: (0, 89597)
Var 4: flights, Type: integer, Low/High: (0, 1128)
Var 5: month, Type: character

shows that there are 3.6 million records, 5 variables, 683 origin airports, 708 destination airports with one origin/destinatin pair of airports having a high of 1,128 flights for some month. The second variation of the command shows the first 10 rows of the file:

origin_airport destin_airport    passengers   flights  month
1             MHK            AMW         21       1 200810
2             EUG            RDM         41      22 199011
3             EUG            RDM         88      19 199012
4             EUG            RDM         11       4 199010
5             MFR            RDM          0       1 199002
6             MFR            RDM         11       1 199003
7             MFR            RDM          2       4 199001
8             MFR            RDM          7       1 199009
9             MFR            RDM          7       2 199011
10            SEA            RDM          8       1 199002

Exploring the Data

The next bit of code sorts the file by flights to find the origin / destination pairs having the most monthly flights and picks out the two top origin airports having the most flights; SFO and LAX
[r]
# DATA EXPLORATION
# Find airports with most flights in a given month
rxSort(inData="Flights", outFile = "sortFlights", sortByVars="flights",
decreasing = TRUE,overwrite=TRUE)

rxGetInfoXdf(file="sortFlights")
mostflights5 <- rxGetInfoXdf(file="sortFlights",numRows=5,startRow=1)
mostflights5
top5f <- as.data.frame(mostflights5[[5]])
topOA <- unique(as.vector(top5f$origin_airport))
# Select the top 2
top2 <- topOA[1:2]
top2
[/r]

The variable “mostflights” contains the first five lines of the sorted file:

> mostflights5
File name: sortFlights.xdf 
Number of observations: 3606803 
Number of variables: 5 
Number of blocks: 8 
Data (5 rows starting with row 1):
  origin_airport destin_airport passengers flights  month
1            SFO            LAX      83153    1128 199412
2            LAX            SFO      80450    1126 199412
3            HNL            OGG      73014    1058 199408
4            OGG            HNL      77011    1056 199408
5            OGG            HNL      63020    1044 199412

Next, we use the RevoScaleR function rxDataStep to build a new file “mostFlights” containing only those flights that originate in either SFO or LAX.

[r]
# Build a file with only SFO and LAX flights
# Be careful to keep the level straight
xform <- function(data)
{
data$origin <- factor(data$origin_airport, levels=top2,labels = top2)
return(data)
}
rxDataStepXdf(inFile="Flights", outFile="mostFlights",
rowSelection = origin_airport %in% top2,
transformFunc = xform,
userObjects = list(top2=top2),
overwrite = TRUE)

rxGetInfoXdf("mostFlights",numRows=10,startRow=1)
rxGetInfoXdf("mostFlights",getVarInfo=TRUE)
rxHistogram(~flights|origin, data="mostFlights")
#
[/r]

The transformation function, xform, used in rxDataStep creates a new variable, origin, with only two levels (“SFO” and “LAX”) to hold the information on origin airports. The last line of code in this section produces the following histogram of monthly flights

Preparing the Data for Time Series Work

Next, we get set up for the time series by breaking out the month variable (which we originally imported as character data) into a month and year component.

[r]
# Build a transform function to separate out the month year data
xfunc <- function(data){
data$Month <- as.integer(substring(data$month,5,6))
data$Year <- as.integer(substring(data$month,1,4))
return(data)
}
# Add a new variable for time series work
rxDataStepXdf(inFile="mostFlights", outFile = "SFO_LAX",
overwrite = TRUE, transformVars="month",transformFunc = xfunc)
(file="SFO_LAX",numRows=10,startRow=1)
#
rxDataStepXdf(inFile="SFO_LAX", outFile = "SFO.LAX",
varsToDrop=c("origin_airport","month"),
overwrite=TRUE)

rxGetInfoXdf(file="SFO.LAX",numRows=10,startRow=1)
#
[/r]

The transformation function, xfunc, used in rxDataStepXdf uses ordinary R string handling functions to break apart the month data. A second data step function drops the unnecessary variables from our final file: SFO.LAX.

Working with Time Series

At last, we begin building the time series. The function rxCube counts the number of flights in each combination of Year, Month and origin airport.

[r]
#Aggregate data into monthly flights time series
t1 <-rxCube(flights ~ F(Year):F(Month):origin, removeZeroCounts=TRUE,data = "SFO.LAX")
t1 <- as.data.frame(t1)
head(t1)
[/r]

> head(t1)
  F_Year F_Month origin  flights Counts
1   1990       1    SFO 39.04225    284
2   1991       1    SFO 38.42034    295
3   1992       1    SFO 46.23954    263
4   1993       1    SFO 44.39464    261
5   1994       1    SFO 36.15417    240
6   1995       1    SFO 45.76768    198

From this, we see that there were 284 records where the originating airport was SFO for the first month of 1990. The average number of flights among these 284 counts was 39.04225. From this information, we can calculate the total number of flights for each month. The next bit of code does this and forms the time information into a proper date. Note that we have reduced the data sufficiently so that we are now working with a data frame, t1.

[r]
# Compute total flights out and combine month and dat into a date
t1$flights_out<- t1$flights*t1$Counts
names(t1) <- c("Year","Month","origin","avg.flights.per.destin","total.destin","flights.out")
t1$Date <- as.Date(as.character(paste(t1$Month,"- 28 -",t1$Year)),"%m - %d - %Y")
head(t1)
[/r]

> head(t1)
  Year Month origin avg.flights.per.destin total.destin flights.out       Date
1 1990     1    SFO               39.04225          284       11088 1990-01-28
2 1991     1    SFO               38.42034          295       11334 1991-01-28
3 1992     1    SFO               46.23954          263       12161 1992-01-28
4 1993     1    SFO               44.39464          261       11587 1993-01-28
5 1994     1    SFO               36.15417          240        8677 1994-01-28
6 1995     1    SFO               45.76768          198        9062 1995-01-28

Now, we extract out the SFO data, sort it to form a time series and plot it.
[r]
# Select SFO entries and plot
SFO.t1 <- t1[t1$origin=="SFO",]
head(SFO.t1)
SFO.t1 <- SFO.t1[order(SFO.t1$Date),]
# Make x axis a date and plot
x <-SFO.t1$Date
y <-SFO.t1$flights.out
library(ggplot2) # make sure ggplot is package is loaded
qplot(x,y, geom="line",xlab="", ylab="Number of Flights\n",main="Monthly Flights Out of SFO")
[/r]

And, just to do a little real time series work we use the R function, ts, to form the data into a time series object, and use the function stl to perform a seasonal decomposition which we plot.
[r]
# Try a seasonal decomposition
SFO.ts <- ts(y,start=x[1],freq=12)
sd.SFO <- stl(SFO.ts,s.window="periodic")
plot(sd.SFO)
[/r]

Finally, we repeat this process for the LAX data

[r]
LAX.t1 <- t1[t1$origin=="LAX",]
LAX.t1 <- LAX.t1[order(LAX.t1$Date),]
# Make x axis a date and plot
x <-LAX.t1$Date
y <-LAX.t1$flights.out
qplot(x,y, geom="line",xlab="", ylab="Number of Flights\n",main="Monthly Flights Out of LAX")
# Try a seasonal decomposition
LAX.ts <- ts(y,start=x[1],freq=12)
sd.LAX <- stl(LAX.ts,s.window="periodic")
plot(sd.LAX)
[/r]

Summary

We have shown how data manipulation functions of the RevoScaleR package to extract time stamped data from a large data file, aggregate it, and form it into monthly time series that can easily be analyzed with standard R functions.

Comments

David Smith's picture

As a Revolution Analytics employee, Joe's not eligible to win prizes in the Applications of R contest, but this is a great example of a submission that would get the attention of the judges.

lowell.it1's picture

Our flagship link building service – Paint It White. Links from manually created beautiful web 2.0 posts on premium blogs. Supported by an array of tier 2 and 3 links for maximum ranking increases
link building service

lowell.it1's picture

Our flagship link building service – Paint It White. Links from manually created beautiful web 2.0 posts on premium blogs. Supported by an array of tier 2 and 3 links for maximum ranking increases
link building service