Creating a large covariance matrix

I need to create ~110 covariance matrices of doubles, each 19347 x 19347, and then add them all together.

This in itself isn't very difficult, and for smaller matrices the following code works fine.

covmat <- matrix(0, ncol=19347, nrow=19347)
files <- list.files("path/to/folder/")
for(name in files){
  # one vector of 19347 values per file
  text <- readLines(paste("path/to/folder/", name, sep=""), n=19347, encoding="UTF-8")
  for(i in 1:19347){
    for(k in 1:19347){
      # accumulate the outer product of the file's vector with itself
      covmat[i, k] <- covmat[i, k] + (as.numeric(text[i]) * as.numeric(text[k]))
    }
  }
}

To save memory I don't store each individual matrix; instead I add them into the running total as the loop works through each file.
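(For scale: a single 19347 x 19347 matrix of doubles is about 19347^2 * 8 bytes ≈ 3 GB, so holding all ~110 of them in memory at once would need on the order of 300 GB.)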

The problem is that when I run it on the real data it takes far too long. There isn't actually that much data, but I think it is a CPU and memory intensive job; even after running for ~10 hours it hadn't computed a result.
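(Put another way, the nested loops perform 19347^2 ≈ 3.7 x 10^8 interpreted iterations per file, or roughly 4 x 10^10 across ~110 files, which I assume is where the time goes.)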

I have looked into using MapReduce (AWS EMR), but I've come to the conclusion that this isn't really a MapReduce problem, as it isn't a big-data problem. However, here is the code for the mapper and reducer I have been playing with, in case I have just been doing it wrong.

#Mapper
# toy-sized test: read a 5-element vector from stdin and emit its outer product
text <- readLines("stdin", n=5, encoding="UTF-8")
covmat <- matrix(0, ncol=5, nrow=5)

for(i in 1:5){
  for(k in 1:5){
    covmat[i, k] <- (as.numeric(text[i]) * as.numeric(text[k]))
  }
}

cat(covmat)
 
#Reducer
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))
final <- matrix(0, ncol=19347, nrow=19347)
## **** could do this with a single readLines call or in blocks
con <- file("stdin", open = "r")
# each input line is one flattened matrix from a mapper; accumulate them all
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
 
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    final <- final + matrix(as.numeric(words), ncol=19347, nrow=19347)
}
close(con)
cat(final)

Can anyone suggest how to solve this problem?

Thanks in advance

EDIT

Thanks to the great help from some of the commenters below I have revised the code so it is much more efficient.

files <- list.files("path/to/file/")
covmat <- matrix(0, ncol=19347, nrow=19347)
for(name in files){
  invec <- scan(paste("path/to/file/", name, sep=""))
  covmat <- covmat + outer(invec, invec, "*")
}
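(Note that each outer(invec, invec) call allocates a fresh 19347 x 19347 matrix, i.e. roughly another 3 GB, before it is added to covmat.)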

Here is an example of a file I am trying to process.

1       0.00114582882882883
2      -0.00792611711711709
...                     ...
19346  -0.00089507207207207
19347  -0.00704709909909909
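
Each line is a row index followed by a value, separated by whitespace. For illustration only, here is a rough sketch of pulling out just the second column with scan(), assuming that column holds the 19347 values the loop needs (the file name below is just a placeholder):

cols <- scan("path/to/file/example.txt",
             what = list(index = integer(), value = numeric()),
             quiet = TRUE)
invec <- cols$value   # only the second column, i.e. the 19347 values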

Running the program still takes ~10 minutes per file. Does anyone have any advice on how this can be sped up?

I have 8 GB of RAM; while the program runs, R uses only about 4.5 GB of it, with a small amount left free.

I am running Mac OS X Snow Leopard and 64-bit R v2.15.