# Speed up a sampling function [migrated]

I have the following dataset:

```r
kkk <- data.frame(days = 1:100, positive = rbinom(100, 1, 0.05))
```

Over the monitoring period of 100 days, `kkk$positive == 1` for a day on which an event occurred, and `kkk$positive == 0` otherwise.
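Since `rbinom()` is stochastic, the dataset differs between runs; a small sketch with a fixed seed (the seed value is my assumption, purely for reproducibility) makes the example repeatable:

```r
set.seed(1)  # hypothetical seed, only so the example is reproducible
kkk <- data.frame(days = 1:100, positive = rbinom(100, 1, 0.05))
# Each row is one monitoring day; positive flags whether an event occurred
table(kkk$positive)
```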

The following function draws `k = s.freq` (1 to 12) samples, each consisting of `j = s.length` (1 to 30) consecutive elements, from the sample space and checks whether at least one of the samples recorded an event. The sample space consists of all possible runs of `j` consecutive values of `kkk$days`, so its elements overlap. Before each of the `k` draws the sample space is re-evaluated: elements that share days with the element just sampled are removed before the next draw (for the total of `k` samplings).
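The sample-space construction described above (all overlapping windows of `j` consecutive days) can be sketched with base R's `embed()`; the toy sizes here are mine, not from the question:

```r
j <- 3         # window length (s.length)
days <- 1:10   # toy monitoring period
# embed() returns the windows with columns in reverse order, so flip them:
ss <- embed(days, j)[, j:1]
# Rows are the overlapping elements of the sample space:
# (1,2,3), (2,3,4), (3,4,5), ..., (8,9,10)
ss
```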

```r
my.sampling <- function(days, status, s.length, s.freq, iter) {
  start <- Sys.time()
  # Build the data frame
  ddd <- data.frame(days = days, positive = status)
  # Length of each sample
  j <- s.length
  # Build the sample space: each row of monitor.ss is one element
  n.row <- nrow(ddd) - j + 1
  monitor.ss <- matrix(nrow = n.row, ncol = j + 1)
  for (i in 1:j) {
    monitor.ss[, i] <- i:(n.row - 1 + i)
  }
  rm(i)
  # Record for each element of the sample space whether at least one event
  # was observed in it
  monitor.ss[, j + 1] <- apply(monitor.ss, 1, function(x) {
    r.low  <- range(x, na.rm = TRUE)[1]
    r.high <- range(x, na.rm = TRUE)[2]
    as.numeric(sum(ddd$positive[r.low:r.high]) > 0)
  })
  suc.list <- rep(NA, iter)  # preallocate a vector to keep track of successes
  # Now the sampling takes place
  for (i in 1:iter) {
    sampling.start <- data.frame(monitor.ss)  # reset the sample space
    k <- s.freq  # number of samples drawn from the sample space
    k.t <- 0     # counter that breaks the while loop
    suc <- 0
    # Continue until k samples have been drawn (k.t reaches s.freq) or the
    # reduction below leaves no elements in the sample space
    while (k.t < k && nrow(sampling.start) > 0) {
      # Draw one element from the sample space
      s <- sampling.start[sample(nrow(sampling.start), 1), ]
      # Add up whether a success occurred (at least one event in the sample)
      suc <- suc + s[1, j + 1]
      # Sample-space reduction: elements that intersect the element just
      # drawn are removed before the next draw
      sampling.start <- sampling.start[
        apply(sampling.start, 1, function(x) !any(x %in% as.numeric(s[, 1:j]))), ]
      k.t <- k.t + 1
    }
    suc.list[i] <- suc / k.t
  }
  print(Sys.time() - start)
  return(suc.list)
}
```
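Not part of the question, but to show where the time goes: since every element of the sample space is a run of `j` consecutive days, it can be represented by its start index alone. One can then precompute the per-window event indicator with a cumulative sum and track the shrinking sample space as a logical vector instead of subsetting a data frame each draw. A sketch of that idea (function name and details are mine, offered as an assumption, not a verified drop-in replacement):

```r
fast.sampling <- function(days, status, s.length, s.freq, iter) {
  n <- length(days)
  j <- s.length
  n.win <- n - j + 1  # number of windows = elements of the sample space
  # Rolling sum via cumsum: success[w] == 1 iff window w (days w..w+j-1)
  # contains at least one event
  cs <- cumsum(c(0, status))
  success <- as.integer((cs[(j + 1):(n + 1)] - cs[1:n.win]) > 0)
  suc.list <- numeric(iter)
  for (i in 1:iter) {
    alive <- rep(TRUE, n.win)  # which windows are still in the sample space
    k.t <- 0
    suc <- 0
    while (k.t < s.freq && any(alive)) {
      idx <- which(alive)
      w <- idx[sample.int(length(idx), 1)]  # draw one window by start index
      suc <- suc + success[w]
      # Windows v and w overlap iff |v - w| < j: drop all of them
      alive[max(1, w - j + 1):min(n.win, w + j - 1)] <- FALSE
      k.t <- k.t + 1
    }
    suc.list[i] <- suc / k.t
  }
  suc.list
}
```

`sample.int(length(idx), 1)` is used instead of `sample(idx, 1)` because `sample()` silently expands a scalar `x` into `1:x` when only one candidate window remains.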

For example:

```r
my.sampling(kkk$days, kkk$positive, s.length = 3, s.freq = 5, iter = 300)
```

My problem is that the function takes about 100 ms per iteration, and I need to run it about 100,000 times with at least 1,000 iterations each time. I have preallocated the vectors and used `apply` to speed things up, but it is still too slow.

Parallel processing is applicable to my problem, since I can split the data that the function runs on across multiple cores; however, if my calculations are right, I would still need about 120 days of processing per CPU core.
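For the core-splitting mentioned above, base R's `parallel` package can distribute independent runs; `mclapply` forks on Unix-alikes, while Windows needs `mc.cores = 1` or a `parLapply` cluster instead. A minimal sketch, with a toy workload standing in for the real function (names and sizes are my assumptions):

```r
library(parallel)

# Toy stand-in for one call of the expensive function (assumption)
one.run <- function(seed) {
  set.seed(seed)
  mean(rbinom(1000, 1, 0.05))
}

# Forking is unavailable on Windows, so fall back to one core there
n.cores <- if (.Platform$OS.type == "windows") 1 else max(1, detectCores() - 1)

# Each seed is an independent run; results come back as a list
res <- mclapply(1:8, one.run, mc.cores = n.cores)
length(res)
```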

Is it possible to reduce the run time of the function? I have tried `compiler::cmpfun` but without much improvement. Otherwise I will have to resort to some kind of AWS EC2 solution.

Thanks a lot