Speed up a sampling function [migrated]

I have the following dataset

kkk<-data.frame(days=1:100,positive=rbinom(100,1,0.05)) 

Over the monitoring period of 100 days, kkk$positive == 1 for a given day if an event occurred on that day, and kkk$positive == 0 otherwise.

The following function draws k = s.freq (1:12) samples, each consisting of j = s.length (1:30) consecutive elements, from the sample space and checks whether at least one of the samples recorded an event. The sample space consists of all possible windows of j consecutive elements of kkk$days, so its elements overlap. Before each of the k draws, the sample space is re-evaluated: elements that share any day with the element just drawn are removed before the next draw (for a total of k draws).
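For concreteness, with j = 3 the sample space is all overlapping windows {1,2,3}, {2,3,4}, ..., and drawing one window removes every window that shares a day with it. A small standalone sketch (shortened to 10 days for readability):

```r
j <- 3
days <- 1:10                                    # shortened monitoring period
windows <- t(sapply(1:(length(days) - j + 1),   # one row per window of j days
                    function(i) i:(i + j - 1)))
drawn <- windows[2, ]                           # suppose we draw days 2:4
# keep only windows sharing no day with the drawn one
remaining <- windows[apply(windows, 1, function(x) !any(x %in% drawn)), ]
# only the windows starting at days 5:8 survive
```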

my.sampling <- function(days, status, s.length, s.freq, iter){

    start <- Sys.time()

    # Build the data frame
    ddd <- data.frame(days = days, positive = status)

    # Length of each sample (window)
    j <- s.length

    # Build the sample space: each row of monitor.ss is one window of j
    # consecutive days, plus one extra column for the event indicator
    n.row <- nrow(ddd) - j + 1
    monitor.ss <- matrix(nrow = n.row, ncol = j + 1)
    for (i in 1:j) {
        monitor.ss[, i] <- i:(n.row - 1 + i)
    }

    # Column j+1 records whether at least one event was observed within
    # each window of the sample space
    monitor.ss[, j + 1] <- apply(monitor.ss, 1, function(x) {
        as.numeric(sum(ddd$positive[x[1]:x[j]]) > 0)
    })

    suc.list <- rep(NA, iter) # Preallocate a vector to keep track of successes

    # Now the sampling takes place

    for (i in 1:iter) { # for each iteration

        # Rebuild the initial sample space as a data frame
        sampling.start <- data.frame(monitor.ss)
        k <- s.freq  # number of samples drawn from the sample space
        k.t <- 0     # counter to break the while loop
        suc <- 0

        # Stop once k = s.freq samples have been drawn, or once the sample-space
        # reduction (below) leaves no elements in the sample space
        while (k.t < k && nrow(sampling.start) > 0) {

            # Draw one window at random
            s <- sampling.start[sample(nrow(sampling.start), 1), ]
            # Add 1 if at least one event was observed in that window
            suc <- suc + s[1, j + 1]
            # Sample-space reduction: windows that share any day with the
            # window just drawn are removed before the next draw (the
            # indicator column is excluded from the comparison)
            sampling.start <- sampling.start[apply(sampling.start, 1,
                function(x) !any(x[1:j] %in% as.numeric(s[1, 1:j]))), ]
            k.t <- k.t + 1
        }

        suc.list[i] <- suc / k.t

    }
    print(Sys.time() - start)
    # hist(suc.list, breaks=sqrt(iter))
    return(suc.list)
}

For example:

my.sampling(kkk$days, kkk$positive, s.length=3, s.freq=5, iter=300)
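For reference, here is an untested sketch of how the same procedure could be expressed without the row-wise apply() calls, which dominate the run time. The idea (my.sampling.fast and win.pos are illustrative names, not part of the original code): each window is identified by its start day alone, window success is precomputed once from a cumulative sum of the event indicator, and two windows of length j overlap exactly when their start days differ by less than j, so the sample-space reduction becomes simple vector arithmetic.

```r
my.sampling.fast <- function(days, status, s.length, s.freq, iter) {
    j <- s.length
    n.win <- length(days) - j + 1
    cs <- cumsum(status)
    # window starting at i spans days i:(i+j-1); it contains an event
    # iff cs[i+j-1] - cs[i-1] > 0 (with cs[0] taken as 0)
    win.pos <- (cs[seq_len(n.win) + j - 1] - c(0, cs)[seq_len(n.win)]) > 0
    suc.list <- numeric(iter)
    for (i in seq_len(iter)) {
        starts <- seq_len(n.win)  # current sample space: window start days
        suc <- 0
        k.t <- 0
        while (k.t < s.freq && length(starts) > 0) {
            s <- starts[sample.int(length(starts), 1)]  # draw one window
            suc <- suc + win.pos[s]
            # drop every window overlapping the drawn one (incl. itself)
            starts <- starts[abs(starts - s) >= j]
            k.t <- k.t + 1
        }
        suc.list[i] <- suc / k.t
    }
    suc.list
}
```

This avoids building and filtering a data frame on every draw; whether it reproduces your function exactly is worth checking against a few seeds before relying on it.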

My problem is that the function takes about 100 ms per iteration, and I need to run it about 100,000 times with at least 1,000 iterations each time. I have used preallocation of the vectors and apply() to speed things up, but it is still too slow.

Parallel processing is applicable to my problem, since I can split the data the function has to run on across multiple cores; however, if my calculations are right, I would still need about 120 days of processing per CPU core.

Is it possible to reduce the run time of the function? I have tried compiler::cmpfun but without much improvement. Otherwise I will have to resort to some kind of AWS EC2 solution.
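In case it helps frame the question: the way I imagine splitting the work across cores is something like the following, using the base parallel package (run.parallel is a hypothetical helper of my own, and mclapply forks, so on Windows parLapply with a cluster would be needed instead):

```r
library(parallel)

# Hypothetical helper: run 'iter' total iterations of a simulation
# function, split evenly across cores, and concatenate the results.
run.parallel <- function(sim.fun, iter, n.cores = max(1, detectCores() - 1)) {
    chunks <- rep(ceiling(iter / n.cores), n.cores)  # iterations per core
    unlist(mclapply(chunks, function(n) sim.fun(n), mc.cores = n.cores))
}

# e.g. run.parallel(function(n)
#          my.sampling(kkk$days, kkk$positive, s.length = 3,
#                      s.freq = 5, iter = n),
#      1000)
```

Even so, as noted above, this only divides the 120 days per core by the number of cores, so it does not replace making the function itself faster.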

Thanks a lot