These functions parallelize several segmentation algorithms or (for HaarSeg) make their calling use the same conventions as for other methods.
pSegmentDNAcopy(x, chrom.numeric, parall = "arr", ...) pSegmentACE(x, chrom.numeric, parall = "auto", ...) pSegmentHMM(x, chrom.numeric, parall = "auto",...) pSegmentGLAD(x, chrom.numeric, ...) pSegmentBioHMM(x, chrom.numeric, Pos, parall = "auto", ...) pSegmentCGHseg(x, chrom.numeric, CGHseg.thres = -0.05, ...) pSegmentPSW(x, chrom.numeric, common.data, sign = -1, nIter = 1000, prec = 100, p.crit = 0.10, name = NULL, ...) pSegmentWavelets(x, chrom.numeric, mergeSegs = TRUE, minDiff = 0.25, minMergeDiff = 0.05, thrLvl = 3, initClusterLevels = 10, ...) pSegmentHaarSeg(x, chrom.numeric, HaarSeg.m = 3, W = vector(), rawI = vector(), breaksFdrQ = 0.001, haarStartLevel = 1, haarEndLevel = 5, ...)
- The aCGH data; genes in rows, subjects or arrays in columns. Data should be ordered by chromosome and position within chromosome.
- The (numeric) vector with the chromosome indicator
- The location (e.g., position in kbases) of each probe in the chromosome. A vector of the same length as the number of rows in
- Where available whether to parallelize over arrays (all except ACE) or chromosomes (ACE) or array by chromosome (ACE, DNAcopy, HMM, BioHMM.
- The threshold for the adaptive penalization in Picard et al.'s CGHseg. See p. 13 of the original paper. Must be a negative number. The default value used in the original reference is -0.5. However, our experience with the simulated data in Willenbrock and Fridlyand (2005) indicates that for those data values around -0.005 are more appropriate. We use here -0.05 as default.
- Merge (genome-wide) the resulting segments using
mergeLevels?. See details.
- A data frame with columns that contain the common information about the clones/genes. For now, this is limited to five columns of data. We do nothing with this, except it is part of the output. It is common practice that this data frame contains columns ID, Chrosome (need not be numeric, can be a factor), Start (the initial position of a clone or gene), End (the final position of a clone or gene) and MidPoint (often, the mid point or whatever you regard as appropriate representative of the "center" of the clone). But you can use something else.
- The largest p-value for which we want a region to be shown, in red, in the plot.
- Sample name or ID
- Minimum (absolute) difference between the medians of two adjacent clusters for them to be considered truly different. Clusters "closer" together than this are collapsed together to form a single cluster.
- Used only when doing merging in the wavelet method. The finall call as to which segments go together is done by a
mergeLevelsapproach, but an initial collapsing of very close values is performed (otherwise, we could end up passing to mergeLevels as many initial levels as there are points).
- The level used for the wavelet thresholding in Hsu et al.
- The initial number of clusters to form.
- For HaarSeg the value such that all segments where abs(smoothed value) > m*MAD will be declared aberrant ---see p. i141 of Ben-Yaacov and Eldar
- For HaarSeg: Weight matrix, corresponding to quality of measurment. Insert 1/(sigma**2) as weights if your platform output sigma as the quality of measurment. W must have the same size as I.
- For HaarSeg. Mininum of the raw red and raw green measurment, before the log. rawI is used for the non-stationary variance compensation. rawI must have the same size as I.
- For HaarSeg. The FDR q parameter. Common used values are 0.05, 0.01, 0.001. Default value is 0.001.
- For HaarSeg. The detail subband from which we start to detect peaks. The higher this value is, the less sensitive we are to short segments. The default is value is 1, corresponding to segments of 2 probes.
- For HaarSeg. The detail subband until which we use to detect peaks. The higher this value is, the more sensitive we re to large trends in the data. This value DOES NOT indicate the largest possible segment that can be detected. The default is value is 5, corresponding to step of 32 probes in each direction.
- Additional arguments; not used.
In most cases, these are wrappers to the original code, with modifications for parallelization. For CGHseg, BioHMM and HMM the results are merged, by the mergeLevels algorithm, as recommended in Willenbrock and Fridlyand, 2005. Merging is also done in GLAD (with GLAD's own merging algorithm). Merging is the default for DNAcopy (but this can be set by the user). Merging can also be used with the wavelet-based method of Hsu et al.; please note that the later is an experimental feature implemented by us, and there is no evidence of its performance.
Using the output of mergeLevels, we then map the results to states of "Alteration", so that we categorize each probe as taking one, and only one, of three possible values, -1 (loss of genomic DNA), 0 (no change in DNA content), +1 (gain of genomic DNA). We have made the assumption, in this mapping, that the "no change" class is the one that has the absolute value closest to zero, and any other classes are either gains or losses. When the data are normalized, the "no change" class should be the most common one. For HaarSeg, we use the approach in page i141 of Ben-Yaacov and Eldar, section 2.3, "Determining aberrant intervals": a MAD (per their definition) is computed and any segment with absolute value larger than m * MAD is considered aberrant.
For PSW, the method implemented here is a small departure from the one described originally in Price et al.: here we use all the data for a subject/array for the thresholding, but subsequent analyses (finding islands and robustness analysis and permutation tests) are carried out chromosome by chromosome. This should allow to detect cases where a complete chromosome is gained or lost, as well as cases where smaller regions are gained or lost. Note that this approach is probably not advisable for the X chromosome, at least for males. You probably want to analyze the X separately, since you only expect one copy of the genomic DNA in males. Finally, the function for calculating the threshold (the default used by T. Price originally, which we also use ---median + 0.2 * MAD---) was validated by Price and collaborators on a 75-probe dataset that spanned the telomeric 2MB of chromosome 16p, using subjects with well-characterized deletions in the region and healthy controls. But it has not been validated with whole-genome chips or other DNA sources. (If you want to play around with the code, the sources are available from http://www.well.ox.ac.uk/~tprice/cgh/ and as the R package
cgh). (I thank Tom Price for his discussion on these issues; some sentences here are copied almost verbatim from his emails.)
Beware that, with DNAcopy, parallelization over arrays by chromosome can lead to slightly different results than parallelization over arrays (and, thus, to the original function). This is because of the way the smoothing is carried out, which affects the timmed.SD used.
All the functions return an object of class "adacgh.generic.out". In addition, depending on the function, the returned object can also be of class "CGH.wave.merged","CGH.wave", "mergedHMM", "adacghGLAD", "mergedBioHMM", "CGHseg", "CGH.PSW".
For methods DNAcopy, HMM, BioHMM, GLAD, CGHseg, and wavelet smoothing the output object is a list with two components:
segm component is a list, with as many components as subjects or arrays. Each of these lists is a matrix with three columns. The first are the observed data, the second the predicted/smoothed/merged data, and the third the state/alteration. When the method performs a merging step, we call the third column "Alteration", and it can only take three values, -1 (loss of genomic DNA), 0, +1 (gain of genomic DNA).
Method PSW's output contains two components. The second,
plotData with data that is used for plotting (and the details are not docummented, since you are not supposed to use them, as this is subject to change). The first component,
Data, is a list with as many components as arrays or subjects. Each of this is an array with columns with original data, the sign, the value from the robustness test, and the p-value from the permutation test. See details in
Method ACE's output is a list. None of the elements of this list are to be used directly by a user, and are later used by the function
- The output from the segmentation
- The same value as chrom.numeric in the function call.
Fridlyand, Jane and Snijders, Antoine M. and Pinkel, Dan and Albertson, Donna G. (2004). Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis, 90: 132--153. Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P. (2005) Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics, 6:211-26.
Hupe, P. and Stransky, N. and Thiery, J. P. and Radvanyi, F. and Barillot, E. (2004). Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20: 3413--3422.
Lingjaerde OC, Baumbusch LO, Liestol K, Glad I, Borresen-Dale AL. (2005). CGH-Explorer: a program for analysis of CGH-data. Bioinformatics, 21: 821--822.
Marioni, J. C. and Thorne, N. P. and Tavare, S. (2006). BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics, 22: 1144--1146.
Olshen, A. B. and Venkatraman, E. S. and Lucito, R. and Wigler, M. (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 4, 557--572. http://www.mskcc.org/biostat/~olshena/research.
Picard, F. and Robin, S. and Lavielle, M. and Vaisse, C. and Daudin, J. J. (2005). A statistical approach for array CGH data analysis. BMC Bioinformatics, 6, 27. http://dx.doi.org/10.1186/1471-2105-6-27.
Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJ. (2005) SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Res., 33:3455-64.
Willenbrock, H. and Fridlyand, J. (2005). A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics, 21, 4084--4091.
Diaz-Uriarte, R. and Rueda, O.M. (2006). ADaCGH: an R package and web-based application for the analysis of aCGH data. Tech. report. http://www.ligarto.org/rdiaz/Papers/adacgh.pdf, http://adacgh.bioinfo.cnio.es.
Ben-Yaacov, E. and Eldar, Y.C. (2008). A Fast and Flexible Method for the Segmentation of aCGH Data, Bioinformatics, 24: i139-i145.
## Not run: ## all slaves need a common dir to read and write. dir.to.create <- tempdir2() setwd(dir.to.create) dir.to.create mpiInit() ## this is used in SegmentPlotWrite, which is run first. ## End(Not run) ## library(Rmpi) ## try(mpi.spawn.Rslaves(nslaves = 1), silent = TRUE) data(cghE1) tmpchr <- sub("chr", "", cghE1$Chromosome) chrom.numeric <- as.numeric(as.character(tmpchr)) chrom.numeric[tmpchr == "X"] <- 23 chrom.numeric[tmpchr == "Y"] <- 24 rm(tmpchr) ### we need the data ordered reorder <- order(chrom.numeric, cghE1$UG.Start, cghE1$UG.End, cghE1$Name) cghE1 <- cghE1[reorder, ] chrom.numeric <- chrom.numeric[reorder] ## run all methods cbs.out <- pSegmentDNAcopy(cghE1[, 5:7], chrom.numeric) cbs.ns.out <- pSegmentDNAcopy(cghE1[, 5:7], chrom.numeric, smooth = FALSE) glad.out <- pSegmentGLAD(cghE1[, 5:7], chrom.numeric) cghseg.out <- pSegmentCGHseg(cghE1[, 5:7], chrom.numeric, CGHseg.thres = -0.05) ace.out <- pSegmentACE(cghE1[, 5:7], chrom.numeric) hmm.out <- pSegmentHMM(cghE1[, 5:7], chrom.numeric) wave.out <- pSegmentWavelets(cghE1[, 5:7], chrom.numeric) wave.nm.out <- pSegmentWavelets(cghE1[, 5:7], chrom.numeric, mergeSegs = FALSE) haar.out <- pSegmentHaarSeg(cghE1[, 5:7], chrom.numeric) ## Next onew we try, as it could fail when it should not. try(psw.pos.out <- pSegmentPSW(cghE1[, 5:7], chrom.numeric, sign = 1)) try(psw.neg.out <- pSegmentPSW(cghE1[, 5:7], chrom.numeric, sign = -1)) ## BioHMM is the only one that uses distances ## it is the slowest, so do only two ## We use try, as snapCGH sometimes fails and does not load its compiled C code try(biohmm.out <- pSegmentBioHMM(cghE1[, 5:6], chrom.numeric, cghE1$UG.Start)) ## With ACE we need to choose fdr ace.out.sum <- summary(ace.out) #### Writing common <- cghE1[, 1:4] writeResults(hmm.out, cghE1[, 5:7], common) writeResults(glad.out, cghE1[, 5:7], common) writeResults(cghseg.out, cghE1[, 5:7], common) writeResults(ace.out.sum, cghE1[, 5:7], common) writeResults(wave.out, cghE1[, 5:7], common) writeResults(wave.nm.out, cghE1[, 5:7], common) writeResults(cbs.out, cghE1[, 5:7], common) writeResults(haar.out, cghE1[, 5:7], common) try(writeResults(psw.pos.out, common)) try(writeResults(psw.neg.out, common)) try(writeResults(biohmm.out, cghE1[, 5:6], common)) ### plot all ## Not run: ## This will not work unless you have a working Python installation yminmax <- c(min(as.matrix(cghE1[, 5:7])), max(as.matrix(cghE1[, 5:7]))) segmentPlot(hmm.out, geneNames = cghE1[, 1], yminmax = yminmax, idtype = "ug", organism = "Hs") segmentPlot(glad.out, geneNames = cghE1[, 1], yminmax = yminmax, idtype = "ug", organism = "Hs") segmentPlot(cghseg.out, geneNames = cghE1[, 1], yminmax = yminmax, idtype = "ug", organism = "Hs") segmentPlot(wave.out, geneNames = cghE1[, 1], yminmax = yminmax, idtype = "ug", organism = "Hs") segmentPlot(wave.out, geneNames = cghE1[, 1], yminmax = yminmax, idtype = "ug", organism = "Hs") segmentPlot(haar.out, geneNames = cghE1[, 1], yminmax = yminmax, idtype = "ug", organism = "Hs") segmentPlot(cbs.out, geneNames = cghE1[, 1], yminmax = yminmax, idtype = "ug", organism = "Hs") try(segmentPlot(psw.pos.out, geneNames = cghE1[, 1], yminmax = yminmax, idtype = "ug", organism = "Hs")) try(segmentPlot(psw.neg.out, geneNames = cghE1[, 1], yminmax = yminmax, idtype = "ug", organism = "Hs")) try(segmentPlot(biohmm.out, geneNames = cghE1[, 1], yminmax = yminmax, idtype = "ug", organism = "Hs")) segmentPlot(ace.out.sum, geneNames = cghE1[, 1], yminmax = yminmax, idtype = "ug", organism = "Hs") ## End(Not run) ## To verify we are doing the right thing with Picard et al.'s method, ## compare with their figure 1. Need to get the txt files with the ## original data. ## We got the data from the GLAD BioC package, and then exported the txt. ## For convenience, it is in the "/inst/txt-data" directory of our package. ## This verifies the output (look at ADaCGH.R for further details and other ## comparisons): ## coriel.data <- read.table("gm03563.txt", header = TRUE) ## cd3 <- coriel.data[coriel.data$Chromosome == 3, 3] ## out.our <- CGHsegWrapper(cd3, rep(1, length(cd3)), s = -0.5) ## k = 2 ## out.our ## so our implementation seems correct. Look at where the ## breakpoint is located, etc, and it is like figure 1 of ## Picard's paper.
Documentation reproduced from package ADaCGH, version 1.5-3. License: GPL (>= 3)