
pvclust {pvclust}

Calculating P-values for Hierarchical Clustering
Package: pvclust
Version: 1.2-2

Description

Calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Hierarchical clustering is performed on the given data, and a p-value is computed for each cluster.

Usage

pvclust(data, method.hclust="average",
        method.dist="correlation", use.cor="pairwise.complete.obs",
        nboot=1000, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE)

parPvclust(cl, data, method.hclust="average",
           method.dist="correlation", use.cor="pairwise.complete.obs",
           nboot=1000, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE,
           init.rand=TRUE, seed=NULL)

Arguments

data
numeric data matrix or data frame.
method.hclust
the agglomerative method used in hierarchical clustering. This should be (an abbreviation of) one of "average", "ward", "single", "complete", "mcquitty", "median" or "centroid". The default is "average". See method argument in hclust.
method.dist
the distance measure to be used. This should be (an abbreviation of) one of "correlation", "uncentered", "abscor" or those which are allowed for method argument in dist function. The default is "correlation". See details section in this help and method argument in dist.
use.cor
character string which specifies the method for computing correlation with data including missing values. This should be (an abbreviation of) one of "all.obs", "complete.obs" or "pairwise.complete.obs". See the use argument in cor function.
nboot
the number of bootstrap replications. The default is 1000.
r
numeric vector which specifies the relative sample sizes of bootstrap replications. For original sample size n and bootstrap sample size n', this is defined as r=n'/n.
store
logical. If store=TRUE, all bootstrap replications are stored in the output object. The default is FALSE.
cl
snow cluster object which may be generated by function makeCluster. See snow-startstop in snow package.
weight
logical. If weight=TRUE, resampling is done with a weight vector instead of an index vector. Useful for large r values (r>10). Currently available only for the distances "correlation" and "abscor".
init.rand
logical. If init.rand=TRUE, random number generators are initialized at child processes. Random seeds can be set by seed argument.
seed
integer vector of random seeds. It should have the same length as cl. If NULL is specified, 1:length(cl) is used as seed vector. The default is NULL.
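
The relation r=n'/n from the r argument can be made concrete with a short sketch. It assumes that the bootstrap sample size n' is obtained by rounding n*r to the nearest integer; this help page does not state the rounding rule, so treat that detail as an assumption. The value n=506 is the number of rows of the Boston data used in the Examples.

```r
# Original sample size (rows of the Boston data used in the Examples).
n <- 506

# Default relative sample sizes from the Usage section.
r <- seq(.5, 1.4, by = .1)

# Assumed rounding rule (not stated in this help page).
n.boot <- round(n * r)

# Relative sample sizes actually realised after rounding.
r.eff <- n.boot / n

data.frame(r = r, n.boot = n.boot, r.eff = round(r.eff, 4))
```

Because n' must be an integer, the realised ratio n'/n can differ slightly from the requested r, which matters mainly for small n.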

Details

Function pvclust conducts multiscale bootstrap resampling to calculate p-values for each cluster in the result of hierarchical clustering. parPvclust is the parallel version of this procedure which depends on snow package for parallel computation.
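
To illustrate what resampling at several scales means, the following is a minimal sketch in base R. It is not the package's implementation: the helpers cluster.cols, has.cluster and boot.freq are hypothetical names introduced here, the toy data are simulated, and the sketch only reports the raw frequency with which a cluster reappears at each scale r. The curve fitting that turns these frequencies into p-values (see msfit) is not reproduced.

```r
set.seed(1)

# Toy data: n = 60 observations (rows) of p = 4 objects (columns).
# Columns a and b share one signal, columns c and d share another.
n  <- 60
s1 <- rnorm(n)
s2 <- rnorm(n)
X  <- cbind(a = s1 + rnorm(n, sd = 0.3), b = s1 + rnorm(n, sd = 0.3),
            c = s2 + rnorm(n, sd = 0.3), d = s2 + rnorm(n, sd = 0.3))

# Cluster the columns with correlation distance and average linkage.
cluster.cols <- function(x) hclust(as.dist(1 - cor(x)), method = "average")

# TRUE if `members` is exactly one of the groups obtained by cutting the tree.
# Written for trees with at least three leaves.
has.cluster <- function(hc, members) {
  labs <- hc$labels
  for (k in 2:(length(labs) - 1)) {
    grp <- cutree(hc, k = k)
    for (g in unique(grp)) {
      if (setequal(labs[grp == g], members)) return(TRUE)
    }
  }
  FALSE
}

# Hypothetical helper: bootstrap frequency of a cluster at one scale r.
# Rows are resampled with replacement; n' = round(n * r) is an assumption.
boot.freq <- function(x, members, r, nboot = 200) {
  n.boot <- round(nrow(x) * r)
  hits <- replicate(nboot, {
    idx <- sample(nrow(x), n.boot, replace = TRUE)
    has.cluster(cluster.cols(x[idx, , drop = FALSE]), members)
  })
  mean(hits)
}

# Frequency of the cluster {a, b} at three scales.
freq <- sapply(c(0.5, 1.0, 1.4), function(r) boot.freq(X, c("a", "b"), r))
freq
```

In this toy example the cluster {a, b} is strongly supported, so the frequencies stay close to 1 at every scale. For weakly supported clusters the frequency typically changes with r, and that change is the information multiscale bootstrap uses.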

For data expressed as (n, p) matrix or data frame, we assume that the data is n observations of p objects, which are to be clustered. The i'th row vector corresponds to the i'th observation of these objects and the j'th column vector corresponds to a sample of j'th object with size n.
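
The orientation convention above can be checked with a small base R sketch. The data are simulated and the object names are placeholders; hclust with a correlation-based distance stands in for the clustering step so that the example runs without the package.

```r
set.seed(42)

# n = 20 observations (rows) of p = 5 objects (columns).
X <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5,
            dimnames = list(NULL, paste0("obj", 1:5)))

# Clustering the columns: the dendrogram has one leaf per object.
hc.cols <- hclust(as.dist(1 - cor(X)), method = "average")
length(hc.cols$order)

# To cluster the rows instead, transpose so that rows become columns.
hc.rows <- hclust(as.dist(1 - cor(t(X))), method = "average")
length(hc.rows$order)
```

By the same convention, data whose rows are the items to be clustered would presumably be passed to pvclust as t(data).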

There are several methods to measure the dissimilarities between objects. For data matrix X, the "correlation" method takes 1 - cor(X)[j,k] as the dissimilarity between the j'th and k'th objects, where cor is the function cor.

"uncentered" takes the uncentered sample correlation and "abscor" takes the absolute value of the sample correlation, that is, 1 - abs(cor(X)[j,k]).
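
The three correlation-based dissimilarities can be compared on a small simulated matrix. This is a sketch in base R rather than the package's own code. The "correlation" and "abscor" forms are 1 - cor(X)[j,k] and 1 - abs(cor(X)[j,k]); the "uncentered" form shown here, one minus the inner product of two columns divided by the product of their norms, is an assumption about what "uncentered sample correlation" means.

```r
set.seed(7)

# 30 observations of 4 objects; b is built to be strongly anti-correlated with a.
X <- matrix(rnorm(30 * 4), nrow = 30,
            dimnames = list(NULL, c("a", "b", "c", "d")))
X[, "b"] <- -X[, "a"] + rnorm(30, sd = 0.1)

# "correlation": 1 - cor(X)[j, k]
d.cor <- 1 - cor(X)

# "abscor": 1 - |cor(X)[j, k]|
d.abs <- 1 - abs(cor(X))

# "uncentered" (assumed form):
#   1 - sum(x_j * x_k) / sqrt(sum(x_j^2) * sum(x_k^2))
P     <- crossprod(X)                          # inner products of the columns
d.unc <- 1 - P / sqrt(outer(diag(P), diag(P)))

# Dissimilarity between a and b under each measure.
round(c(correlation = d.cor["a", "b"],
        abscor      = d.abs["a", "b"],
        uncentered  = d.unc["a", "b"]), 3)
```

Under "correlation" the anti-correlated pair a and b is nearly as far apart as possible, while under "abscor" the same pair is nearly identical. The choice of method.dist therefore decides whether negatively correlated objects are grouped together.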

Values

hclust
hierarchical clustering for original data generated by function hclust. See hclust for details.
edges
data frame object which contains p-values and supporting information such as standard errors.
count
data frame object which contains primitive information about the result of multiscale bootstrap resampling.
msfit
list whose elements are results of curve fitting for multiscale bootstrap resampling, of class msfit. See msfit for details.
nboot
numeric vector of number of bootstrap replications.
r
numeric vector of the relative sample size for bootstrap replications.
store
list containing the bootstrap replications if store=TRUE was given to function pvclust or parPvclust.

References

Shimodaira, H. (2004) "Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling", Annals of Statistics, 32, 2616-2641.

Shimodaira, H. (2002) "An approximately unbiased test of phylogenetic tree selection", Systematic Biology, 51, 492-508.

Suzuki, R. and Shimodaira, H. (2004) "An application of multiscale bootstrap resampling to hierarchical clustering of microarray data: How accurate are these clusters?", The Fifteenth International Conference on Genome Informatics 2004, P034.

http://www.is.titech.ac.jp/~shimo/prog/pvclust/

See Also

lines.pvclust, print.pvclust, msfit, plot.pvclust, text.pvclust, pvrect and pvpick.

Examples

## using Boston data in package MASS
library(MASS)
data(Boston)
 
## multiscale bootstrap resampling
boston.pv <- pvclust(Boston, nboot=100)
 
## CAUTION: nboot=100 may be too small for actual use.
##          We suggest nboot=1000 or larger.
##          plot/print functions will be useful for diagnostics.
 
## plot dendrogram with p-values
plot(boston.pv)
 
ask.bak <- par()$ask
par(ask=TRUE)
 
## highlight clusters with high au p-values
pvrect(boston.pv)
 
## print the result of multiscale bootstrap resampling
print(boston.pv, digits=3)
 
## plot diagnostic for curve fitting
msplot(boston.pv, edges=c(2,4,6,7))
 
par(ask=ask.bak)
 
## Print clusters with high p-values
boston.pp <- pvpick(boston.pv)
boston.pp
 
 
## Not run:
## parallel computation via snow package
library(snow)
cl <- makeCluster(10, type="MPI")
 
## parallel version of pvclust
boston.pv <- parPvclust(cl, Boston, nboot=1000)

## release the worker processes when done
stopCluster(cl)
## End(Not run)

Author(s)

Ryota Suzuki ryota.suzuki@is.titech.ac.jp

Documentation reproduced from package pvclust, version 1.2-2. License: GPL (>= 2)