# pvclust {pvclust}

### Description

calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Hierarchical clustering is done for given data and p-values are computed for each of the clusters.

### Usage

pvclust(data, method.hclust="average", method.dist="correlation", use.cor="pairwise.complete.obs", nboot=1000, parallel=FALSE, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE, iseed=NULL, quiet=FALSE) parPvclust(cl=NULL, data, method.hclust="average", method.dist="correlation", use.cor="pairwise.complete.obs", nboot=1000, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE, init.rand=NULL, iseed=NULL, quiet=FALSE)

### Arguments

- data
- numeric data matrix or data frame.
- method.hclust
- the agglomerative method used in hierarchical clustering. This should be (an abbreviation of) one of
`"average"`

,`"ward.D"`

,`"ward.D2"`

,`"single"`

,`"complete"`

,`"mcquitty"`

,`"median"`

or`"centroid"`

. The default is`"average"`

. See`method`

argument in`hclust`

. - method.dist
- the distance measure to be used. This should be a character string, or a function which returns a
`dist`

object. A character string should be (an abbreviation of) one of`"correlation"`

,`"uncentered"`

,`"abscor"`

or those which are allowed for`method`

argument in`dist`

function. The default is`"correlation"`

. See*details*section in this help and`method`

argument in`dist`

. - use.cor
- character string which specifies the method for computing correlation with data including missing values. This should be (an abbreviation of) one of
`"all.obs"`

,`"complete.obs"`

or`"pairwise.complete.obs"`

. See the`use`

argument in`cor`

function. - nboot
- the number of bootstrap replications. The default is
`1000`

. - parallel
- switch for parallel computation. If
`FALSE`

the computation is done in non-parallel mode. If`TRUE`

or a positive integer is supplied, parallel computation is done with automatically generated PSOCK cluster. Use`TRUE`

for default cluster size (`parallel::detectCores() - 1`

), or specify the size by an integer. If a`cluster`

object is supplied the cluster is used for parallel computation. Note that`NULL`

is currently not allowed for using the default cluster. - r
- numeric vector which specifies the relative sample sizes of bootstrap replications. For original sample size n and bootstrap sample size n', this is defined as r=n'/n.
- store
- locical. If
`store=TRUE`

, all bootstrap replications are stored in the output object. The default is`FALSE`

. - cl
- a cluster object created by package parallel or snow. If NULL, use the registered default cluster.
- weight
- logical. If
`weight=TRUE`

, resampling is made by weight vector instead of index vector. Useful for large`r`

value (`r>10`

). Currently, available only for distance`"correlation"`

and`"abscor"`

. - init.rand
- logical. If
`init.rand=TRUE`

, random number generators are initialized. Use`iseed`

argument to achieve reproducible results.**This argument is duplicated and will be unavailable in the future.** - iseed
- An integer. If non-
`NULL`

value is supplied random number generators are initialized. It is passed to`set.seed`

or`clusterSetRNGStream`

. - quiet
- logical. If
`TRUE`

it does not report the progress.

### Details

Function `pvclust`

conducts multiscale bootstrap resampling to calculate p-values for each cluster in the result of hierarchical clustering. `parPvclust`

is the parallel version of this procedure which depends on package parallel for parallel computation.

For data expressed as (n, p) matrix or data frame, we assume that the data is n observations of p objects, which are to be clustered. The i'th row vector corresponds to the i'th observation of these objects and the j'th column vector corresponds to a sample of j'th object with size n.

There are several methods to measure the dissimilarities between objects. For data matrix X, `"correlation"`

method takes for dissimilarity between j'th and k'th object, where cor is function \code{cor}.

`"uncentered"`

takes uncentered sample correlation and `"abscor"`

takes the absolute value of sample correlation

### Values

- hclust
- hierarchical clustering for original data generated by function
`hclust`

. See`hclust`

for details. - edges
- data frame object which contains p-values and supporting informations such as standard errors.
- count
- data frame object which contains primitive information about the result of multiscale bootstrap resampling.
- msfit
- list whose elements are results of curve fitting for multiscale bootstrap resampling, of class
`msfit`

. See`msfit`

for details. - nboot
- numeric vector of number of bootstrap replications.
- r
- numeric vector of the relative sample size for bootstrap replications.
- store
- list contains bootstrap replications if
`store=TRUE`

was given for function`pvclust`

or`parPvclust`

.

### References

Suzuki, R. and Shimodaira, H. (2006) "Pvclust: an R package for assessing the uncertainty in hierarchical clustering", *Bioinformatics*, 22 (12): 1540-1542. Shimodaira, H. (2004) "Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling", *Annals of Statistics*, 32, 2616-2641.

Shimodaira, H. (2002) "An approximately unbiased test of phylogenetic tree selection", *Systematic Biology*, 51, 492-508.

Suzuki, R. and Shimodaira, H. (2004) "An application of multiscale bootstrap resampling to hierarchical clustering of microarray data: How accurate are these clusters?", *The Fifteenth International Conference on Genome Informatics 2004*, P034.

### See Also

`lines.pvclust`

, `print.pvclust`

, `msfit`

, `plot.pvclust`

, `text.pvclust`

, `pvrect`

and `pvpick`

.

### Examples

### example using Boston data in package MASS data(Boston, package = "MASS") ## multiscale bootstrap resampling (non-parallel) boston.pv <- pvclust(Boston, nboot=100, parallel=FALSE) ## CAUTION: nboot=100 may be too small for actual use. ## We suggest nboot=1000 or larger. ## plot/print functions will be useful for diagnostics. ## plot dendrogram with p-values plot(boston.pv) ask.bak <- par()$ask par(ask=TRUE) ## highlight clusters with high au p-values pvrect(boston.pv) ## print the result of multiscale bootstrap resampling print(boston.pv, digits=3) ## plot diagnostic for curve fitting msplot(boston.pv, edges=c(2,4,6,7)) par(ask=ask.bak) ## print clusters with high p-values boston.pp <- pvpick(boston.pv) boston.pp ### Using a custom distance measure ## Define a distance function which returns an object of class "dist". ## The function must have only one argument "x" (data matrix or data.frame). cosine <- function(x) { x <- as.matrix(x) y <- t(x) %*% x res <- 1 - y / (sqrt(diag(y)) %*% t(sqrt(diag(y)))) res <- as.dist(res) attr(res, "method") <- "cosine" return(res) } result <- pvclust(Boston, method.dist=cosine, nboot=100) plot(result) ## Not run: ### parallel computation result.par <- pvclust(Boston, nboot=1000, parallel=TRUE) plot(result.par) ## End(Not run)

Documentation reproduced from package pvclust, version 2.0-0. License: GPL (>= 2)