# pvclust {pvclust}

### Description

Calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Hierarchical clustering is performed on the given data, and a p-value is computed for each cluster.

### Usage

    pvclust(data, method.hclust="average", method.dist="correlation",
            use.cor="pairwise.complete.obs", nboot=1000,
            r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE)

    parPvclust(cl, data, method.hclust="average", method.dist="correlation",
               use.cor="pairwise.complete.obs", nboot=1000,
               r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE,
               init.rand=TRUE, seed=NULL)

### Arguments

- data
- numeric data matrix or data frame.
- method.hclust
- the agglomerative method used in hierarchical clustering. This should be (an abbreviation of) one of `"average"`, `"ward"`, `"single"`, `"complete"`, `"mcquitty"`, `"median"` or `"centroid"`. The default is `"average"`. See the `method` argument in `hclust`.
- method.dist
- the distance measure to be used. This should be (an abbreviation of) one of `"correlation"`, `"uncentered"`, `"abscor"` or those allowed for the `method` argument in the `dist` function. The default is `"correlation"`. See the *Details* section of this help page and the `method` argument in `dist`.
- use.cor
- character string which specifies the method for computing correlations when the data include missing values. This should be (an abbreviation of) one of `"all.obs"`, `"complete.obs"` or `"pairwise.complete.obs"`. See the `use` argument in the `cor` function.
- nboot
- the number of bootstrap replications. The default is `1000`.
- r
- numeric vector which specifies the relative sample sizes of bootstrap replications. For original sample size n and bootstrap sample size n', this is defined as r=n'/n.
- store
- logical. If `store=TRUE`, all bootstrap replications are stored in the output object. The default is `FALSE`.
- cl
- snow cluster object, which may be generated by the function `makeCluster`. See `snow-startstop` in the snow package.
- weight
- logical. If `weight=TRUE`, resampling is done with a weight vector instead of an index vector. This is useful for large values of `r` (`r>10`). Currently available only for the distances `"correlation"` and `"abscor"`.
- init.rand
- logical. If `init.rand=TRUE`, random number generators are initialized at the child processes. Random seeds can be set with the `seed` argument.
- seed
- integer vector of random seeds. It should have the same length as `cl`. If `NULL` is specified, `1:length(cl)` is used as the seed vector. The default is `NULL`.
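As a small illustration of the `r` argument, the sketch below converts the default relative sample sizes into bootstrap sample sizes n' for a hypothetical data set with n = 100 observations. Rounding n * r to the nearest integer is an assumption made for this sketch; the package may treat non-integer sizes differently.

```r
# Hypothetical number of observations (rows) in the data
n <- 100

# Default relative sample sizes, as listed in the Usage section
r <- seq(.5, 1.4, by = .1)

# Bootstrap sample size n' for each scale (rounding is an assumption)
n_boot <- round(n * r)

# One bootstrap replication at the smallest scale: draw n' row indices
# with replacement from the n original observations
set.seed(1)
idx <- sample(n, size = n_boot[1], replace = TRUE)

print(data.frame(r = r, n_boot = n_boot))
```

Scales with r < 1 draw fewer rows than the original data, and scales with r > 1 draw more.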

### Details

Function `pvclust` conducts multiscale bootstrap resampling to calculate p-values for each cluster in the result of hierarchical clustering. `parPvclust` is the parallel version of this procedure, which depends on the snow package for parallel computation.
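To make the resampling idea concrete, here is a simplified base-R sketch of ordinary bootstrap resampling at a single scale (r = 1). It is not the package's implementation: the helpers `cluster_sets` and `cluster_fn` and the simulated matrix `X` are hypothetical, and the sketch only estimates how often each original cluster reappears (a bootstrap probability). The multiscale procedure repeats this for several values of `r` and fits a curve to the results, which is not reproduced here.

```r
# Hypothetical helper: members of every cluster (internal node) of an
# hclust tree, encoded as a string of sorted leaf indices
cluster_sets <- function(hc) {
  n_merge <- nrow(hc$merge)
  members <- vector("list", n_merge)
  for (i in seq_len(n_merge)) {
    parts <- lapply(hc$merge[i, ], function(k) {
      if (k < 0) -k else members[[k]]  # negative = leaf, positive = earlier merge
    })
    members[[i]] <- sort(unlist(parts))
  }
  vapply(members, paste, character(1), collapse = ",")
}

# Hypothetical helper: cluster the columns with correlation dissimilarity
cluster_fn <- function(M) hclust(as.dist(1 - cor(M)), method = "average")

set.seed(1)
X <- matrix(rnorm(50 * 6), nrow = 50, ncol = 6)
X[, 2] <- X[, 1] + rnorm(50, sd = 0.1)  # columns 1 and 2 strongly correlated

original <- cluster_sets(cluster_fn(X))

nboot <- 200
hits <- numeric(length(original))
for (b in seq_len(nboot)) {
  rows <- sample(nrow(X), replace = TRUE)           # resample observations
  boot <- cluster_sets(cluster_fn(X[rows, , drop = FALSE]))
  hits <- hits + (original %in% boot)               # cluster reappeared?
}

bp <- hits / nboot  # bootstrap probability of each original cluster
print(data.frame(cluster = original, bp = bp))
```

The cluster containing all columns always reappears, so its estimate is 1 by construction; the interesting values are those of the smaller clusters.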

For data expressed as an (n, p) matrix or data frame, we assume that the data consist of n observations of p objects, and that the p objects (columns) are to be clustered. The i'th row vector corresponds to the i'th observation of these objects, and the j'th column vector corresponds to a sample of size n from the j'th object.
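The orientation matters because the columns, not the rows, are clustered. The sketch below uses only base R (`cor`, `as.dist`, `hclust`) on a made-up matrix to show which dimension ends up as the leaves of the tree; the matrix `X` and its column names are hypothetical. To cluster the observations instead, transpose the data first.

```r
set.seed(42)
n <- 20  # observations (rows)
p <- 5   # objects to be clustered (columns)
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("obj", 1:p)))

# Correlation-based dissimilarity between columns, then clustering
d_cols <- as.dist(1 - cor(X))
hc_cols <- hclust(d_cols, method = "average")
hc_cols$labels         # the p column names: columns are the leaves

# To cluster the n observations instead, transpose the data first
hc_rows <- hclust(as.dist(1 - cor(t(X))), method = "average")
length(hc_rows$order)  # n leaves
```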

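There are several methods to measure the dissimilarities between objects. For data matrix $X = \{x_{ij}\}$, the `"correlation"` method takes `1 - cor(X)[j,k]` as the dissimilarity between the j'th and k'th objects, where `cor` is the R function `cor`.

`"uncentered"` takes the uncentered sample correlation,

$$1 - \frac{\sum_i x_{ij} x_{ik}}{\sqrt{\sum_i x_{ij}^2 \, \sum_i x_{ik}^2}},$$

and `"abscor"` takes the absolute value of the sample correlation, `1 - abs(cor(X)[j,k])`.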
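A minimal base-R sketch of the three dissimilarities, assuming each is one minus the corresponding correlation between columns (Pearson, uncentered, or absolute Pearson). The helper name `dissim` is hypothetical, it assumes complete data with no missing values, and it is not the package's internal code.

```r
# Hypothetical helper: dissimilarity between the columns of X
dissim <- function(X, method = c("correlation", "uncentered", "abscor")) {
  method <- match.arg(method)
  X <- as.matrix(X)
  d <- switch(method,
    correlation = 1 - cor(X),
    abscor      = 1 - abs(cor(X)),
    uncentered  = {
      P <- crossprod(X)            # P[j, k] = sum_i x_ij * x_ik
      norms <- sqrt(diag(P))       # sqrt(sum_i x_ij^2) for each column
      1 - P / outer(norms, norms)  # one minus the uncentered correlation
    }
  )
  as.dist(d)
}

set.seed(1)
X <- matrix(rnorm(30 * 4), nrow = 30, ncol = 4)

d_cor <- dissim(X, "correlation")
d_abs <- dissim(X, "abscor")
d_unc <- dissim(X, "uncentered")

# Any of these can be passed directly to hclust
hc <- hclust(d_cor, method = "average")
```

With `"correlation"`, strongly negatively correlated objects are far apart (dissimilarity near 2), whereas `"abscor"` treats them as close (dissimilarity near 0).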

### Value

- hclust
- hierarchical clustering for original data generated by function `hclust`. See `hclust` for details.
- edges
- data frame object which contains p-values and supporting information such as standard errors.
- count
- data frame object which contains primitive information about the result of multiscale bootstrap resampling.
- msfit
- list whose elements are results of curve fitting for multiscale bootstrap resampling, of class `msfit`. See `msfit` for details.
- nboot
- numeric vector of number of bootstrap replications.
- r
- numeric vector of the relative sample size for bootstrap replications.
- store
- list containing the bootstrap replications if `store=TRUE` was given for function `pvclust` or `parPvclust`.

### References

Shimodaira, H. (2004) "Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling", *Annals of Statistics*, 32, 2616-2641.

Shimodaira, H. (2002) "An approximately unbiased test of phylogenetic tree selection", *Systematic Biology*, 51, 492-508.

Suzuki, R. and Shimodaira, H. (2004) "An application of multiscale bootstrap resampling to hierarchical clustering of microarray data: How accurate are these clusters?", *The Fifteenth International Conference on Genome Informatics 2004*, P034.

### See Also

`lines.pvclust`, `print.pvclust`, `msfit`, `plot.pvclust`, `text.pvclust`, `pvrect` and `pvpick`.

### Examples

    ## using Boston data in package MASS
    library(MASS)
    data(Boston)

    ## multiscale bootstrap resampling
    boston.pv <- pvclust(Boston, nboot=100)

    ## CAUTION: nboot=100 may be too small for actual use.
    ##          We suggest nboot=1000 or larger.
    ##          plot/print functions will be useful for diagnostics.

    ## plot dendrogram with p-values
    plot(boston.pv)

    ask.bak <- par()$ask
    par(ask=TRUE)

    ## highlight clusters with high au p-values
    pvrect(boston.pv)

    ## print the result of multiscale bootstrap resampling
    print(boston.pv, digits=3)

    ## plot diagnostic for curve fitting
    msplot(boston.pv, edges=c(2,4,6,7))

    par(ask=ask.bak)

    ## Print clusters with high p-values
    boston.pp <- pvpick(boston.pv)
    boston.pp

    ## Not run:
    ## parallel computation via snow package
    library(snow)
    cl <- makeCluster(10, type="MPI")

    ## parallel version of pvclust
    boston.pv <- parPvclust(cl, Boston, nboot=1000)

    ## End(Not run)

