Compute silhouette information according to a given clustering in k clusters.
silhouette(x, ...)

## S3 method for class 'default':
silhouette(x, dist, dmatrix, ...)

## S3 method for class 'partition':
silhouette(x, ...)

## S3 method for class 'clara':
silhouette(x, full = FALSE, ...)

sortSilhouette(object, ...)

## S3 method for class 'silhouette':
summary(object, FUN = mean, ...)

## S3 method for class 'silhouette':
plot(x, nmax.lab = 40, max.strlen = 5,
     main = NULL, sub = NULL,
     xlab = expression("Silhouette width "* s[i]),
     col = "gray", do.col.sort = length(col) > 1, border = 0,
     cex.names = par("cex.axis"), do.n.k = TRUE, do.clus.stat = TRUE, ...)
- x: an object of appropriate class; for the default method, an integer vector with k different integer cluster codes, or a list with such an x$clustering component. Note that silhouette statistics are only defined if 2 <= k <= n-1.
- dist: a dissimilarity object inheriting from class dist or coercible to one. If not specified, dmatrix must be.
- dmatrix: a symmetric dissimilarity matrix (n * n), specified instead of dist, which can be more efficient.
- full: logical specifying if a full silhouette should be computed for a clara object. Note that this requires O(n^2) memory, since the full dissimilarity (see daisy) is needed internally.
- object: an object of class silhouette.
- ...: further arguments passed to and from methods.
- FUN: function used to summarize silhouette widths.
- nmax.lab: integer indicating the number of labels which is considered too large for single-name labeling of the silhouette plot.
- max.strlen: positive integer giving the length to which strings are truncated in silhouette plot labeling.
- main, sub, xlab: arguments to title; these have a sensible non-NULL default here.
- col, border, cex.names: arguments passed to barplot(); note that the default used to be col = heat.colors(n), border = par("fg") instead. col can also be a color vector of length k for clusterwise coloring, see also do.col.sort.
- do.col.sort: logical indicating if the colors col should be sorted "along" the silhouette; this is useful for casewise or clusterwise coloring.
- do.n.k: logical indicating if n and k "title text" should be written.
- do.clus.stat: logical indicating if cluster size and averages should be written right to the silhouettes.
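The two ways of supplying dissimilarities to the default method can be compared with a short sketch. It assumes the cluster package is installed and uses its ruspini data; the variable names (cl, d, dm, same_widths) are illustrative only.

```r
library(cluster)                      # silhouette(), pam(), ruspini

data(ruspini)
cl <- pam(ruspini, 4)$clustering      # integer cluster codes, k = 4

d  <- dist(ruspini)                   # object of class "dist"
dm <- as.matrix(d)                    # symmetric n x n dissimilarity matrix

sil_dist    <- silhouette(cl, dist = d)
sil_dmatrix <- silhouette(cl, dmatrix = dm)

# Both routes are expected to give the same silhouette widths
same_widths <- isTRUE(all.equal(unname(sil_dist[, "sil_width"]),
                                unname(sil_dmatrix[, "sil_width"])))
print(same_widths)
```

Passing dmatrix is mainly of interest when the full matrix already exists, so that it does not have to be rebuilt from a dist object.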
For each observation i, the silhouette width s(i) is defined as follows:
Put a(i) = average dissimilarity between i and all other points of the cluster to which i belongs (if i is the only observation in its cluster, s(i) := 0 without further calculations). For all other clusters C, put d(i,C) = average dissimilarity of i to all observations of C. The smallest of these d(i,C) is b(i) := min_C d(i,C), and can be seen as the dissimilarity between i and its "neighbor" cluster, i.e., the nearest one to which it does not belong. Finally,
s(i) := ( b(i) - a(i) ) / max( a(i), b(i) ).
silhouette.default() is now based on C code donated by Romain Francois (the R version is still available).
Observations with a large s(i) (almost 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster.
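The definition of s(i) can be checked by hand. The following is a minimal sketch, assuming the cluster package and its ruspini data; sil_width_by_hand() is a hypothetical helper written for this illustration and is not part of the package.

```r
library(cluster)

data(ruspini)
cl <- pam(ruspini, 3)$clustering
dm <- as.matrix(dist(ruspini))

# Hypothetical helper (not part of cluster): s(i) straight from the definition
sil_width_by_hand <- function(i, cl, dm) {
  own  <- cl[i]
  same <- setdiff(which(cl == own), i)        # other members of i's cluster
  if (length(same) == 0) return(0)            # singleton cluster: s(i) := 0
  a <- mean(dm[i, same])                      # a(i)
  others <- setdiff(unique(cl), own)
  b <- min(sapply(others, function(C) mean(dm[i, cl == C])))  # b(i)
  (b - a) / max(a, b)
}

s_hand <- sapply(seq_along(cl), sil_width_by_hand, cl = cl, dm = dm)
s_pkg  <- silhouette(cl, dmatrix = dm)[, "sil_width"]

# Largest absolute difference between the two computations
max_abs_diff <- max(abs(s_hand - unname(s_pkg)))
print(max_abs_diff)
```

The helper loops over observations in plain R and is only meant to mirror the definition; for real data, silhouette() itself is the better choice.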
silhouette() returns an object, sil, of class silhouette which is an [n x 3] matrix with attributes. For each observation i, sil[i,] contains the cluster to which i belongs as well as the neighbor cluster of i (the cluster, not containing i, for which the average dissimilarity between its observations and i is minimal), and the silhouette width s(i) of the observation. The colnames correspondingly are c("cluster", "neighbor", "sil_width").
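Because the result is a matrix, its columns can be used directly. A small sketch, assuming the cluster package and the ruspini data; the names misplaced and avg_by_cluster are illustrative.

```r
library(cluster)

data(ruspini)
si <- silhouette(pam(ruspini, 4))

dim(si)                      # n rows, 3 columns
colnames(si)                 # "cluster" "neighbor" "sil_width"
head(si[, "sil_width"])

# Observations with a negative width are candidates for being misplaced
misplaced <- which(si[, "sil_width"] < 0)

# Average width per cluster, computed directly from the matrix columns
avg_by_cluster <- tapply(si[, "sil_width"], si[, "cluster"], mean)
print(avg_by_cluster)
```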
summary(sil) returns an object of class summary.silhouette, a list with components
- summary of the individual silhouette widths s(i).
- numeric (rank 1) array of clusterwise means of silhouette widths, where mean = FUN is used.
- the total mean FUN(s), where s are the individual silhouette widths.
- table of the k cluster sizes.
- if available, the call creating sil.
- logical identical to attr(sil, "Ordered"), see below.
sortSilhouette(sil) orders the rows of sil as in the silhouette plot, by cluster (increasingly) and decreasing silhouette width s(i).
attr(sil, "Ordered") is a logical indicating if sil is ordered as by sortSilhouette(). In that case, rownames(sil) will contain case labels or numbers, and attr(sil, "iOrd") the ordering index vector.
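The ordering attributes and the summary object can be explored as follows. This is a sketch assuming the cluster package and the ruspini data; it starts from the default method so that the rows begin in data order, and it lists the summary components with names() rather than assuming them.

```r
library(cluster)

data(ruspini)
cl <- pam(ruspini, 4)$clustering
si <- silhouette(cl, dist(ruspini))   # default method: rows in data order

ssi  <- sortSilhouette(si)
attr(ssi, "Ordered")                  # TRUE after sorting
iOrd <- attr(ssi, "iOrd")             # ordering index vector

# Rows are grouped by cluster, with non-increasing width inside each cluster
sorted_ok <- !is.unsorted(ssi[, "cluster"]) &&
  all(tapply(ssi[, "sil_width"], ssi[, "cluster"],
             function(w) !is.unsorted(rev(w))))
print(sorted_ok)

sm <- summary(si)                     # default FUN = mean
names(sm)                             # component names of the summary list
sm_med <- summary(si, FUN = median)   # clusterwise medians instead of means
```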
Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20, 53-65.
chapter 2 of Kaufman, L. and Rousseeuw, P.J. (1990), see the references in
While silhouette() is intrinsic to the partition clusterings, and hence has a (trivial) method for these, it is straightforward to get silhouettes from hierarchical clusterings by calling the default method with the result of cutree() and a distance as input.
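For a hierarchical clustering, this might look as follows. It is a sketch using hclust() from the stats package with average linkage and k = 4, both chosen only for illustration.

```r
library(cluster)

data(ruspini)
d  <- dist(ruspini)
hc <- hclust(d, method = "average")   # stats::hclust, illustrative choice
cl <- cutree(hc, k = 4)               # integer cluster codes

si_hc <- silhouette(cl, d)            # default method: codes plus distances
summary(si_hc)
```

The same pattern applies to any method that yields integer cluster codes, as long as a matching dissimilarity is supplied.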
By default, for clara() partitions, the silhouette is just for the best random subset used. Use full = TRUE to compute (and later possibly plot) the full silhouette.
library(cluster)

data(ruspini)
pr4 <- pam(ruspini, 4)
str(si <- silhouette(pr4))
(ssi <- summary(si))
plot(si) # silhouette plot
plot(si, col = c("red", "green", "blue", "purple")) # with cluster-wise coloring

si2 <- silhouette(pr4$clustering, dist(ruspini, "canberra"))
summary(si2) # has small values: "canberra"'s fault
plot(si2, nmax = 80, cex.names = 0.6)

op <- par(mfrow = c(3,2), oma = c(0,0, 3, 0),
          mgp = c(1.6,.8,0), mar = .1+c(4,2,2,2))
for(k in 2:6)
   plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE)
mtext("PAM(Ruspini) as in Kaufman & Rousseeuw, p.101",
      outer = TRUE, font = par("font.main"), cex = par("cex.main")); frame()

## the same with cluster-wise colours:
c6 <- c("tomato", "forest green", "dark blue", "purple2", "goldenrod4", "gray20")
for(k in 2:6)
   plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE,
        col = c6[1:k])
par(op)

## clara(): standard silhouette is just for the best random subset
data(xclara)
set.seed(7)
str(xc1k <- xclara[sample(nrow(xclara), size = 1000) ,])
cl3 <- clara(xc1k, 3)
plot(silhouette(cl3)) # only of the "best" subset of 46
## The full silhouette: internally needs large (36 MB) dist object:
sf <- silhouette(cl3, full = TRUE) ## this is the same as
s.full <- silhouette(cl3$clustering, daisy(xc1k))
## compare R versions as versions, not as strings
if(getRversion() >= "2.3.0")
   stopifnot(all.equal(sf, s.full, check.attributes = FALSE, tolerance = 0))
## color dependent on original "3 groups of each 1000":
plot(sf, col = 2 + as.integer(names(cl3$clustering)) %/% 1000,
     main = "plot(silhouette(clara(.), full = TRUE))")

## Silhouette for a hierarchical clustering:
ar <- agnes(ruspini)
si3 <- silhouette(cutree(ar, k = 5), # k = 4 gave the same as pam() above
                  daisy(ruspini))
plot(si3, nmax = 80, cex.names = 0.5)
## 2 groups: Agnes() wasn't too good:
si4 <- silhouette(cutree(ar, k = 2), daisy(ruspini))
plot(si4, nmax = 80, cex.names = 0.5)
Documentation reproduced from package cluster, version 1.14.4. License: GPL (>= 2)