Skip to Content

kmeans {stats}

K-Means Clustering
Package: 
stats
Version: 
R 2.15.3

Description

Perform k-means clustering on a data matrix.

Usage

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                     "MacQueen"))
 
## S3 method for class 'kmeans':
fitted((object, method = c("centers", "classes"), ...))

Arguments

x
numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
centers
either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.
iter.max
the maximum number of iterations allowed.
nstart
if centers is a number, how many random sets should be chosen?
algorithm
character: may be abbreviated.
object
an R object of class "kmeans", typically the result ob of ob <- kmeans(..).
method
character: may be abbreviated. "centers" causes fitted to return cluster centers (one for each input point) and "classes" causes fitted to return a vector of class assignments.
...
not used.

Details

The data given by x is clustered by the k-means method, which aims to partition the points into k groups such that the sum of squares from points to the assigned cluster centres is minimized. At the minimum, all cluster centres are at the mean of their Voronoi sets (the set of data points which are nearest to the cluster centre).

The algorithm of Hartigan and Wong (1979) is used by default. Note that some authors use k-means to refer to a specific algorithm rather than the general method: most commonly the algorithm given by MacQueen (1967) but sometimes that given by Lloyd (1957) and Forgy (1965). The Hartigan--Wong algorithm generally does a better job than either of those, but trying several random starts (nstart) is often recommended. For ease of programmatic exploration, k=1 is allowed, notably returning the center and withinss.

Except for the Lloyd--Forgy method, k clusters will always be returned if a number is specified. If an initial matrix of centres is supplied, it is possible that no point will be closest to one or more centres, which is currently an error for the Hartigan--Wong method.

Values

kmeans returns an object of class "kmeans" which has a print and a fitted method. It is a list with components:

cluster
A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
centers
A matrix of cluster centres.
totss
The total sum of squares.
withinss
Vector of within-cluster sum of squares, one component per cluster.
tot.withinss
Total within-cluster sum of squares, i.e., sum(withinss).
betweenss
The between-cluster sum of squares, i.e. totss-tot.withinss.
size
The number of points in each cluster.

References

Forgy, E. W. (1965) Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics 21, 768--769.

Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics 28, 100--108.

Lloyd, S. P. (1957, 1982) Least squares quantization in PCM. Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory 28, 128--137.

MacQueen, J. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281--297. Berkeley, CA: University of California Press.

Examples

require(graphics)
 
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
 
# sum of squares
ss <- function(x) sum(scale(x, scale = FALSE)^2)
 
## cluster centers "fitted" to each obs.:
fitted.x <- fitted(cl);  head(fitted.x)
resid.x <- x - fitted(cl)
 
## Equalities : ----------------------------------
cbind(cl[c("betweenss", "tot.withinss", "totss")], # the same two columns
         c(ss(fitted.x), ss(resid.x),    ss(x)))
stopifnot(all.equal(cl$ totss,        ss(x)),
      all.equal(cl$ tot.withinss, ss(resid.x)),
      ## these three are the same:
      all.equal(cl$ betweenss,    ss(fitted.x)),
      all.equal(cl$ betweenss, cl$totss - cl$tot.withinss),
      ## and hence also
      all.equal(ss(x), ss(fitted.x) + ss(resid.x))
      )
 
kmeans(x,1)$withinss # trivial one-cluster, (its W.SS == ss(x))
 
## random starts do help here with too many clusters
## (and are often recommended anyway!):
(cl <- kmeans(x, 5, nstart = 25))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8)

Documentation reproduced from R 2.15.3. License: GPL-2.