energy.hclust {energy}                                R Documentation
Description:

Performs hierarchical clustering on a set of Euclidean distance
dissimilarities by the minimum (energy) E-distance method.

Usage:

energy.hclust(dst)
Arguments:

dst: a dissimilarity object produced by dist with method = "euclidean",
     or the lower triangle of a distance matrix as a vector in column
     order. If dst is a square matrix, the lower triangle is
     interpreted as a vector of distances.
Details:

This function performs agglomerative hierarchical cluster analysis
based on the pairwise distances between sample elements in dst.
Initially, each of the n singletons is a cluster. At each of n-1 steps, the
procedure merges the pair of clusters with minimum e-distance.
The e-distance between two clusters C_i, C_j of sizes n_i, n_j is given by

    e(C_i, C_j) = \frac{n_i n_j}{n_i + n_j} \left[ 2 M_{ij} - M_{ii} - M_{jj} \right],

where

    M_{ij} = \frac{1}{n_i n_j} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \| X_{ip} - X_{jq} \|,

\| \cdot \| denotes the Euclidean norm, and X_{ip} denotes the p-th
observation in the i-th cluster.
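For illustration, the following is a minimal sketch of this formula and
not the implementation used internally by energy.hclust; the function
name edist_pair and the example matrices A and B are assumptions made
here, with each cluster stored as a numeric matrix having one
observation per row.

## Sketch only: e-distance between two clusters, following the formula above.
## A and B are numeric matrices with observations in rows (same number of columns).
edist_pair <- function(A, B) {
  n_i <- nrow(A)
  n_j <- nrow(B)
  D <- as.matrix(dist(rbind(A, B)))        # all pairwise Euclidean distances
  M_ij <- mean(D[1:n_i, n_i + (1:n_j)])    # mean between-cluster distance
  M_ii <- mean(as.matrix(dist(A)))         # mean within-cluster distance, cluster i
  M_jj <- mean(as.matrix(dist(B)))         # mean within-cluster distance, cluster j
  (n_i * n_j / (n_i + n_j)) * (2 * M_ij - M_ii - M_jj)
}

## e.g. two small clusters in R^2
A <- matrix(rnorm(10), ncol = 2)
B <- matrix(rnorm(10, mean = 3), ncol = 2)
edist_pair(A, B)

At each of the n-1 steps described above, the procedure merges the pair
of clusters for which this quantity is smallest.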
The return value is an object of class hclust, so hclust methods such
as print, plot, plclust, and cutree are available. See the
documentation for hclust.
The e-distance measures both the heterogeneity between clusters and the homogeneity within clusters. E-clustering is particularly effective in high dimension, and is more effective than some standard hierarchical methods when clusters have equal means (see example below). For other advantages see the references.
Value:

An object of class hclust which describes the tree produced by the
clustering process. The object is a list with the following components
(a brief inspection sketch follows the list):

merge        an n-1 by 2 matrix, where row i of merge describes the
             merging of clusters at step i of the clustering. If an
             element j in the row is negative, then observation -j was
             merged at this stage. If j is positive, then the merge
             was with the cluster formed at the (earlier) stage j of
             the algorithm.

height       the clustering height: a vector of n-1 non-decreasing
             real numbers (the e-distance between merging clusters).

order        a vector giving a permutation of the indices of the
             original observations suitable for plotting, in the sense
             that a cluster plot using this ordering and the matrix
             merge will not have crossings of the branches.

labels       labels for each of the objects being clustered.

call         the call which produced the result.

method       the cluster method that has been used (e-distance).

dist.method  the distance that has been used to create dst.
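As a brief inspection sketch (assuming the USArrests data also used in
the Examples below, with the energy package attached; the particular
indices shown are chosen only for illustration):

library(mva)                      # for dist, as in the Examples below
data(USArrests)
ecl <- energy.hclust(dist(USArrests))
ecl$merge[1:3, ]    # first merges: negative entries are single observations,
                    # positive entries refer to clusters formed at earlier steps
ecl$height[1:3]     # e-distances at which those merges occurred
head(ecl$order)     # plotting order of the observations
head(ecl$labels)    # labels for the clustered objects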
Author(s):

Maria L. Rizzo <rizzo@math.ohiou.edu> and Gabor J. Szekely
<gabors@bgnet.bgsu.edu>
References:

Szekely, G. J. and Rizzo, M. L. (2003) Hierarchical Clustering via
Joint Between-Within Distances, submitted.

Szekely, G. J. and Rizzo, M. L. (2003) Testing for Equal Distributions
in High Dimension, submitted.

Szekely, G. J. (2000) E-statistics: Energy of Statistical Samples,
preprint.
Examples:

## Not run: 
library(cluster)
data(animals)
plot(energy.hclust(dist(animals)))
## End(Not run)

library(mva)
data(USArrests)
ecl <- energy.hclust(dist(USArrests))
print(ecl)
plot(ecl)
cutree(ecl, k=3)
cutree(ecl, h=150)

## compare performance of e-clustering, Ward's method, group average method
## when sampled populations have equal means: n=200, d=5, two groups
z <- rbind(matrix(rnorm(1000), nrow=200), matrix(rnorm(1000, 0, 5), nrow=200))
g <- c(rep(1, 200), rep(2, 200))
d <- dist(z)
e <- energy.hclust(d)
a <- hclust(d, method="average")
w <- hclust(d^2, method="ward")
list("E" = table(cutree(e, k=2) == g),
     "Ward" = table(cutree(w, k=2) == g),
     "Avg" = table(cutree(a, k=2) == g))