Parallel Evaluation of Large Scale Hierarchical Clustering Results

David Cruz-Rodriguez
Computer Engineering

Dr. Luis M. Vicente, Ph.D.
Electrical & Computer Engineering
Polytechnic University of Puerto Rico

Abstract ⎯ Data clustering refers to the automatic grouping of objects based on their similarity, i.e., similar objects should be placed in the same group and dissimilar objects in different groups. For hierarchical clustering algorithms in particular, there is also the notion of a hierarchy into which the objects and the clusters fit. Clustering is a fundamental task in data mining, machine learning, information retrieval, bioinformatics, and image analysis, among others. Evaluating the result of a clustering algorithm is therefore important. However, most evaluation approaches are geared towards non-hierarchical clustering; this research explores how traditional validity measures can be used to evaluate and assess hierarchical clustering results.

Key Terms ⎯ Clustering, Data Clustering, Hierarchical Clustering, and Validity Measures.

DEVELOPMENT OF THE PROBLEM

Clustering is a fundamental task in data mining, machine learning, information retrieval, bioinformatics, and image analysis, among others. The challenge of evaluating the goodness of a clustering is as important as the result produced by the algorithm itself. The most common approach to this challenge is to use non-hierarchical evaluations. This research explores a different approach: using traditional non-hierarchical validity indexes to assess the hierarchical approach.

Background

A cluster is a group of objects that are highly similar to each other and clearly different from the objects in other clusters. In other words, points within a cluster are more similar to each other than to points belonging to other clusters. An example of clustering is shown in Figure 1: points drawn with the same color are given the same label, meaning that they belong to the same cluster.

Figure 1
K-means Clustering

There is a large number of clustering approaches for solving or modeling the same problem. Given a large data set, one challenge is choosing the clustering model that describes the data set most adequately. Another important question is how to confirm how well the chosen clustering model fits the data. The techniques that answer this question are known as cluster validity methods.

Many validity indexes have been proposed in the literature to address this problem. For example, several of the approaches are based on counting pairs [1], [2], [3], [4], [5]. When comparing two clusterings by counting pairs of points, the key criterion is determining the pairs of points on which the two clusterings agree or disagree [2]. Other examples are based on entropy and purity, which measure the quality of a set of clusters, among many others [1].

Significance of the Study

This research surveys the core concepts and techniques in the large variety of cluster validation methods for non-hierarchical clustering, so that they can be applied to hierarchical evaluation. Generally speaking, there are three types of clustering validation techniques: external, internal, and relative criteria [6], [7]. In this research the attention is focused on the external criteria.

REVIEW OF RELEVANT LITERATURE

This section gives an overview of clustering evaluation and validation indexes.

Figure 2
Flat vs. Hierarchical Clustering
Flat Clustering

The purpose of flat clustering is to produce clusters without any explicit structure relating the labels of the clusters to one another. A good example is the k-means result from Figure 1. As illustrated in that figure, the data are clustered and labeled with a color that symbolizes each cluster. From an exploratory standpoint it is easy to see how similar the data points are to one another. The algorithm is simple and effective: it minimizes the sum of squared distances between each point and its corresponding cluster centroid. That is perhaps why k-means is the best-known flat clustering algorithm among academics. Expectation maximization is also one of the most popular ones.

Within flat clustering a second distinction can be made, described below.
•	Hard Clustering: each object either belongs to exactly one cluster or it does not.
•	Soft Clustering: each object's membership is distributed over all clusters.

The k-means algorithm [8] is a non-hierarchical method and the most important flat clustering algorithm. According to [8], its purpose is not to find some unique, definitive grouping, but rather to aid the investigator in obtaining a qualitative and quantitative understanding of large amounts of N-dimensional data by providing good similarity groups.

Hierarchical Clustering

Imagine that we have a dispersion of 10 points in 2-D and that applying a hierarchical algorithm produces the output depicted in Figure 2. By quick inspection the similarity structure can be read off directly: each level of the hierarchy has at least a couple of merges that form a cluster, except label number 6 (identified by the color violet), which has only one merge. Additionally, the cutting point gives the number of clusters at each level of the hierarchy, and the vertical axis provides the similarity measure between clusters. In other words, the hierarchical method is more informative than the unstructured representation of flat clustering. Hierarchical clustering is also convenient because it does not require the number of clusters to be specified in advance, as flat clustering does.

Hierarchical clustering can generally be grouped into two main categories, described below.
•	Agglomerative: based on a "bottom up" approach. Looking at Figure 2, the agglomerative approach starts with each number as its own cluster (1, 3, 4, 9, 10, 13, 21, 23, 28, and 29) and then merges clusters, depending on the selected parameter, until a single cluster is left.
•	Divisive: based on a "top down" approach. Using the same example, but this time starting with all the numbers in one cluster, the method splits clusters until only single-number clusters are left.

The depiction of a hierarchical clustering is called a dendrogram. The dendrogram is the hierarchy tree representation that shows the structure of the clusters, as depicted in Figure 2. Looking at a hierarchical clustering output (see Figure 2), it is clear that it is more informative than the unstructured set of clusters reviewed under Flat Clustering.
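To make the contrast between flat and hierarchical clustering concrete, here is a minimal sketch (not part of the paper): it clusters a small synthetic 2-D point set with k-means as a flat method and with agglomerative (Ward) linkage as a hierarchical method, cutting the dendrogram to obtain a flat labeling. The synthetic blobs, the choice of three clusters, and the Ward linkage are illustrative assumptions; scikit-learn and SciPy are used as stand-ins for whatever implementation the paper's experiments relied on.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# Illustrative synthetic 2-D data: three loose blobs of points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(10, 2)),
    rng.normal(loc=(4, 0), scale=0.5, size=(10, 2)),
    rng.normal(loc=(2, 3), scale=0.5, size=(10, 2)),
])

# Flat clustering: k-means assigns each point one of k labels,
# with no structure relating the labels to each other.
flat_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative) clustering: Ward linkage builds the full
# merge tree; cutting it yields a flat labeling for any desired number
# of clusters.  scipy.cluster.hierarchy.dendrogram(Z) would draw the
# Figure 2-style tree.
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=3, criterion="maxclust")

print("flat labels        :", flat_labels)
print("hierarchical labels:", hier_labels)
```

Cutting the same dendrogram at different heights (different values of t) produces different flat partitions, which is exactly the property the future-work section of this paper proposes to exploit.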
Cluster Validity Methods

Generally speaking, there are three types of clustering validation techniques: external, internal, and relative criteria [6], [7].
•	External Criteria: based on a priori knowledge about the data. This means that the data are evaluated against a previously known set of clusters, or against the results of another clustering algorithm, usually referred to as the "partition" (P).
•	Internal Criteria: based on the vectors of the data set alone. This case is very different from the external criteria, because the clustering results are evaluated from the clustering results themselves.
•	Relative Criteria: based on evaluating the clustering structure by comparing it to other clustering schemes, i.e., a comparison with a different algorithm on the same input data.

If the attention is restricted to externally validated data sets, it is important to keep in mind that the external criteria require prior knowledge of the data sets. For example, imagine that both supervised and unsupervised learning results are available. In order to use the external indexes, the prior knowledge of the data set, recognized as the supervised learning labels, is needed; this is the information that a human imposes. The external criteria therefore compare the prior information against the results generated by the clustering. It is also important to keep in mind that in the real world there usually is no prior information about the data sets.

Previous works propose various measures intended to quantify how well a clustering algorithm fits a data set [1], [6], [7], [9]. The starting point is to review the well-known external indexes: the Rand index [4], the Adjusted Rand index [5], the Jaccard index [1], and the Fowlkes & Mallows index [3], which are based on counting the pairs of points on which two clusterings agree or disagree. The Entropy index measures the quality of the clusters with respect to the single class labels; according to [1], entropy-based techniques have also been described as "variation of information" [2], among many others. The Purity index measures the frequency of the most common label in each cluster.

VALIDATION INDEXES

This section gives an overview of validation indexes for the external criterion. Refer to Table 1 below for the meaning of the notation used in the individual validity indexes.

Table 1
Notation in Validity Indexes

| Notation   | Meaning                                                      |
| M          | Maximum number of pairs                                      |
| n          | Total number of points                                       |
| C_i        | Cluster i of the clustering result C                         |
| P_j        | Group j of the external partition P                          |
| n_ij       | Number of elements shared by cluster C_i and partition P_j   |
| n_i., n_.j | Row and column sums of the confusion matrix                  |

Metric Based on Counting Pairs

One approach to evaluating clustering metrics is to consider statistics over pairs of items [2], [6], [7]. When comparing two clusterings by counting pairs of points, the key criterion is determining the pairs of points on which the two clusterings agree or disagree.

For example, consider a given set of n objects D = {O_1, ..., O_n}, and suppose that C = {C_1, ..., C_K} (C is our clustering result) and P = {P_1, ..., P_K'} (P is our external criterion, or partition) represent two different partitions of the objects in D (D is our data). With this in mind, we can create the contingency table, or confusion matrix [10], [2], between our partition and the clustering (see Figure 3).

Figure 3
Confusion Matrix

| Partition/Cluster | P_1  | P_2  | ... | P_K'  | Sums     |
| C_1               | n_11 | n_12 | ... | n_1K' | n_1.     |
| C_2               | n_21 | n_22 | ... | n_2K' | n_2.     |
| ...               | ...  | ...  | ... | ...   | ...      |
| C_K               | n_K1 | n_K2 | ... | n_KK' | n_K.     |
| Sums              | n_.1 | n_.2 | ... | n_.K' | n_.. = n |
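As a concrete illustration of Figure 3, the following minimal sketch (not from the paper) builds the confusion matrix n_ij = |C_i ∩ P_j| of two label assignments over the same objects; the example label vectors C and P are made up for illustration.

```python
import numpy as np

def confusion_matrix(cluster_labels, partition_labels):
    """Contingency table n_ij = |C_i ∩ P_j| between a clustering C
    and an external partition P over the same n objects."""
    c_ids = np.unique(cluster_labels)
    p_ids = np.unique(partition_labels)
    n = np.zeros((len(c_ids), len(p_ids)), dtype=int)
    for i, c in enumerate(c_ids):
        for j, p in enumerate(p_ids):
            n[i, j] = np.sum((cluster_labels == c) & (partition_labels == p))
    return n

# Hypothetical example: 8 objects, a 3-cluster result C and a 2-group partition P.
C = np.array([0, 0, 1, 1, 1, 2, 2, 2])
P = np.array([0, 0, 0, 1, 1, 1, 1, 0])

N = confusion_matrix(C, P)
print(N)                 # the n_ij table of Figure 3
print(N.sum(axis=1))     # row sums n_i.  (cluster sizes)
print(N.sum(axis=0))     # column sums n_.j (partition sizes)
print(N.sum())           # n.. = n, total number of points
```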
The counting pairs matrix records the overlap between pairs of points; every pair falls into exactly one of four cases, described below (see Figure 4). SS is the number of pairs of items belonging to the same cluster and the same partition; SD is the number of pairs belonging to the same cluster but different partitions; DS is the number of pairs belonging to different clusters but the same partition; DD is the number of pairs belonging to different clusters and different partitions.

Figure 4
Counting Pairs Matrix

|    | P  | P' |
| C  | SS | SD |
| C' | DS | DD |

The four counts always satisfy M = SS + SD + DS + DD = n(n - 1)/2, meaning that M is the maximum number of pairs, where n is the total number of points shared by C and P. The quantities SS and DD can be interpreted as agreements ("good choices") and SD and DS as disagreements ("bad choices").

Some of the measures that define similarity from these pair counts are the following.

•	Rand Index: The Rand index (Equation (1)), or Rand measure (Equation (2)) [4], is, in statistics and in particular in data clustering, a measure of agreement between the partitions; [11] recommends it: "This measure appears to be one of the most popular alternatives for comparing partitions..." (pp. 193-194).

$$R = \frac{SS + DD}{SS + SD + DS + DD} \qquad (1)$$

$$R = \frac{SS + DD}{M} \qquad (2)$$

According to [4], the Rand index lies between 0 and 1. When the two partitions agree perfectly, the Rand index is 1. The disadvantage of the Rand index is that the expected value of the Rand index of two random partitions does not take a constant value (say, zero) [5].

•	Adjusted Rand Index: According to Wikipedia (2011), the Adjusted Rand index proposed by [11] is the corrected-for-chance version of the Rand index (see Equation (3)). The adjusted index is bounded by 1, and it takes the value 0 when the index equals its expected value [5].

$$ARI = \frac{\displaystyle\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}{\displaystyle\frac{1}{2}\left[\sum_{i}\binom{n_{i\cdot}}{2} + \sum_{j}\binom{n_{\cdot j}}{2}\right] - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}} \qquad (3)$$

•	Jaccard Index: Another useful approach to measure the overlap between partitions is the Jaccard index [1], [5], which may be expressed as Equation (4). As with the Rand index, the Jaccard index lies between 0 and 1.

$$J = \frac{SS}{SS + SD + DS} \qquad (4)$$

•	Fowlkes and Mallows Index: [3] developed another method for comparing partitions; [11] gives a brief description of how it works. It is defined as Equation (5).

$$FM = \sqrt{\frac{SS}{SS + SD}\cdot\frac{SS}{SS + DS}} \qquad (5)$$

Metrics Based on Entropy

The Entropy index (Equation (6)) is the weighted sum of the entropies e_i of the individual clusters. If a cluster contains objects with only a single label, its entropy is 0; a perfect clustering solution therefore has entropy 0. If the clusters become more diverse, mixing documents from several class labels, the entropy value grows. The lower the entropy value, the better the clustering.

$$E = \sum_{i=1}^{K} \frac{n_{i\cdot}}{n}\, e_i, \qquad e_i = -\sum_{j} \frac{n_{ij}}{n_{i\cdot}} \log_2 \frac{n_{ij}}{n_{i\cdot}} \qquad (6)$$

Evaluation by Set Matching

The Purity index (Equation (7)) is very similar to the entropy index. To calculate the purity of a clustering we first calculate the purity p_i of each cluster; the overall purity index is then the weighted sum of the individual cluster purities.

$$Purity = \sum_{i=1}^{K} \frac{n_{i\cdot}}{n}\, p_i, \qquad p_i = \frac{1}{n_{i\cdot}} \max_j n_{ij} \qquad (7)$$
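All six indexes of this section can be derived from the confusion matrix. The sketch below is an illustrative Python implementation of Equations (1)-(7); the helper names (pair_counts, rand, and so on) are ours, not the paper's or any library's API, and the Adjusted Rand formula is written in an equivalent pair-count form of Equation (3).

```python
import numpy as np
from math import comb, sqrt

def pair_counts(N):
    """SS, SD, DS, DD from a confusion matrix N (rows = clusters C_i,
    columns = partition groups P_j), following Figure 4."""
    n = int(N.sum())
    M = comb(n, 2)                                          # n(n-1)/2
    SS = sum(comb(int(v), 2) for v in N.ravel())            # same cluster, same partition
    SD = sum(comb(int(v), 2) for v in N.sum(axis=1)) - SS   # same cluster, diff. partition
    DS = sum(comb(int(v), 2) for v in N.sum(axis=0)) - SS   # diff. cluster, same partition
    DD = M - SS - SD - DS                                   # diff. cluster, diff. partition
    return SS, SD, DS, DD, M

def rand(SS, SD, DS, DD):            # Equation (1)/(2)
    return (SS + DD) / (SS + SD + DS + DD)

def adjusted_rand(SS, SD, DS, DD):   # Equation (3) in pair-count form
    M = SS + SD + DS + DD
    expected = (SS + SD) * (SS + DS) / M
    return (SS - expected) / (0.5 * ((SS + SD) + (SS + DS)) - expected)

def jaccard(SS, SD, DS):             # Equation (4)
    return SS / (SS + SD + DS)

def fowlkes_mallows(SS, SD, DS):     # Equation (5)
    return sqrt(SS / (SS + SD) * SS / (SS + DS))

def entropy_and_purity(N):           # Equations (6) and (7)
    n = N.sum()
    E = P = 0.0
    for row in N:
        ni = row.sum()
        if ni == 0:
            continue
        p = row[row > 0] / ni
        E += (ni / n) * float(-(p * np.log2(p)).sum())   # weighted entropy, Eq. (6)
        P += row.max() / n                               # weighted purity, Eq. (7)
    return E, P
```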
RESULTS

This section presents an experimental test using a flat clustering result as an example. Computing validation indexes such as those summarized in Table 2 can take a long time when only one computer is used to analyze millions of objects. Parallel or distributed computing takes advantage of these validation indexes by arranging them to work together on the same problem, thereby reducing the time needed to obtain the evaluation.

Almost all indexes based on prior knowledge about the data (external criteria) can be described using the so-called confusion matrix, also known as the association matrix or contingency table, for the evaluation based on counting pairs (see Table 2). The confusion matrix [11] is a K x K' matrix whose elements n_ij count the points shared by the groups of the two partitions of the given set of n objects D = {O_1, ..., O_n}; n_i. and n_.j denote the corresponding row and column totals. Each element is obtained with Equation (8) (see Figure 3):

$$n_{ij} = |C_i \cap P_j| \qquad (8)$$

We used a synthetic data set (see Figure 5). This data set was evaluated first sequentially and then with parallel processing.

Figure 5
Data Set

Sequential

First, using the given set of n objects D = {O_1, ..., O_n}, and supposing that C = {C_1, ..., C_K} and P = {P_1, ..., P_K'} represent the two partitions of the objects in D, we calculate the confusion matrix of Figure 3. For this data set the confusion matrix is 5 x 5, and its n_ij elements are shown in Figure 6.

Figure 6
Confusion Matrix from the Data Set

| Partition/Cluster | P_1 | P_2 | P_3 | P_4 | P_5 | Sums   |
| C_1               | 1   | 2   | 1   | 2   | 5   | 11     |
| C_2               | 4   | 2   | 3   | 0   | 3   | 12     |
| C_3               | 2   | 3   | 0   | 2   | 2   | 9      |
| C_4               | 2   | 2   | 0   | 0   | 4   | 8      |
| C_5               | 0   | 0   | 0   | 0   | 0   | 0      |
| Sums              | 9   | 9   | 4   | 4   | 14  | n = 40 |

Then, having computed the confusion matrix, we can obtain the counting pairs matrix (see Figure 4). As presented in [5], [11], SS + DD can be simplified to a linear transformation of the confusion matrix entries, Equation (9):

$$SS + DD = \binom{n}{2} + \sum_{i,j} n_{ij}^2 - \frac{1}{2}\left(\sum_{i} n_{i\cdot}^2 + \sum_{j} n_{\cdot j}^2\right) \qquad (9)$$

With simple algebra, the counting pairs matrix can be calculated (see Figure 7).

Figure 7
Counting Pairs Matrix Results

| SS = sum_{i,j} C(n_ij, 2)    | = 39  |
| SD = sum_i C(n_i., 2) - SS   | = 146 |
| DS = sum_j C(n_.j, 2) - SS   | = 136 |
| DD = C(n, 2) - SS - SD - DS  | = 459 |

The maximum number of pairs in our data set is M = 39 + 136 + 146 + 459 = 780, and the total number of points is n = 40. Now we can directly compute the external validity indexes defined in Table 1 to measure the degree of similarity between P and C. Using the values from the counting pairs matrix we calculate the Rand, Adjusted Rand, Jaccard, and Fowlkes & Mallows indexes (see Figure 8).

Figure 8
Validation Index Results Using Figure 7

| # | Index | Similarity (0 <= lambda <= 1) |
| 1 | R     | 0.6385                        |
| 2 | ARI   | 0.018098                      |
| 3 | J     | 0.1215                        |
| 4 | FM    | 0.2168                        |

Table 2
External Indexes Constraints

| # | Validation Index    | Notation     | Clustering Method |
| 1 | Rand                | Equation (1) | Flat              |
| 2 | Adjusted Rand       | Equation (3) | Flat              |
| 3 | Jaccard             | Equation (4) | Flat              |
| 4 | Fowlkes and Mallows | Equation (5) | Flat              |
| 5 | Entropy             | Equation (6) | Flat              |
| 6 | Purity              | Equation (7) | Flat              |

For the other two validity indexes listed in Table 2, Entropy and Purity, we use the confusion matrix directly (see Figure 9).

Figure 9
Entropy and Purity Calculation

| Cluster | P_1 | P_2 | P_3 | P_4 | P_5 | Entropy | Purity |
| C_1     | 1   | 2   | 1   | 2   | 5   | 2.04037 | 0.4545 |
| C_2     | 4   | 2   | 3   | 0   | 3   | 1.95915 | 0.3333 |
| C_3     | 2   | 3   | 0   | 2   | 2   | 1.97494 | 0.3333 |
| C_4     | 2   | 2   | 0   | 0   | 4   | 1.5     | 0.5    |
| C_5     | 0   | 0   | 0   | 0   | 0   | 0       | 0      |
| Total   |     |     |     |     |     | 1.95915 | 0.4    |

As we can see, with respect to this data set the validity evaluation shows that we fail to obtain a strong fit. In other words, the results show that our clustering C possesses only a weak similarity with P.
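The counting-pairs step above can be checked with a few lines of code. This sketch, assuming the row/column orientation shown in Figure 6 (rows as clusters C_i, columns as partition groups P_j), recomputes SS, SD, DS, DD, M and three of the indexes from the confusion matrix; it is an illustration, not part of the paper's experiment.

```python
import numpy as np
from math import comb, sqrt

# Confusion matrix from Figure 6 (rows C_1..C_5, columns P_1..P_5).
N = np.array([[1, 2, 1, 2, 5],
              [4, 2, 3, 0, 3],
              [2, 3, 0, 2, 2],
              [2, 2, 0, 0, 4],
              [0, 0, 0, 0, 0]])

n  = int(N.sum())                                       # 40 points
M  = comb(n, 2)                                         # 780 pairs
SS = sum(comb(int(v), 2) for v in N.ravel())            # 39
SD = sum(comb(int(v), 2) for v in N.sum(axis=1)) - SS   # 146
DS = sum(comb(int(v), 2) for v in N.sum(axis=0)) - SS   # 136
DD = M - SS - SD - DS                                   # 459

print(SS, SD, DS, DD, M)
print("Rand index      :", (SS + DD) / M)                        # ~0.6385
print("Jaccard index   :", SS / (SS + SD + DS))                  # ~0.1215
print("Fowlkes-Mallows :", SS / sqrt((SS + SD) * (SS + DS)))     # ~0.2168
```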
Parallel

Task parallelism is a form of computation in which a calculation is divided into many tasks that run simultaneously in a distributed fashion, operating on the principle that a large-scale data set can be computed in small partitions while producing the same result as the sequential computation.

The time complexity of processing large-scale data sets generally grows with their size, and so does the cluster validation process that evaluates them. The precise time complexity of the sequential and parallel versions is not discussed in this research.

First, let us see whether the counting pairs matrix, the key element in calculating most of the external indexes, can be computed in parallel. Imagine that we have the data set of n objects D = {O_1, ..., O_n} discussed earlier, and suppose that C = {C_1, ..., C_K} and P = {P_1, ..., P_K'} represent the two partitions of the objects in D. Then imagine that the master machine splits the work among slave "tasks", so that each slave sees only part of the data and the global confusion matrix entry n_ij is the sum of the partial counts n_ij^(s) obtained by each slave s (see Figure 10 and Equation (10)):

$$n_{ij} = n_{ij}^{(1)} + n_{ij}^{(2)} + \cdots + n_{ij}^{(p)} \qquad (10)$$

Figure 10
Parallelization of the Data Set

Now, expanding the binomial coefficient of the pair counts over two such partitions, Equation (11), we obtain:

$$\binom{n_{ij}}{2} = \binom{n_{ij}^{(1)} + n_{ij}^{(2)}}{2} = \binom{n_{ij}^{(1)}}{2} + \binom{n_{ij}^{(2)}}{2} + n_{ij}^{(1)}\, n_{ij}^{(2)} \qquad (11)$$

Completing the square, a step derived from the quadratic formula, makes the cross term explicit: the pair counts of the separate partitions are coupled, so a direct parallel calculation of the counting pairs matrix is not currently possible. The confusion matrix itself, however, is simply the sum of the partial matrices, Equation (12):

$$n_{ij} = n_{ij}^{(1)} + n_{ij}^{(2)} \qquad (12)$$

Instead of parallelizing the counting pairs matrix, we therefore use the confusion matrix. By inspection, this method is easily parallelized: since the confusion matrix is a plain summation over the data, it satisfies the equation discussed above. First, the master machine creates a partition of the data set. Then each partition is given to a slave, which calculates its small partial confusion matrix. Finally, the slaves return their results to the master, which assembles the total confusion matrix and continues with the sequential calculation explained earlier.
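A minimal sketch of this master/slave scheme, assuming Python's multiprocessing module as a stand-in for the slave tasks and randomly generated labels as a stand-in for the real data set, is shown below. Each worker builds a partial confusion matrix from its chunk of label pairs, and the master sums the partials as in Equation (10) before continuing with the sequential index calculation.

```python
import numpy as np
from multiprocessing import Pool

K, KP = 5, 5   # number of clusters in C and groups in P (assumed known up front)

def partial_confusion(chunk):
    """Slave task: confusion matrix of one chunk of (cluster, partition) labels."""
    c, p = chunk
    m = np.zeros((K, KP), dtype=np.int64)
    np.add.at(m, (c, p), 1)          # m[c_k, p_k] += 1 for every object k in the chunk
    return m

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 1_000_000                                   # illustrative large data set
    C = rng.integers(0, K, size=n)                  # clustering labels
    P = rng.integers(0, KP, size=n)                 # external partition labels

    chunks = list(zip(np.array_split(C, 8), np.array_split(P, 8)))
    with Pool(processes=4) as pool:
        partials = pool.map(partial_confusion, chunks)   # Figure 10: work sent to slaves

    N = np.sum(partials, axis=0)                    # Equation (10): n_ij = sum of partials
    assert N.sum() == n                             # master resumes the sequential steps
    print(N)
```

Only the partial confusion matrices are exchanged between tasks; the pair counts and all six indexes are then derived from the summed matrix exactly as in the sequential case.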
CONCLUSION

Determining how well a clustering fits a large-scale data set is an important and challenging issue for clustering processes. In this work, a method to improve the complexity of the evaluation of large-scale data sets has been presented, based on an external criterion and inspired by an information-theoretic approach, in order to assess the validity of clustering solutions in parallel.

The experiment carried out on a synthetic data set, using six external indexes, shows that when the method is parallelized the theoretical complexity drops, meaning that the validity evaluation will be faster.

As has also been noted, another issue beyond the scope of this work is the problem of evaluating large-scale hierarchical clustering results. For future work, two possible approaches for validity methods were proposed: to define a pair-counting strategy that takes into account how close the points are in the dendrogram, and to compute several measures for different flat results obtained by "cutting" the dendrogram at various levels and report their average.

Since hierarchical clustering is a structured exploration of the data that promises further contributions to data mining, information retrieval, and many other fields, there is still a need to develop quality measures that assess the quality of hierarchical clustering.

References

[1]	Amigo, E., et al., "A Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints", Springer Standard Collection, 12(4), 2008, 461-486.
[2]	Meila, M., "Comparing Clusterings", Association for Computing Machinery, 2005, 577-584.
[3]	Fowlkes, E., et al., "A Method for Comparing Two Hierarchical Clusterings", Journal of the American Statistical Association, 78(383), 1983, 553-569.
[4]	Rand, W., "Objective Criteria for the Evaluation of Clustering Methods", Journal of the American Statistical Association, 66, 1971, 846-850.
[5]	Yeung, K., et al., "Details of the Adjusted Rand Index and Clustering Algorithms, Supplement to the Paper 'An Empirical Study on Principal Component Analysis for Clustering Gene Expression Data'", Bioinformatics, 17(9), 2001, 763.
[6]	Halkidi, M., et al., "Cluster Validity Methods: Part I", ACM SIGMOD Record, 31(2), 2002, 40-45.
[7]	Halkidi, M., et al., "Clustering Validity Checking Methods: Part II", ACM SIGMOD Record, 31(3), 2002, 19-27.
[8]	MacQueen, J., "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, 281-297.
[9]	Heeringa, W., et al., "Validating Dialect Comparison Methods", Classification, Automation, and New Media, 2000, 445-460.
[10]	Pearson, K., "Mathematical Contributions to the Theory of Evolution", Drapers' Company Research Memoirs, 1904, 13-17.
[11]	Hubert, L., et al., "Comparing Partitions", Journal of Classification, 2(1), 1985, 193-218.
[12]	Davies, D. L., et al., "A Cluster Separation Measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), 1979, 224-227.
[13]	Dunn, J. C., "Well Separated Clusters and Optimal Fuzzy Partitions", Journal of Cybernetics, 4, 1974, 95-104.
[14]	Fraley, C., et al., "How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis", The Computer Journal, 41(8), 1998, 578-588.
[15]	Grimmer, J., et al., "Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology", PNAS, 108(7), 2011, 2643-2650.
[16]	Jain, A. K., et al., "Data Clustering: A Review", ACM Computing Surveys (CSUR), 31(3), 1999.
[17]	Raftery, A., "A Note on Bayes Factors for Log-Linear Contingency Table Models with Vague Prior Information", Journal of the Royal Statistical Society, 48(2), 1986, 249-250.
[18]	Zhao, Y., "Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering", Machine Learning, 55(3), 2004, 311-331.