Parallel Evaluation of Large Scale Hierarchical Clustering Results

David Cruz-Rodriguez
Computer Engineering

Dr. Luis M. Vicente, Ph.D.
Electrical & Computer Engineering
Polytechnic University of Puerto Rico

Abstract ⎯ Data clustering refers to the automatic grouping of objects based on their similarity, i.e., similar objects should be placed in the same group and dissimilar objects in different groups. For hierarchical clustering algorithms in particular, there is also the notion of a hierarchy into which the objects and the clusters fit. Clustering is a fundamental task in data mining, machine learning, information retrieval, bioinformatics, and image analysis, among others. Evaluating the result of a clustering algorithm is therefore important. However, most evaluation approaches are geared towards non-hierarchical clustering; this research explores how traditional validity measures can be used to evaluate and assess hierarchical clustering results.

Key Terms ⎯ Clustering, Data Clustering, Hierarchical Clustering, and Validity Measures.

DEVELOPMENT OF THE PROBLEM

Clustering is a fundamental task in data mining, machine learning, information retrieval, bioinformatics, and image analysis, among others. The challenge of evaluating the goodness of a clustering is as important as the result produced by the algorithm itself. The most common approach to this challenge is to use non-hierarchical evaluations. This research explores a different approach: using traditional non-hierarchical validity indexes to assess the hierarchical approach.

Background

A cluster is a group of objects that are highly similar to each other and clearly different from the objects in other clusters. In other words, points within a cluster are more similar to each other than to points belonging to other clusters. An example of clustering is shown in Figure 1: points drawn with the same color are given the same label, meaning that they belong to the same cluster.

Figure 1
K-means Clustering

There is a large number of clustering approaches for solving or modeling the same problem. Given a large data set, one challenge is choosing the clustering model that describes the data set most adequately. Another important question is how to confirm how well the chosen clustering model fits the data. The techniques that answer this question are known as cluster validity methods.

Many validity indexes have been proposed in the literature to address this problem. For example, several of the approaches are based on counting pairs [1], [2], [3], [4], [5]. When comparing two clusterings by counting pairs of points, the key criterion is determining the pairs of points on which the two clusterings agree or disagree [2]. Other examples are based on entropy and purity, which measure the quality of a set of clusters, among many others [1].

Significance of the Study

This research surveys the core concepts and techniques in the large variety of cluster validation methods for non-hierarchical clustering, so that they can be applied to hierarchical evaluation. Generally speaking, there are three types of clustering validation techniques: external, internal, and relative criteria [6], [7]. In this research the attention is focused on the external criteria.

REVIEW OF RELEVANT LITERATURE

This section gives an overview of clustering evaluation and validation indexes.

Figure 2
Flat vs. Hierarchical Clustering
Flat Clustering

The purpose of flat clustering is to produce clusters without any explicit structure relating the labels of the clusters to one another. A good example is the k-means result from Figure 1. As illustrated in that figure, the data are clustered and labeled with a color that symbolizes each cluster. From an exploratory standpoint it is easy to see how similar the data points are to one another. The algorithm is simple and effective: it minimizes the sum of squared distances between each point and its corresponding cluster centroid. That is perhaps why k-means is the best-known flat clustering algorithm among academics. Expectation maximization is also one of the most popular ones.

Within flat clustering a second distinction can be made, described below.
•	Hard Clustering: each object either belongs to exactly one cluster or it does not.
•	Soft Clustering: each object's membership is distributed over all clusters.

The k-means algorithm [8] is a non-hierarchical method and the most important flat clustering algorithm. According to [8], its purpose is not to find some unique, definitive grouping, but rather to aid the investigator in obtaining a qualitative and quantitative understanding of large amounts of N-dimensional data by providing good similarity groups.

Hierarchical Clustering

Imagine that we have a dispersion of 10 points in 2-D and that applying a hierarchical algorithm produces the output depicted in Figure 2. By quick inspection the similarity structure can be read off directly: each level of the hierarchy has at least a couple of merges that form a cluster, except label number 6 (identified by the color violet), which has only one merge. Additionally, the cutting point gives the number of clusters at each level of the hierarchy, and the vertical axis provides the similarity measure between clusters. In other words, the hierarchical method is more informative than the unstructured representation of flat clustering. Hierarchical clustering is also convenient because it does not require the number of clusters to be specified in advance, as flat clustering does.

Hierarchical clustering can generally be grouped into two main categories, described below.
•	Agglomerative: based on a "bottom up" approach. Looking at Figure 2, the agglomerative approach starts with each number as its own cluster (1, 3, 4, 9, 10, 13, 21, 23, 28, and 29) and then merges clusters, depending on the selected parameter, until a single cluster is left.
•	Divisive: based on a "top down" approach. Using the same example, but this time starting with all the numbers in one cluster, the method splits clusters until only single-number clusters are left.

The depiction of a hierarchical clustering is called a dendrogram. The dendrogram is the hierarchy tree representation that shows the structure of the clusters, as depicted in Figure 2. Looking at a hierarchical clustering output (see Figure 2), it is clear that it is more informative than the unstructured set of clusters reviewed under Flat Clustering.
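To make the contrast between flat and hierarchical clustering concrete, here is a minimal sketch (not part of the paper): it clusters a small synthetic 2-D point set with k-means as a flat method and with agglomerative (Ward) linkage as a hierarchical method, cutting the dendrogram to obtain a flat labeling. The synthetic blobs, the choice of three clusters, and the Ward linkage are illustrative assumptions; scikit-learn and SciPy are used as stand-ins for whatever implementation the paper's experiments relied on.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# Illustrative synthetic 2-D data: three loose blobs of points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(10, 2)),
    rng.normal(loc=(4, 0), scale=0.5, size=(10, 2)),
    rng.normal(loc=(2, 3), scale=0.5, size=(10, 2)),
])

# Flat clustering: k-means assigns each point one of k labels,
# with no structure relating the labels to each other.
flat_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative) clustering: Ward linkage builds the full
# merge tree; cutting it yields a flat labeling for any desired number
# of clusters.  scipy.cluster.hierarchy.dendrogram(Z) would draw the
# Figure 2-style tree.
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=3, criterion="maxclust")

print("flat labels        :", flat_labels)
print("hierarchical labels:", hier_labels)
```

Cutting the same dendrogram at different heights (different values of t) produces different flat partitions, which is exactly the property the future-work section of this paper proposes to exploit.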
Cluster Validity Methods

Generally speaking, there are three types of clustering validation techniques: external, internal, and relative criteria [6], [7].
•	External Criteria: based on a priori knowledge about the data. This means that the data are evaluated against a previously known set of clusters, or against the results of another clustering algorithm, usually referred to as the "partition" (P).
•	Internal Criteria: based on the vectors of the data set alone. This case is very different from the external criteria, because the clustering results are evaluated from the clustering results themselves.
•	Relative Criteria: based on evaluating the clustering structure by comparing it to other clustering schemes, i.e., a comparison with a different algorithm on the same input data.

If the attention is restricted to externally validated data sets, it is important to keep in mind that the external criteria require prior knowledge of the data sets. For example, imagine that both supervised and unsupervised learning results are available. In order to use the external indexes, the prior knowledge of the data set, recognized as the supervised learning labels, is needed; this is the information that a human imposes. The external criteria therefore compare the prior information against the results generated by the clustering. It is also important to keep in mind that in the real world there usually is no prior information about the data sets.

Previous works propose various measures intended to quantify how well a clustering algorithm fits a data set [1], [6], [7], [9]. The starting point is to review the well-known external indexes: the Rand index [4], the Adjusted Rand index [5], the Jaccard index [1], and the Fowlkes & Mallows index [3], which are based on counting the pairs of points on which two clusterings agree or disagree. The Entropy index measures the quality of the clusters with respect to the single class labels; according to [1], entropy-based techniques have also been described as "variation of information" [2], among many others. The Purity index measures the frequency of the most common label in each cluster.

VALIDATION INDEXES

This section gives an overview of validation indexes for the external criterion. Refer to Table 1 below for the meaning of the notation used in the individual validity indexes.

Table 1
Notation in Validity Indexes

| Notation   | Meaning                                                      |
| M          | Maximum number of pairs                                      |
| n          | Total number of points                                       |
| C_i        | Cluster i of the clustering result C                         |
| P_j        | Group j of the external partition P                          |
| n_ij       | Number of elements shared by cluster C_i and partition P_j   |
| n_i., n_.j | Row and column sums of the confusion matrix                  |

Metric Based on Counting Pairs

One approach to evaluating clustering metrics is to consider statistics over pairs of items [2], [6], [7]. When comparing two clusterings by counting pairs of points, the key criterion is determining the pairs of points on which the two clusterings agree or disagree.

For example, consider a given set of n objects D = {O_1, ..., O_n}, and suppose that C = {C_1, ..., C_K} (C is our clustering result) and P = {P_1, ..., P_K'} (P is our external criterion, or partition) represent two different partitions of the objects in D (D is our data). With this in mind, we can create the contingency table, or confusion matrix [10], [2], between our partition and the clustering (see Figure 3).

Figure 3
Confusion Matrix

| Partition/Cluster | P_1  | P_2  | ... | P_K'  | Sums     |
| C_1               | n_11 | n_12 | ... | n_1K' | n_1.     |
| C_2               | n_21 | n_22 | ... | n_2K' | n_2.     |
| ...               | ...  | ...  | ... | ...   | ...      |
| C_K               | n_K1 | n_K2 | ... | n_KK' | n_K.     |
| Sums              | n_.1 | n_.2 | ... | n_.K' | n_.. = n |
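As a concrete illustration of Figure 3, the following minimal sketch (not from the paper) builds the confusion matrix n_ij = |C_i ∩ P_j| of two label assignments over the same objects; the example label vectors C and P are made up for illustration.

```python
import numpy as np

def confusion_matrix(cluster_labels, partition_labels):
    """Contingency table n_ij = |C_i ∩ P_j| between a clustering C
    and an external partition P over the same n objects."""
    c_ids = np.unique(cluster_labels)
    p_ids = np.unique(partition_labels)
    n = np.zeros((len(c_ids), len(p_ids)), dtype=int)
    for i, c in enumerate(c_ids):
        for j, p in enumerate(p_ids):
            n[i, j] = np.sum((cluster_labels == c) & (partition_labels == p))
    return n

# Hypothetical example: 8 objects, a 3-cluster result C and a 2-group partition P.
C = np.array([0, 0, 1, 1, 1, 2, 2, 2])
P = np.array([0, 0, 0, 1, 1, 1, 1, 0])

N = confusion_matrix(C, P)
print(N)                 # the n_ij table of Figure 3
print(N.sum(axis=1))     # row sums n_i.  (cluster sizes)
print(N.sum(axis=0))     # column sums n_.j (partition sizes)
print(N.sum())           # n.. = n, total number of points
```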
The counting pairs matrix records the overlap between pairs of points; every pair falls into exactly one of four cases, described below (see Figure 4). SS is the number of pairs of items belonging to the same cluster and the same partition; SD is the number of pairs belonging to the same cluster but different partitions; DS is the number of pairs belonging to different clusters but the same partition; DD is the number of pairs belonging to different clusters and different partitions.

Figure 4
Counting Pairs Matrix

|    | P  | P' |
| C  | SS | SD |
| C' | DS | DD |

The four counts always satisfy M = SS + SD + DS + DD = n(n - 1)/2, meaning that M is the maximum number of pairs, where n is the total number of points shared by C and P. The quantities SS and DD can be interpreted as agreements ("good choices") and SD and DS as disagreements ("bad choices").

Some of the measures that define similarity from these pair counts are the following.

•	Rand Index: The Rand index (Equation (1)), or Rand measure (Equation (2)) [4], is, in statistics and in particular in data clustering, a measure of agreement between the partitions; [11] recommends it: "This measure appears to be one of the most popular alternatives for comparing partitions..." (pp. 193-194).

$$R = \frac{SS + DD}{SS + SD + DS + DD} \qquad (1)$$

$$R = \frac{SS + DD}{M} \qquad (2)$$

According to [4], the Rand index lies between 0 and 1. When the two partitions agree perfectly, the Rand index is 1. The disadvantage of the Rand index is that the expected value of the Rand index of two random partitions does not take a constant value (say, zero) [5].

•	Adjusted Rand Index: According to Wikipedia (2011), the Adjusted Rand index proposed by [11] is the corrected-for-chance version of the Rand index (see Equation (3)). The adjusted index is bounded by 1, and it takes the value 0 when the index equals its expected value [5].

$$ARI = \frac{\displaystyle\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}{\displaystyle\frac{1}{2}\left[\sum_{i}\binom{n_{i\cdot}}{2} + \sum_{j}\binom{n_{\cdot j}}{2}\right] - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}} \qquad (3)$$

•	Jaccard Index: Another useful approach to measure the overlap between partitions is the Jaccard index [1], [5], which may be expressed as Equation (4). As with the Rand index, the Jaccard index lies between 0 and 1.

$$J = \frac{SS}{SS + SD + DS} \qquad (4)$$

•	Fowlkes and Mallows Index: [3] developed another method for comparing partitions; [11] gives a brief description of how it works. It is defined as Equation (5).

$$FM = \sqrt{\frac{SS}{SS + SD}\cdot\frac{SS}{SS + DS}} \qquad (5)$$

Metrics Based on Entropy

The Entropy index (Equation (6)) is the weighted sum of the entropies e_i of the individual clusters. If a cluster contains objects with only a single label, its entropy is 0; a perfect clustering solution therefore has entropy 0. If the clusters become more diverse, mixing documents from several class labels, the entropy value grows. The lower the entropy value, the better the clustering.

$$E = \sum_{i=1}^{K} \frac{n_{i\cdot}}{n}\, e_i, \qquad e_i = -\sum_{j} \frac{n_{ij}}{n_{i\cdot}} \log_2 \frac{n_{ij}}{n_{i\cdot}} \qquad (6)$$

Evaluation by Set Matching

The Purity index (Equation (7)) is very similar to the entropy index. To calculate the purity of a clustering we first calculate the purity p_i of each cluster; the overall purity index is then the weighted sum of the individual cluster purities.

$$Purity = \sum_{i=1}^{K} \frac{n_{i\cdot}}{n}\, p_i, \qquad p_i = \frac{1}{n_{i\cdot}} \max_j n_{ij} \qquad (7)$$
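All six indexes of this section can be derived from the confusion matrix. The sketch below is an illustrative Python implementation of Equations (1)-(7); the helper names (pair_counts, rand, and so on) are ours, not the paper's or any library's API, and the Adjusted Rand formula is written in an equivalent pair-count form of Equation (3).

```python
import numpy as np
from math import comb, sqrt

def pair_counts(N):
    """SS, SD, DS, DD from a confusion matrix N (rows = clusters C_i,
    columns = partition groups P_j), following Figure 4."""
    n = int(N.sum())
    M = comb(n, 2)                                          # n(n-1)/2
    SS = sum(comb(int(v), 2) for v in N.ravel())            # same cluster, same partition
    SD = sum(comb(int(v), 2) for v in N.sum(axis=1)) - SS   # same cluster, diff. partition
    DS = sum(comb(int(v), 2) for v in N.sum(axis=0)) - SS   # diff. cluster, same partition
    DD = M - SS - SD - DS                                   # diff. cluster, diff. partition
    return SS, SD, DS, DD, M

def rand(SS, SD, DS, DD):            # Equation (1)/(2)
    return (SS + DD) / (SS + SD + DS + DD)

def adjusted_rand(SS, SD, DS, DD):   # Equation (3) in pair-count form
    M = SS + SD + DS + DD
    expected = (SS + SD) * (SS + DS) / M
    return (SS - expected) / (0.5 * ((SS + SD) + (SS + DS)) - expected)

def jaccard(SS, SD, DS):             # Equation (4)
    return SS / (SS + SD + DS)

def fowlkes_mallows(SS, SD, DS):     # Equation (5)
    return sqrt(SS / (SS + SD) * SS / (SS + DS))

def entropy_and_purity(N):           # Equations (6) and (7)
    n = N.sum()
    E = P = 0.0
    for row in N:
        ni = row.sum()
        if ni == 0:
            continue
        p = row[row > 0] / ni
        E += (ni / n) * float(-(p * np.log2(p)).sum())   # weighted entropy, Eq. (6)
        P += row.max() / n                               # weighted purity, Eq. (7)
    return E, P
```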
RESULTS

This section presents an experimental test using a flat clustering result as an example. Computing validation indexes such as those summarized in Table 2 can take a long time when only one computer is used to analyze millions of objects. Parallel or distributed computing takes advantage of these validation indexes by arranging them to work together on the same problem, thereby reducing the time needed to obtain the evaluation.

Almost all indexes based on prior knowledge about the data (external criteria) can be described using the so-called confusion matrix, also known as the association matrix or contingency table, for the evaluation based on counting pairs (see Table 2). The confusion matrix [11] is a K x K' matrix whose elements n_ij count the points shared by the groups of the two partitions of the given set of n objects D = {O_1, ..., O_n}; n_i. and n_.j denote the corresponding row and column totals. Each element is obtained with Equation (8) (see Figure 3):

$$n_{ij} = |C_i \cap P_j| \qquad (8)$$

We used a synthetic data set (see Figure 5). This data set was evaluated first sequentially and then with parallel processing.

Figure 5
Data Set

Sequential

First, using the given set of n objects D = {O_1, ..., O_n}, and supposing that C = {C_1, ..., C_K} and P = {P_1, ..., P_K'} represent the two partitions of the objects in D, we calculate the confusion matrix of Figure 3. For this data set the confusion matrix is 5 x 5, and its n_ij elements are shown in Figure 6.

Figure 6
Confusion Matrix from the Data Set

| Partition/Cluster | P_1 | P_2 | P_3 | P_4 | P_5 | Sums   |
| C_1               | 1   | 2   | 1   | 2   | 5   | 11     |
| C_2               | 4   | 2   | 3   | 0   | 3   | 12     |
| C_3               | 2   | 3   | 0   | 2   | 2   | 9      |
| C_4               | 2   | 2   | 0   | 0   | 4   | 8      |
| C_5               | 0   | 0   | 0   | 0   | 0   | 0      |
| Sums              | 9   | 9   | 4   | 4   | 14  | n = 40 |

Then, having computed the confusion matrix, we can obtain the counting pairs matrix (see Figure 4). As presented in [5], [11], SS + DD can be simplified to a linear transformation of the confusion matrix entries, Equation (9):

$$SS + DD = \binom{n}{2} + \sum_{i,j} n_{ij}^2 - \frac{1}{2}\left(\sum_{i} n_{i\cdot}^2 + \sum_{j} n_{\cdot j}^2\right) \qquad (9)$$

With simple algebra, the counting pairs matrix can be calculated (see Figure 7).

Figure 7
Counting Pairs Matrix Results

| SS = sum_{i,j} C(n_ij, 2)    | = 39  |
| SD = sum_i C(n_i., 2) - SS   | = 146 |
| DS = sum_j C(n_.j, 2) - SS   | = 136 |
| DD = C(n, 2) - SS - SD - DS  | = 459 |

The maximum number of pairs in our data set is M = 39 + 136 + 146 + 459 = 780, and the total number of points is n = 40. Now we can directly compute the external validity indexes defined in Table 1 to measure the degree of similarity between P and C. Using the values from the counting pairs matrix we calculate the Rand, Adjusted Rand, Jaccard, and Fowlkes & Mallows indexes (see Figure 8).

Figure 8
Validation Index Results Using Figure 7

| # | Index | Similarity (0 <= lambda <= 1) |
| 1 | R     | 0.6385                        |
| 2 | ARI   | 0.018098                      |
| 3 | J     | 0.1215                        |
| 4 | FM    | 0.2168                        |

Table 2
External Indexes Constraints

| # | Validation Index    | Notation     | Clustering Method |
| 1 | Rand                | Equation (1) | Flat              |
| 2 | Adjusted Rand       | Equation (3) | Flat              |
| 3 | Jaccard             | Equation (4) | Flat              |
| 4 | Fowlkes and Mallows | Equation (5) | Flat              |
| 5 | Entropy             | Equation (6) | Flat              |
| 6 | Purity              | Equation (7) | Flat              |

For the other two validity indexes listed in Table 2, Entropy and Purity, we use the confusion matrix directly (see Figure 9).

Figure 9
Entropy and Purity Calculation

| Cluster | P_1 | P_2 | P_3 | P_4 | P_5 | Entropy | Purity |
| C_1     | 1   | 2   | 1   | 2   | 5   | 2.04037 | 0.4545 |
| C_2     | 4   | 2   | 3   | 0   | 3   | 1.95915 | 0.3333 |
| C_3     | 2   | 3   | 0   | 2   | 2   | 1.97494 | 0.3333 |
| C_4     | 2   | 2   | 0   | 0   | 4   | 1.5     | 0.5    |
| C_5     | 0   | 0   | 0   | 0   | 0   | 0       | 0      |
| Total   |     |     |     |     |     | 1.95915 | 0.4    |

As we can see, with respect to this data set the validity evaluation shows that we fail to obtain a strong fit. In other words, the results show that our clustering C possesses only a weak similarity with P.
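The counting-pairs step above can be checked with a few lines of code. This sketch, assuming the row/column orientation shown in Figure 6 (rows as clusters C_i, columns as partition groups P_j), recomputes SS, SD, DS, DD, M and three of the indexes from the confusion matrix; it is an illustration, not part of the paper's experiment.

```python
import numpy as np
from math import comb, sqrt

# Confusion matrix from Figure 6 (rows C_1..C_5, columns P_1..P_5).
N = np.array([[1, 2, 1, 2, 5],
              [4, 2, 3, 0, 3],
              [2, 3, 0, 2, 2],
              [2, 2, 0, 0, 4],
              [0, 0, 0, 0, 0]])

n  = int(N.sum())                                       # 40 points
M  = comb(n, 2)                                         # 780 pairs
SS = sum(comb(int(v), 2) for v in N.ravel())            # 39
SD = sum(comb(int(v), 2) for v in N.sum(axis=1)) - SS   # 146
DS = sum(comb(int(v), 2) for v in N.sum(axis=0)) - SS   # 136
DD = M - SS - SD - DS                                   # 459

print(SS, SD, DS, DD, M)
print("Rand index      :", (SS + DD) / M)                        # ~0.6385
print("Jaccard index   :", SS / (SS + SD + DS))                  # ~0.1215
print("Fowlkes-Mallows :", SS / sqrt((SS + SD) * (SS + DS)))     # ~0.2168
```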
Parallel

Task parallelism is a form of computation in which a calculation is divided into many tasks that run simultaneously in a distributed fashion, operating on the principle that a large-scale data set can be computed in small partitions while producing the same result as the sequential computation.

The time complexity of processing large-scale data sets generally grows with their size, and so does the cluster validation process that evaluates them. The precise time complexity of the sequential and parallel versions is not discussed in this research.

First, let us see whether the counting pairs matrix, the key element in calculating most of the external indexes, can be computed in parallel. Imagine that we have the data set of n objects D = {O_1, ..., O_n} discussed earlier, and suppose that C = {C_1, ..., C_K} and P = {P_1, ..., P_K'} represent the two partitions of the objects in D. Then imagine that the master machine splits the work among slave "tasks", so that each slave sees only part of the data and the global confusion matrix entry n_ij is the sum of the partial counts n_ij^(s) obtained by each slave s (see Figure 10 and Equation (10)):

$$n_{ij} = n_{ij}^{(1)} + n_{ij}^{(2)} + \cdots + n_{ij}^{(p)} \qquad (10)$$

Figure 10
Parallelization of the Data Set

Now, expanding the binomial coefficient of the pair counts over two such partitions, Equation (11), we obtain:

$$\binom{n_{ij}}{2} = \binom{n_{ij}^{(1)} + n_{ij}^{(2)}}{2} = \binom{n_{ij}^{(1)}}{2} + \binom{n_{ij}^{(2)}}{2} + n_{ij}^{(1)}\, n_{ij}^{(2)} \qquad (11)$$

Completing the square, a step derived from the quadratic formula, makes the cross term explicit: the pair counts of the separate partitions are coupled, so a direct parallel calculation of the counting pairs matrix is not currently possible. The confusion matrix itself, however, is simply the sum of the partial matrices, Equation (12):

$$n_{ij} = n_{ij}^{(1)} + n_{ij}^{(2)} \qquad (12)$$

Instead of parallelizing the counting pairs matrix, we therefore use the confusion matrix. By inspection, this method is easily parallelized: since the confusion matrix is a plain summation over the data, it satisfies the equation discussed above. First, the master machine creates a partition of the data set. Then each partition is given to a slave, which calculates its small partial confusion matrix. Finally, the slaves return their results to the master, which assembles the total confusion matrix and continues with the sequential calculation explained earlier.
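A minimal sketch of this master/slave scheme, assuming Python's multiprocessing module as a stand-in for the slave tasks and randomly generated labels as a stand-in for the real data set, is shown below. Each worker builds a partial confusion matrix from its chunk of label pairs, and the master sums the partials as in Equation (10) before continuing with the sequential index calculation.

```python
import numpy as np
from multiprocessing import Pool

K, KP = 5, 5   # number of clusters in C and groups in P (assumed known up front)

def partial_confusion(chunk):
    """Slave task: confusion matrix of one chunk of (cluster, partition) labels."""
    c, p = chunk
    m = np.zeros((K, KP), dtype=np.int64)
    np.add.at(m, (c, p), 1)          # m[c_k, p_k] += 1 for every object k in the chunk
    return m

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 1_000_000                                   # illustrative large data set
    C = rng.integers(0, K, size=n)                  # clustering labels
    P = rng.integers(0, KP, size=n)                 # external partition labels

    chunks = list(zip(np.array_split(C, 8), np.array_split(P, 8)))
    with Pool(processes=4) as pool:
        partials = pool.map(partial_confusion, chunks)   # Figure 10: work sent to slaves

    N = np.sum(partials, axis=0)                    # Equation (10): n_ij = sum of partials
    assert N.sum() == n                             # master resumes the sequential steps
    print(N)
```

Only the partial confusion matrices are exchanged between tasks; the pair counts and all six indexes are then derived from the summed matrix exactly as in the sequential case.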
CONCLUSION

Determining how well a clustering fits a large-scale data set is an important and challenging issue for clustering processes. In this work, a method to improve the complexity of the evaluation of large-scale data sets has been presented, based on an external criterion and inspired by an information-theoretic approach, in order to assess the validity of clustering solutions in parallel.

The experiment carried out on a synthetic data set, using six external indexes, shows that when the method is parallelized the theoretical complexity drops, meaning that the validity evaluation will be faster.

As has also been noted, another issue beyond the scope of this work is the problem of evaluating large-scale hierarchical clustering results. For future work, two possible approaches for validity methods were proposed: to define a pair-counting strategy that takes into account how close the points are in the dendrogram, and to compute several measures for different flat results obtained by "cutting" the dendrogram at various levels and report their average.

Since hierarchical clustering is a structured exploration of the data that promises further contributions to data mining, information retrieval, and many other fields, there is still a need to develop quality measures that assess the quality of hierarchical clustering.

References

[1]	Amigo, E., et al., "A Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints", Springer Standard Collection, 12(4), 2008, 461-486.
[2]	Meila, M., "Comparing Clusterings", Association for Computing Machinery, 2005, 577-584.
[3]	Fowlkes, E., et al., "A Method for Comparing Two Hierarchical Clusterings", Journal of the American Statistical Association, 78(383), 1983, 553-569.
[4]	Rand, W., "Objective Criteria for the Evaluation of Clustering Methods", Journal of the American Statistical Association, 66, 1971, 846-850.
[5]	Yeung, K., et al., "Details of the Adjusted Rand Index and Clustering Algorithms, Supplement to the Paper 'An Empirical Study on Principal Component Analysis for Clustering Gene Expression Data'", Bioinformatics, 17(9), 2001, 763.
[6]	Halkidi, M., et al., "Cluster Validity Methods: Part I", ACM SIGMOD Record, 31(2), 2002, 40-45.
[7]	Halkidi, M., et al., "Clustering Validity Checking Methods: Part II", ACM SIGMOD Record, 31(3), 2002, 19-27.
[8]	MacQueen, J., "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, 281-297.
[9]	Heeringa, W., et al., "Validating Dialect Comparison Methods", Classification, Automation, and New Media, 2000, 445-460.
[10]	Pearson, K., "Mathematical Contributions to the Theory of Evolution", Drapers' Company Research Memoirs, 1904, 13-17.
[11]	Hubert, L., et al., "Comparing Partitions", Journal of Classification, 2(1), 1985, 193-218.
[12]	Davies, D. L., et al., "A Cluster Separation Measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2), 1979, 224-227.
[13]	Dunn, J. C., "Well Separated Clusters and Optimal Fuzzy Partitions", Journal of Cybernetics, 4, 1974, 95-104.
[14]	Fraley, C., et al., "How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis", The Computer Journal, 41(8), 1998, 578-588.
[15]	Grimmer, J., et al., "Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology", PNAS, 108(7), 2011, 2643-2650.
[16]	Jain, A. K., et al., "Data Clustering: A Review", ACM Computing Surveys (CSUR), 31(3), 1999.
[17]	Raftery, A., "A Note on Bayes Factors for Log-Linear Contingency Table Models with Vague Prior Information", Journal of the Royal Statistical Society, 48(2), 1986, 249-250.
[18]	Zhao, Y., "Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering", Machine Learning, 55(3), 2004, 311-331.