

In this paper we present a modified hierarchical clustering algorithm based on the main idea of Chameleon. The effectiveness of the suggested approach is demonstrated by experimental results.

Key words: Chameleon, data mining, clustering


UDC 004.89

A MODIFIED MULTILEVEL APPROACH TO THE DYNAMIC HIERARCHICAL CLUSTERING FOR COMPLEX TYPES OF SHAPES

T.B. Shatovska
Ph.D., Docent*
Contact tel.: 095-825-27-45
E-mail: shatovska@gmail.com

O.I. Onoprienko*
Contact tel.: 063-855-50-74
E-mail: oksana.onoprienko@gmail.com

A.O. Fedorov*
Contact tel.: 066-392-50-55
E-mail: onoprienko.o@mail.ru

*Department of Software Engineering, Kharkiv National University of Radio Electronics, Lenina, 14, Kharkov, 61166

1. Introduction

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group in many applications. Data clustering is under vigorous development; contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research. As a branch of statistics, cluster analysis has been studied extensively for many years, focusing mainly on distance-based methods. Active themes of research include the scalability of clustering methods and the effectiveness of methods for clustering complex shapes and types of data. Chameleon is a clustering algorithm that explores dynamic modeling in hierarchical clustering. In its clustering process, two clusters are merged if the interconnectivity and closeness between the two clusters are highly related to the internal interconnectivity and closeness of objects within the clusters. The merge process based on the dynamic model facilitates the discovery of natural and homogeneous clusters and applies to all types of data as long as a similarity function is specified. Chameleon was derived from an observation about the weaknesses of two hierarchical clustering algorithms, CURE and ROCK: CURE and related schemes ignore information about the aggregate interconnectivity of objects in two different clusters, whereas ROCK and related schemes ignore information about the closeness of two clusters while emphasizing their interconnectivity.

In this paper, we present our experiments with the hierarchical clustering algorithm CHAMELEON on circular cluster shapes with different densities, using the hMETIS program, which performs multilevel k-way partitioning of hypergraphs, and the CLUTO Clustering Toolkit package, which merges clusters based on a dynamic model. In CHAMELEON, two clusters are merged only if the inter-connectivity and closeness between the two clusters are comparable to the internal inter-connectivity of the clusters and the closeness of items within the clusters. The methodology of dynamic modeling of clusters is applicable to all types of data as long as a similarity matrix can be constructed. We present a modified hierarchical clustering algorithm that measures the similarity of two clusters based on a new dynamic model for clusters of different shapes and densities.

The merging process using the dynamic model presented in this paper facilitates the discovery of natural and homogeneous clusters, not only circular cluster shapes.

2. Related work

In this section, we give a brief description of existing clustering algorithms.

A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.

The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups closest to one another, until all of the groups are merged into one, or until a termination condition holds. The divisive approach, also called the top-down approach, starts with all the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in its own cluster, or until a termination condition holds.
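To make the bottom-up scheme concrete, the minimal Python sketch below merges the two closest clusters under single linkage until a target number remains. It is an illustrative simplification with hypothetical toy data, not any of the specific algorithms discussed later.

```python
import numpy as np

def agglomerative(points, num_clusters=1):
    """Bottom-up clustering: start with singletons, repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(points))]
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)

    while len(clusters) > num_clusters:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])   # merge cluster b into cluster a
        del clusters[b]
    return clusters

# Hypothetical toy data: two well-separated groups.
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
print(agglomerative(pts, num_clusters=2))
```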

Hierarchical methods suffer from the fact that once a step is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not worrying about a combinatorial number of different choices. However, a major problem of such techniques is that they cannot correct erroneous decisions. There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object "linkages" at each hierarchical partitioning, such as in CURE and Chameleon, or (2) integrate hierarchical agglomeration and iterative relocation by first using a hierarchical agglomerative algorithm and then refining the result using iterative relocation, as in BIRCH [10].

Most clustering algorithms either favor clusters with spherical shape and similar sizes, or are fragile in the presence of outliers. CURE overcomes the problem of favoring clusters with spherical shape and similar sizes and is more robust with respect to outliers. CURE employs a novel hierarchical clustering algorithm that adopts a middle ground between centroid-based and representative-object-based approaches. Instead of using a single centroid or object to represent a cluster, a fixed number of representative points in space are chosen. The representative points of a cluster are generated by first selecting well-scattered objects for the cluster and then "shrinking" or moving them toward the cluster center by a specified fraction, or shrinking factor.
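The representative-point mechanism of CURE can be sketched as follows. This is a simplified illustration: the greedy farthest-point selection, the parameter `num_rep`, and the shrink factor value are our assumptions for demonstration, not CURE's exact procedure or tuned settings.

```python
import numpy as np

def cure_representatives(cluster_points, num_rep=4, shrink=0.3):
    """Pick well-scattered points of a cluster and shrink them toward its centroid."""
    centroid = cluster_points.mean(axis=0)
    # Start with the point farthest from the centroid, then greedily add
    # the point farthest from all representatives chosen so far.
    reps = [cluster_points[np.argmax(np.linalg.norm(cluster_points - centroid, axis=1))]]
    while len(reps) < min(num_rep, len(cluster_points)):
        d = np.min([np.linalg.norm(cluster_points - r, axis=1) for r in reps], axis=0)
        reps.append(cluster_points[np.argmax(d)])
    # Shrink every representative toward the centroid by the shrink factor.
    return [r + shrink * (centroid - r) for r in reps]

# Hypothetical toy cluster.
pts = np.array([[0, 0], [4, 0], [0, 4], [4, 4], [2, 2]], float)
print(cure_representatives(pts, num_rep=3, shrink=0.3))
```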

At each step of the algorithm, the two clusters with the closest pair of representative points (where each point in the pair is from a different cluster) are merged. ROCK is an alternative agglomerative hierarchical clustering algorithm that is suited for clustering categorical attributes. It measures the similarity of two clusters by comparing the aggregate interconnectivity of the two clusters against a user-specified static interconnectivity model, where the interconnectivity of two clusters is defined by the number of cross links between them, and a link is the number of common neighbors between two points. In other words, cluster similarity is based on the number of points from different clusters that have neighbors in common [2].

ROCK first constructs a sparse graph from a given data similarity matrix using a similarity threshold and the concept of shared neighbors. It then performs a hierarchical clustering algorithm on the sparse graph.
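The link computation underlying ROCK admits a short sketch. Here the Jaccard similarity on sets of categorical attribute values and the threshold `theta` are illustrative assumptions; the essential point is that cluster similarity is driven by counts of shared neighbors rather than raw distances.

```python
import numpy as np

def rock_links(records, theta=0.4):
    """Count common neighbors (links) between every pair of records, as in ROCK.

    Two records are neighbors if their similarity exceeds theta;
    links[i, j] is the number of neighbors shared by records i and j.
    """
    n = len(records)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = len(records[i] | records[j])
            sim[i, j] = len(records[i] & records[j]) / union if union else 0.0
    neighbors = (sim >= theta).astype(int)  # note: each record neighbors itself here
    return neighbors @ neighbors            # matrix product counts common neighbors

# Hypothetical categorical records represented as sets of attribute values.
records = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}]
print(rock_links(records))
```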

There are two major limitations of the agglomerative mechanisms used in existing schemes. First, these schemes do not make use of information about the nature of the individual clusters being merged. Second, one set of schemes (CURE and related schemes) ignores information about the aggregate interconnectivity of items in two clusters, whereas the other set ignores information about the closeness of two clusters as defined by the similarity of the closest items across the two clusters.

3. Overview of CHAMELEON: Clustering Using Dynamic Modeling

Chameleon is a clustering algorithm that explores dynamic modeling in hierarchical clustering [5]. Chameleon represents its objects based on the commonly used k-nearest neighbor graph approach. This graph representation of the data set allows CHAMELEON to scale to large data sets. Each vertex of the k-nearest neighbor graph represents a data object, and there exists an edge between two objects if one object is among the k most similar objects of the other. The k-nearest neighbor graph captures the notion that the neighborhood radius of an object is determined by the density of the region in which the object resides [9].
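A k-nearest-neighbor graph of this kind can be constructed as in the sketch below. The inverse-distance edge weight is our illustrative choice; the asymmetric variant keeps an edge if either endpoint is among the k nearest of the other, while the symmetric variant requires mutual nearest-neighborship.

```python
import numpy as np

def knn_graph(points, k=5, symmetric=False):
    """Build the weighted adjacency matrix of a k-NN graph.

    Edge weights are similarities (inversely related to distance),
    as in the graphs CHAMELEON operates on.
    """
    n = len(points)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]

    adj = np.zeros((n, n))
    for i in range(n):
        for j in nearest[i]:
            adj[i, j] = 1.0 / (1.0 + dist[i, j])
    if symmetric:
        adj = np.where((adj > 0) & (adj.T > 0), adj, 0.0)  # mutual neighbors only
    else:
        adj = np.maximum(adj, adj.T)                        # either direction suffices
    return adj
```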

During the next step, a sequence of successively smaller hypergraphs is constructed - the coarsening phase. Two primary schemes have been developed for selecting which groups of vertices will be merged together to form single vertices in the next-level coarse hypergraph. The first scheme, called edge-coarsening (EC) [1], selects the groups by finding a maximal set of pairs of vertices (i.e., a matching) that belong to many hyperedges. The second scheme, called hyperedge-coarsening (HEC) [3], finds a maximal independent set of hyperedges, and the set of vertices that belong to each hyperedge becomes a group of vertices to be merged together. At each coarsening level, the coarsening scheme stops as soon as the size of the resulting coarse graph has been reduced by a factor of 1.7 [6]. The third phase of the algorithm computes a k-way partitioning of the coarsest hypergraph such that the balancing constraint is satisfied and the partitioning objective function, such as mincut, is optimized. During the fourth phase, the uncoarsening phase, a partitioning of the coarser hypergraph is projected onto the next-level finer hypergraph, and a partitioning refinement algorithm is used to optimize the objective function without violating the partitioning balancing constraints. In the final phase, CHAMELEON determines the similarity between each pair of clusters by taking into account both their relative inter-connectivity and their relative closeness. It merges clusters that are well inter-connected as well as close together with respect to the internal inter-connectivity and closeness of the clusters. By selecting clusters based on both of these criteria, CHAMELEON overcomes the limitations of existing algorithms that look only at absolute inter-connectivity or absolute closeness.
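The matching step of the coarsening phase can be sketched as follows: each unmatched vertex is collapsed with an unmatched neighbor reached through the heaviest edge, and the weights of edges that collapse together are summed. This is a simplified illustration of the schemes in [1, 3, 6], operating on an ordinary weighted adjacency matrix rather than a full hypergraph data structure.

```python
import numpy as np

def coarsen_once(adj):
    """One coarsening level: heavy-edge matching, then collapse matched pairs."""
    n = len(adj)
    match = -np.ones(n, dtype=int)               # -1 marks unmatched vertices
    for v in np.random.permutation(n):
        if match[v] != -1:
            continue
        # Among unmatched neighbors of v, pick the one on the heaviest edge.
        weights = np.where(match == -1, adj[v], 0.0)
        weights[v] = 0.0
        u = int(np.argmax(weights))
        if weights[u] > 0:
            match[v] = match[u] = u              # collapse v and u into one multivertex
        else:
            match[v] = v                         # no unmatched neighbor: singleton
    labels = {root: i for i, root in enumerate(sorted(set(match)))}
    coarse = np.zeros((len(labels), len(labels)))
    for i in range(n):
        for j in range(n):
            a, b = labels[match[i]], labels[match[j]]
            if a != b:
                coarse[a, b] += adj[i, j]        # sum weights of collapsed edges
    return coarse, np.array([labels[match[v]] for v in range(n)])
```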

4. Performance Analysis

The overall computational complexity of CHAMELEON depends on the amount of time required to construct the k-nearest neighbor graph and the amount of time required to perform the two phases of the clustering algorithm. In [5] it was shown that CHAMELEON is not very sensitive to the value of k used for computing the k-nearest neighbor graph, to the value of MINSIZE for phase I of the algorithm, or to the scheme for combining relative inter-connectivity and relative closeness and its associated parameters, and it was able to discover the correct clusters for all of these combinations of values for k and MINSIZE. In this section, we present an experimental evaluation of clustering using the hMETIS hypergraph partitioning package for k-way partitioning of the hypergraph and for recursive bisection [4], and CLUTO 2.1.1, A Clustering Toolkit [7].

We experimented with five different data sets containing points in two dimensions: "disk in disk", t4.8k, t5.8k, t8.8k, t7.10k [8]. The first data set has a particularly challenging feature: the two clusters are very close to each other, and they have different densities and circular shapes. We chose the number of neighbors k = 5, 15, 40 and MINSIZE = 5%. Fig. 1 shows a) the results of the k-way partitioning of the hypergraph by the hMETIS package [8] and b) the merging process by the CLUTO package [8] with k=5 nearest neighbors. In both cases the genuine clusters were not correctly identified.

The data set t8.8k has eight clusters of different shapes, sizes and orientations, some of which are inside the space enclosed by other clusters. Moreover, it also contains random noise, such as a collection of points forming vertical streaks.

Fig. 2 (k=5 nearest neighbors) shows that hMETIS also computes the k-way partitioning of the hypergraph with mistakes near the border of the two classes, and that CLUTO cannot effectively merge clusters for this type of data set using an asymmetric k-NN graph with k=5. This means that the partitioning phase of the algorithm is very sensitive to the value of k for spherical cluster shapes and to the type of k-NN graph (symmetric or asymmetric). It is very important to choose an optimal value of k: only with k=16 or more, and only for a symmetric k-NN graph with edge weights equal to the number of common neighbors, do we obtain a final clustering with a minimal percentage of errors.


Fig. 1 Data set "disk in disk" with k=5 nearest neighbors and asymmetric k-NN: a) k-way partitioning by hMETIS; b) final clusters by CLUTO


Fig. 2 Data set "t8.8k" with k=5 nearest neighbors and asymmetric k-NN: a) k-way partitioning by hMETIS; b) final clusters by CLUTO

5. Modeling the cluster similarity

As we remarked above, CHAMELEON operates on a sparse graph in which nodes represent data items and weighted edges represent similarities among the data items (a symmetric graph) [5]. In our algorithm, during the first phase we construct an asymmetric k-NN graph: an edge exists between two points if one of them is among the k closest neighbors of the other. Note that the weight of an edge connecting two objects in the k-NN graph is a similarity measure between them, usually a simple distance measure (or a value inversely related to their distance).

In our algorithm the weight of an edge is computed as a weighted distance between objects. Fig. 3 shows the k-NN graph for the data set "disk in disk" with k=5. During the coarsening phase a set of smaller hypergraphs is constructed. In the first stage of the coarsening process we choose the set of vertices with maximum degree and match each of them with a random neighbor. In the subsequent stages we visit each vertex in a random order and match it with an adjacent vertex via the heaviest edge. Note that usually the weight of an edge connecting two nodes in a coarsened version of the graph is the number of edges in the original graph that connect the two sets of original nodes collapsed into the two coarse nodes. In our case we compute the weight of a hyperedge as the sum of the weights of all edges that collapse onto each other during the coarsening step. We stop the coarsening process at each level as soon as the number of multivertices of the resulting coarse hypergraph has been reduced by a constant factor less than 2 (Fig. 3).


Fig. 3 Data set "disk in disk": a) k-NN graph with k=5; b) hypergraph after the third level of coarsening
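The level-by-level process just described, with its stopping rule, can be sketched as follows, assuming the `coarsen_once` routine from the earlier sketch; the reduction constant 1.7 and the coarsest size `min_size` are illustrative values (any constant less than 2 fits the rule above).

```python
def coarsen(adj, min_size=32, reduction=1.7):
    """Build successively smaller graphs until `min_size` multivertices remain.

    Within each level, matching and collapsing repeat until the vertex
    count has dropped by the reduction factor.
    """
    levels = [adj]
    while len(levels[-1]) > min_size:
        current = levels[-1]
        target = len(current) / reduction
        while len(current) > target:
            coarse, _ = coarsen_once(current)
            if len(coarse) == len(current):      # nothing could be matched
                return levels
            current = coarse
        levels.append(current)
    return levels
```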

At the next level of the algorithm we produce a set of small hypergraphs using the k-way multilevel paradigm [6]. We start the partitioning process by choosing the k heaviest multivertices, where k = 8, 16, 32. After that we gather, one by one, all neighbors of each chosen vertex and obtain the initial partitioning with respect to the balancing constraint. The problem of computing an optimal bisection of a hypergraph is NP-hard. One of the most commonly used objective functions is to minimize the hyperedge-cut of the partitioning, i.e., the total number of hyperedges that span multiple partitions [6]. One of the most accurate partitioning algorithms is the Kernighan-Lin / Fiduccia-Mattheyses algorithm, in which, during each pass, the algorithm repeatedly finds a pair of vertices, one from each of the subdomains, and swaps their subdomains. The pairs are selected so as to give the maximum improvement in the quality of the partitioning (even if this improvement is negative). Once a pair of vertices has been moved, neither is considered for movement in the rest of the pass. When all of the vertices have been moved, the pass ends. At this point, the state of the bisection at which the minimum edge-cut was achieved is restored. In our experiments we use a greedy refinement algorithm developed by George Karypis [6], but as the gain function for each vertex we compute the difference between the sum of the weights of the edges incident on the vertex that go to the other partition and the sum of the weights of the edges that stay within the partition. We choose the vertex with the maximum gain and move it only if the move results in a positive gain, so we work only with boundary vertices.
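Our gain function and the boundary-only move rule can be sketched as follows. This is a two-way illustration of the idea (in the k-way setting the best neighboring partition would be chosen), and the balancing constraint is omitted for brevity; it is not the hMETIS implementation itself.

```python
import numpy as np

def gain(adj, part, v):
    """External minus internal incident edge weight for vertex v.

    A positive gain means moving v to the other partition reduces the cut.
    """
    external = adj[v][part != part[v]].sum()
    internal = adj[v][part == part[v]].sum() - adj[v, v]
    return external - internal

def greedy_refine(adj, part, max_passes=10):
    """Repeatedly move the boundary vertex with the largest positive gain."""
    for _ in range(max_passes):
        # Boundary vertices: those with at least one edge crossing the cut.
        boundary = [v for v in range(len(adj))
                    if (adj[v][part != part[v]] > 0).any()]
        scored = [(gain(adj, part, v), v) for v in boundary]
        if not scored:
            break
        best_gain, best_v = max(scored)
        if best_gain <= 0:                       # move only on strict improvement
            break
        part[best_v] = 1 - part[best_v]
    return part
```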

After partitioning the hypergraph into a large number of small parts, we start to merge pairs of clusters for which both the relative inter-connectivity and the relative closeness are high [5]. In our research we use George Karypis' formula to compute the similarity between sub-clusters. As Fig. 1b shows, for the data set "disk in disk" this yields incorrect clustering results. Thus we suggest modifying the above-mentioned expression by replacing the relative inter-connectivity with a new expression that estimates the average weight of the edges in each sub-graph and relates the number of edges that connect the two partitions to the number of edges that stay within the smaller partition.
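The text above specifies the two quantities of the modified criterion but not how they are combined; the sketch below multiplies them, which is our illustrative assumption. `a_idx` and `b_idx` are the vertex index lists of the two sub-clusters in the weighted adjacency matrix.

```python
import numpy as np

def merge_score(adj, a_idx, b_idx):
    """Hedged sketch of the modified similarity between two sub-clusters.

    Combines (1) the average cross-edge weight relative to the average
    internal edge weight of each sub-graph and (2) the number of cut
    edges relative to the number of edges inside the smaller partition.
    """
    cross = adj[np.ix_(a_idx, b_idx)]
    within_a = adj[np.ix_(a_idx, a_idx)]
    within_b = adj[np.ix_(b_idx, b_idx)]

    def avg_weight(block):
        w = block[block > 0]
        return w.mean() if w.size else 0.0

    internal_avg = 0.5 * (avg_weight(within_a) + avg_weight(within_b))
    closeness = avg_weight(cross) / (internal_avg + 1e-12)

    # Undirected within-blocks store each edge twice, hence the halving.
    smaller_edges = min(np.count_nonzero(within_a), np.count_nonzero(within_b)) / 2
    connectivity = np.count_nonzero(cross) / (smaller_edges + 1e-12)
    return closeness * connectivity
```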

Fig. 4 shows the correct clustering results for the same data set "disk in disk" using our suggested expression. For the other data sets mentioned above we obtained similarly accurate results. In all experiments we used k=5; in our approach the correctness of the clustering does not depend on the value of k or on the type of k-NN graph.


Fig. 4 Clustering results using the new approach to sub-cluster merging, k=5: a) data set "disk in disk"; b) data set "t8.8k"

6. Conclusion

In this paper, we presented our experiments with the hierarchical clustering algorithm CHAMELEON on circular cluster shapes with different densities, using the hMETIS program, which performs multilevel k-way partitioning of hypergraphs, and the CLUTO Clustering Toolkit package, which merges clusters based on a dynamic model. In CHAMELEON, two clusters are merged only if the inter-connectivity and closeness between the two clusters are comparable to the internal inter-connectivity of the clusters and the closeness of items within the clusters. The methodology of dynamic modeling of clusters is applicable to all types of data as long as a similarity matrix can be constructed.

Experimental results showed that hMETIS computes the k-way partitioning of the hypergraph with mistakes near the border of the two classes, and that CLUTO cannot effectively merge clusters using an asymmetric k-NN graph with k=5.

We presented a modified hierarchical clustering algorithm that measures the similarity of two clusters based on a new dynamic model for clusters of different shapes and densities. The merging process using this dynamic model facilitates the discovery of natural and homogeneous clusters, not only circular cluster shapes.

Experimental results showed that this method is not sensitive to the value of k and does not require constructing a specific type of k-nearest neighbor graph.

Bibliography

1. Alpert C. J. Multilevel circuit partitioning [Text] / C. J. Alpert, J. H. Huang, A. B. Kahng // 34th ACM/IEEE Design Automation Conference - Anaheim, 1997 - pp. 530-533.

2. Guha S. ROCK: Robust Clustering using links [Text] / S. Guha, R. Rastogi, K. Shim // Proceedings of the International Conference on Data Engineering ICDE'99 - San Diego, 1999.

3. Karypis G. Multilevel hypergraph partitioning: Application in VLSI domain [Text] / G. Karypis, R. Aggarwal, V. Kumar, Sh. Shekhar // Proceedings of the Design and Automation Conference - 1997.

4. Karypis G. hMETIS 1.5.3: A hypergraph partitioning package. Technical report [Text] / G. Karypis, V. Kumar // Department of Computer Science, University of Minnesota - 1998.

5. Karypis G. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling [Text] / G. Karypis, E.-H. Han, V. Kumar // IEEE Computer - 1999 - 32(8) - pp. 68-75.

6. Karypis G. Multilevel k-way hypergraph partitioning [Text] / G. Karypis, V. Kumar // Proceedings of the Design and Automation Conference - 1999.

7. Karypis G. CLUTO 2.1.1. A Clustering Toolkit. Technical report [Text] // Department of Computer Science, University of Minnesota - 2003.

8. Computer Science & Engineering [Electronic resource] - Access mode: WWW/URL: http://www.cs.umn.edu/~karypis - Title from the screen.

9. Mitchell T. M. Machine Learning [Text] / T. M. Mitchell - McGraw Hill, 1997 - 414 p.

10. Zhang T. BIRCH: an efficient data clustering method for very large databases [Text] / T. Zhang, R. Ramakrishnan, M. Livny // SIGMOD'96 - 1996 - pp. 103-114.
