
Вычислительные технологии

Vol. 7, No. 1, 2002

CLUSTER ALGORITHMS: THEORY AND METHODS*

D. Akume
Computer Science Department, University of Buea, Cameroon
e-mail: daniel.akume@minesup.gov.cm

G.-W. Weber
Faculty of Mathematics, Chemnitz University of Technology, Germany
e-mail: weber@mathematik.tu-chemnitz.de

The aim of this work is to study the suitability of certain techniques for clustering bank loan contracts. Two optimization problems are presented, which lead to two cluster algorithms. The methods are compared on the basis of exact estimates and of their space and time complexity characteristics.

1. Statement of problem

In economics, social problems, science and technology, numerous questions of clustering finite sets arise. The present survey article was stimulated by the cluster analysis from loan banking [18], done in cooperation between German “Bausparkassen” and the Center for Applied Computer Science, Cologne (ZAIK). Here, the contracts (accounts) with completely known “saving phase” have to be partitioned for purposes of liquidity planning. Our article gives a small survey and an outlook as well.

A solution to the cluster problem is usually to determine a partitioning that satisfies some optimality criterion. This optimality criterion may be given in terms of a function f which reflects the levels of desirability of the various partitions or groupings. Such a function is called the objective function.

Assuming there are n accounts (objects) and m features for each account, we seek to partition these n accounts in m-dimensional space into K meaningful clusters. The clustering is achieved by minimizing intracluster dissimilarity and maximizing intercluster dissimilarity. Mathematically, the problem can be formulated as an optimization problem as follows:

For a given or suitably chosen $K \in \mathbb{N}$,

\[ \min f(C) \tag{1} \]

subject to $C = (C_1, \ldots, C_K)$, $C_1 \cup \ldots \cup C_K = \Pi$,

whereby $\Pi = \{x_1, \ldots, x_n\}$ is the set of objects to be grouped into K disjoint clusters $C_k$. Finally, f is a nonnegative objective function. Its minimization aims at optimizing the quality of the clustering.

*Supported by DAAD under grant number A/001/31321, at Chemnitz University of Technology.

The authors are responsible for possible misprints and the quality of translation.

© D. Akume, G.-W. Weber, 2002.


2. The method

Generally, it is possible to measure similarity and dissimilarity in a number of ways, such that the quality of the partition will depend on the function f in (1). In this paper, we investigate clustering based on two different choices for f.

In our first method,

\[ f_{\mathrm{MST}}(C) := \max_{x_i, x_j} d(x_i, x_j) \; - \; \min_{\mu \ne \nu} \, \min_{x_i \in C_\nu,\, x_j \in C_\mu} d(x_i, x_j). \tag{2} \]

The minimization of $f_{\mathrm{MST}}$ means the maximization of the minimum distance between any two clusters. Note that the first term on the right-hand side does not depend on C. The criterion $f_{\mathrm{MST}}$ can be interpreted1 as the “compactness of the clustering”. There are efficient (polynomial time) algorithms for this kind of problem.2 In our second method,

\[ f_E(C) := \sum_{k=1}^{K} \sum_{i=1}^{n_k} \sqrt{\sum_{j=1}^{m} (x_{ij} - z_{kj})^2}, \tag{3} \]

which measures the distance of the objects from the centroids $z_k$ of their respective clusters. Hereby,

\[ n_k := |C_k|, \qquad C_k = \{x_{1k}, \ldots, x_{n_k k}\} \qquad (k = 1, \ldots, K) \]

and

\[ d(x_i, x_j) := \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2} \]

is the Euclidean metric. The problem of minimizing $f_E$ has been shown to be NP-hard.3

A direct way of solving the cluster problem is to evaluate the objective function for each choice of the clustering alternatives and then choose the partition yielding the optimal (minimum) value of the objective function. However, this procedure is not practical even for small values of n and K as the following lemma4 shows.

Lemma 1. The number of ways of partitioning n elements into K groups is determined by Stirling's number of the second kind S:

\[ S(n, K) = \frac{1}{K!} \sum_{i=1}^{K} (-1)^{K-i} \binom{K}{i} i^n . \]

That is, with n = 10 and K = 4, there are 34105 ways of partitioning 10 objects into 4 clusters. This number becomes computationally explosive as n increases, making it impractical to solve for the optimal partition by complete enumeration; the running time of such an exhaustive search grows exponentially. This leads us to use heuristic algorithms, which in many cases provide only good approximate solutions.
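As a quick check of Lemma 1, the Stirling number can be evaluated directly from the formula above; the following short Python sketch reproduces the value S(10, 4) = 34105 quoted in the text.

```python
from math import comb, factorial

def stirling2(n: int, K: int) -> int:
    """Stirling number of the second kind, via the explicit formula of Lemma 1."""
    return sum((-1) ** (K - i) * comb(K, i) * i ** n for i in range(K + 1)) // factorial(K)

print(stirling2(10, 4))  # 34105 ways of partitioning 10 objects into 4 clusters
```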

1See Vannahme, 1996, [18].

2See Vannahme, [18].

3No one has so far been able to develop any polynomial time decision algorithm for this problem. It has been shown that it corresponds to the hardest problems in the NP-class. See [18], page 58: reduction of 3SAT problem. For more information about complexity see Garey and Johnson, 1979, [10].

4A proof of this lemma can be found in [2].

3. Clustering algorithms

Hierarchical algorithms. Clustering techniques are referred to as hierarchical if the resultant subdivision has an increasing number of nested clusters. Otherwise, they are non-hierarchical.

Hierarchical techniques can be further classified as either divisive or agglomerative. A divisive (deglomerative) method begins with all objects in one cluster. The cluster is gradually broken down into smaller and smaller clusters.

In an agglomerative hierarchical clustering process one starts with each of the n objects in a single object cluster and groups the two nearest (or most similar) objects into a cluster, thus reducing the number of clusters to n — 1. The process is repeated until all objects have been grouped into the cluster containing all n objects.

A formal definition of a hierarchical technique for our purposes is presented as follows:

Definition 1. Let $\Pi = \{x_1, \ldots, x_n\}$ be a set of n objects. A system $S = (C_1, C_2, \ldots, C_K)$ of subsets of $\Pi$ is called a hierarchy of $\Pi$ if all sets $C_1, C_2, \ldots, C_K \subseteq \Pi$ are mutually different and if for any two sets $C_k, C_l \subseteq \Pi$ with $C_k \ne C_l$ only one of the following three possibilities can occur:

\[ C_k \cap C_l = \emptyset \quad \text{or} \quad C_k \subset C_l \quad \text{or} \quad C_l \subset C_k . \]

The sets in $S = (C_1, C_2, \ldots, C_K)$ are known as the classes of $\Pi$.

Hierarchical techniques are used when the number of clusters is not specified. A serious disadvantage of this technique is that the fusing or dividing process cannot be reversed.

We shall be concerned in this paper with a hierarchical agglomerative method.

Partitioning algorithms. Clustering techniques are referred to as partitioning if the process leads to the object set $\Pi$ being grouped into K clusters. The number of clusters K is predetermined, and one starts with an arbitrary initial partition into K clusters. The idea is to improve on this partition step by step. The advantage of this method is that objects can move freely from one cluster to another. By so doing, the final partition will be good even if the initial partition was poor. The difficulty is to fix a priori a reasonable number of clusters K. This is a difficult question; the best thing to do is to vary K suitably as a parameter.

3.1. The single-link5 hierarchical clustering algorithm

The hierarchical agglomerative6 single-link algorithm, which is used to solve optimization problem (1) with the objective function (2), can be interpreted within the context of graph theory as the search for a minimum spanning tree7 (MST) from which edges are deleted in order of decreasing length [1].

The connected sets after deletion are the single-link clusters. The order of deletion and the structure of the MST ensure that the clusters will be nested into a hierarchy.

5Single-link methods are hierarchical techniques which search for d-clusters in a graph defined by the set $\Pi = \{x_1, \ldots, x_n\}$ of objects (vertices).

6At each step, an agglomerative single-link algorithm fuses the two closest connected components into a new connected component. It starts with each object as a cluster and ends with all objects in one cluster. The distance between two components or clusters is defined as follows: $\mathrm{dist}(C_k, C_l) := \min_{x_i \in C_k,\, x_j \in C_l} d(x_i, x_j)$.

7Given n points in the m-dimensional Euclidean space, a tree spanning these nodes (vertices) is a set of edges joining pairs of vertices such that (1.) no cycles occur, (2.) each point is visited by at least one line, (3.) the tree is connected. Here, (3.) means: any two points are connected by a finite sequence of neighbouring edges (i. e., by a “polygon”). The length of a tree is the sum of the lengths of the edges which make up the tree. The minimum spanning tree is then defined to be the tree of minimum length. These ideas come from graph theory.

The objects to be clustered are regarded as the set of vertices $\Pi$ of a graph. These, together with the set $E(\Pi)$ of induced edges (i.e., all possible edges), form a complete graph $G = (\Pi, E(\Pi))$. The length of the edge between any two vertices $x_i$ and $x_j$ is the dissimilarity $d(x_i, x_j)$ between both objects.

Let $A = (x_{ij})_{i \in \{1,\ldots,n\},\, j \in \{1,\ldots,m\}}$ be an object matrix of size $n \times m$ with n objects, each with m attributes8. In our specific case of the loan bank (“Bausparkassen”) the objects represent accounts and, in general, n ranges from 500 000 to 3 000 000. A hierarchical clustering method is a procedure for transforming the dissimilarity matrix into a sequence of nested partitions [11]. The direct input to the hierarchical clustering is the dissimilarity matrix D, which is usually generated from an object matrix A. Each entry of the dissimilarity matrix $D = (d_{ik})_{i,k \in \{1,\ldots,n\}}$ represents the pairwise dissimilarity of the objects indexed by the rows and columns of the object matrix A. Because the Euclidean distance is the most common Minkowski metric, we use the Euclidean distance to measure the dissimilarity between objects. That is,

\[ d_{i,k} = \sqrt{\sum_{j=1}^{m} (x_{ij} - x_{kj})^2}, \qquad 1 \le i, k \le n. \]

The output of a hierarchical clustering algorithm can be represented by a dendrogram (i.e., a level tree of nested partitions). Each level (denoted $l_i$, $1 \le i \le n$) contains exactly one node (unlike a regular tree), each node representing a cluster. We can cut a dendrogram at any level to obtain a clustering.

Definition 2. Two vertices $x_i, x_j$ of a graph are said to be d-connected if there exists a sequence of vertices $x_i = x_{i_1}, \ldots, x_{i_k} = x_j$ such that the distance between $x_{i_l}$ and $x_{i_{l+1}}$, $l \in \{1, \ldots, k-1\}$, is always less than d.

Definition 3. A subset C of vertices of $\Pi$ is called a d-cluster if both of the following hold:

1) each two vertices from C are d-connected,

2) if a vertex $x_i \in C$ is d-connected with a vertex $x_j$, then $x_j$ is also in C.

Therefore, a d-cluster is a connected component with each edge having length less than d. Prim's theorem [16] says

Theorem 1. Let $\Pi$ be a set of vertices and T a minimum spanning tree of $G = (\Pi, E(\Pi))$. Moreover, let $C_1, \ldots, C_K$ be the clusters we obtain after deleting from T all edges longer than d. Then $C_1, \ldots, C_K$ are all d-clusters.

Therefore, the problem is reduced to one of determining a minimum spanning tree of $\Pi$ and deleting all edges longer than d.

The problem of finding the minimum spanning tree can be solved by an efficient algorithm which leads to an optimal solution [1].

In the following, we present Prim’s algorithm for the single-link technique. The idea is to first obtain the minimum spanning tree and delete the longest edges from it successively.

Step 1. Fix an upper bound $r_{\max} \in \{1, \ldots, n\}$ for the number of clusters. Put all objects in one cluster.

Step 2. Sort the $r_{\max}$ longest edges in decreasing order. Put t := 1.

Step 3. Delete the longest edge from the tree and place each of the objects of the resultant subtrees in a separate cluster.

Step 4. If $t = r_{\max}$: stop; else put t := t + 1 and go to step 3.

8See [20].

This algorithm requires O(n²) time complexity and O(n²) space complexity to group n objects into K clusters [16].
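To make the single-link procedure concrete, the following Python sketch (our own illustration; the function names are not taken from [16] or [18]) builds the complete Euclidean graph, computes a minimum spanning tree with Prim's algorithm and then deletes the longest MST edges, so that the remaining connected components are the single-link clusters.

```python
import math

def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph; returns MST edges as (length, i, j)."""
    n = len(points)
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    in_tree = [False] * n
    best = [math.inf] * n        # cheapest known edge connecting each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        v = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[v] = True
        if parent[v] >= 0:
            edges.append((best[v], parent[v], v))
        for w in range(n):
            if not in_tree[w]:
                d = dist(points[v], points[w])
                if d < best[w]:
                    best[w], parent[w] = d, v
    return edges

def single_link_clusters(points, num_clusters):
    """Delete the num_clusters - 1 longest MST edges; the remaining components are the clusters."""
    edges = sorted(prim_mst(points))             # ascending by edge length
    keep = edges[: len(points) - num_clusters]   # drop the num_clusters - 1 longest edges
    parent = list(range(len(points)))            # union-find over the kept edges
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for _, i, j in keep:
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

# toy example: two well separated groups of points
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2)]
print(single_link_clusters(pts, 2))  # objects 0-2 share one label, objects 3-4 another
```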

3.2. K-means

Optimization problem (1) with the objective function (3) is usually referred to as the error sum of squares or centroid clustering method.

The centroid method minimizes the objective function $f_E$ for a given K over all partitions $C = (C_1, \ldots, C_K)$ of the n objects $\Pi = \{x_1, \ldots, x_n\}$ into K clusters. Here, $z_j$ is the centroid of the cluster $C_j$ ($j = 1, \ldots, K$), defined below. A general algorithmic partitioning technique for this problem can be stated as follows:

Phase 1. Put the number K of clusters equal to a selected integer and choose the maximum number of iterations. Furthermore, choose an initial partition and compute the centroids.

Phase 2. Create a new partition by assigning each object to its nearest centroid.

Phase 3. Compute the new centroids.

Phase 4. Repeat phases 2 and 3 until either the objective function no longer improves or the maximum number of iterations is attained.


Phase 5. Delete empty clusters.

Minimum distance method. The process in phases 2 and 3 is referred to as the minimum distance method and produces a minimum distance partition.

Definition 4. A partition $C = (C_1, \ldots, C_K)$ is called a minimum distance partition if each object is assigned to the cluster from whose centroid its Euclidean distance is the shortest:

\[ f_E(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - z_k \|^2 . \]

Note that here we have eliminated the square root from (3). The centroid $z_j$ of each cluster $C_j$ is defined componentwise as follows:

\[ z_{ji} := \frac{1}{|C_j|} \sum_{x \in C_j} x_i \qquad \text{for } i = 1, \ldots, m. \]

This definition of the centroid as a mean value vector of the elements in a cluster necessarily guarantees optimality for that single cluster.9 The problem of obtaining a partition C of all objects in K clusters minimizing (3) has been proven to be NP-hard [14]. Therefore we are left with the choice of heuristic algorithms which yield good (not necessarily optimal) solutions in acceptable time.
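The minimum distance iteration described in phases 1 to 5 can be sketched in a few lines. The following Python fragment is only an illustration of the scheme under the notation above (a simple round-robin initial partition, squared Euclidean distances, empty clusters dropped); it is not the implementation used in [18].

```python
def centroid(cluster):
    """Mean vector of the objects in a cluster."""
    m = len(cluster[0])
    return [sum(x[j] for x in cluster) / len(cluster) for j in range(m)]

def sq_dist(x, z):
    return sum((a - b) ** 2 for a, b in zip(x, z))

def minimum_distance_method(objects, K, max_iter=100):
    """Phases 1-5: initial partition, then alternate reassignment and centroid recomputation."""
    labels = [i % K for i in range(len(objects))]              # Phase 1: simple initial partition
    centroids = []
    for _ in range(max_iter):
        clusters = [[x for x, l in zip(objects, labels) if l == k] for k in range(K)]
        centroids = [centroid(c) for c in clusters if c]       # Phases 3/5: empty clusters dropped
        new_labels = [min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))
                      for x in objects]                        # Phase 2: assign to nearest centroid
        if new_labels == labels:                               # Phase 4: no further improvement
            break
        labels = new_labels
    return labels, centroids

objs = [(1.0, 2.0), (1.2, 1.9), (8.0, 8.0), (8.1, 7.9), (0.9, 2.1)]
print(minimum_distance_method(objs, K=2)[0])   # [0, 0, 1, 1, 0]
```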


The following theorem [17] implies that clustering due to minimum distance partition is uniquely determined by the cluster centroids.

9See [17], p. 18.

Theorem 2. The minimum distance method produces a separating hyperplane between any two clusters with different centroids of a minimum distance partition.10

In fact, the proof verifies that the hyperplanes

\[ \{ x \in \mathbb{R}^m \mid \| x - z_q \|^2 = \| x - z_p \|^2 \} \qquad (q \ne p) \]

fulfill the desired properties.

Theorem 3. The iterated minimum distance method converges.11

Here, we repeat the proof.

Proof. Let $C^{(t)}$ be the partition after t iterations, with centroids $z_k^{(t)}$. It follows that

\[ f_E(C^{(t)}) = \sum_{k=1}^{K} \sum_{x_i \in C_k^{(t)}} \| x_i - z_k^{(t)} \|^2 \;\ge\; \sum_{k=1}^{K} \sum_{x_i \in C_k^{(t)}} \min_{l} \| x_i - z_l^{(t)} \|^2 = \sum_{k=1}^{K} \sum_{x_i \in C_k^{(t+1)}} \| x_i - z_k^{(t)} \|^2 \;\ge\; \sum_{k=1}^{K} \sum_{x_i \in C_k^{(t+1)}} \| x_i - z_k^{(t+1)} \|^2 = f_E(C^{(t+1)}). \]

This algorithm will stop since $f_E(C)$ falls monotonically, it is bounded below by zero, and the number of ways to group n objects in K clusters is finite. ■

The exchange procedure. This technique serves to improve the minimum distance method by systematically moving objects to new clusters. In order to assess the effect of such a move on the objective function, it is necessary to have updating formulae that indicate the change in the square-error norm and the centroid of a cluster $C_p$ when an object $x_i$ is either added to or removed from it.

The updating formula when an object is added. Let $C_q = C_p \cup \{x_i\}$, $x_i \notin C_p$. The new centroid of $C_q$ will be

\[ z_q = \frac{1}{n_q} \sum_{x \in C_q} x = \frac{1}{n_p + 1} \left( \sum_{x \in C_p} x + x_i \right) = \frac{1}{n_p + 1} (n_p z_p + x_i). \tag{5} \]

For the error sum of squares norm $f_E(C_q)$ of the cluster $C_q$ we obtain the following formula:

\[ f_E(C_q) = f_E(C_p) + \frac{n_p}{n_p + 1} \| x_i - z_p \|^2 . \tag{6} \]

The updating formula when an object is removed. Let $C_q = C_p \setminus \{x_i\}$, $x_i \in C_p$, $n_p > 1$. Then

\[ z_q = \frac{1}{n_q} \sum_{x \in C_q} x = \frac{1}{n_p - 1} (n_p z_p - x_i) \tag{7} \]

and

\[ f_E(C_q) = f_E(C_p) - \frac{n_p}{n_p - 1} \| x_i - z_p \|^2 . \tag{8} \]

The exchange procedure can now be defined by means of these updating formulas.

10For a proof see [17], p. 34.

11 See Spaeth, 1983, [17], p. 29.


Step 1. Fix the number of clusters and the maximum number of iterations. Choose an initial partition and compute the centroids.

Step 2. Create a new partition by assigning each object to the closest centroid.

Step 3. For each object $x_i \in C_k$, test systematically whether a cluster $C_l$ exists for which the square-error norm improves if $x_i$ is moved into it, i.e., whether

\[ \frac{n_l}{n_l + 1} \| x_i - z_l \|^2 < \frac{n_k}{n_k - 1} \| x_i - z_k \|^2 \]

occurs. If at least one such cluster exists, move $x_i$ to the cluster that causes the maximum improvement in the objective function. Compute the new centroids.

Step 4. Repeat steps 2 and 3 until either no further exchanges improve the objective function or the maximum number of iterations is attained.

Theorem 4. The exchange algorithm converges to a minimum distance partition. This partition need not be globally optimal; it may only be a local minimum.

The proof of this theorem can be found in [18], p. 71.
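For illustration, one pass of the exchange procedure can be written directly from the updating formulas: the net change of the objective when moving $x_i$ from $C_k$ to $C_l$ is the cost added by (6) minus the cost removed by (8), and the centroids are updated with (5) and (7). The Python sketch below assumes a current partition given by labels, cluster sizes and centroids; all helper names are our own, not taken from [18].

```python
def sq_dist(x, z):
    """Squared Euclidean distance between an object and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(x, z))

def exchange_pass(objects, labels, centroids, sizes):
    """One pass of the exchange procedure; updates labels, centroids and sizes in place."""
    improved = False
    for i, x in enumerate(objects):
        k = labels[i]
        if sizes[k] <= 1:                      # formula (8) requires n_p > 1
            continue
        cost_out = sizes[k] / (sizes[k] - 1) * sq_dist(x, centroids[k])     # gain by (8)
        best_l, best_gain = None, 0.0
        for l in range(len(centroids)):
            if l == k:
                continue
            cost_in = sizes[l] / (sizes[l] + 1) * sq_dist(x, centroids[l])  # cost by (6)
            if cost_out - cost_in > best_gain:
                best_l, best_gain = l, cost_out - cost_in
        if best_l is not None:
            # move x_i to the cluster with the maximum improvement; update by (5) and (7)
            centroids[best_l] = [(sizes[best_l] * c + v) / (sizes[best_l] + 1)
                                 for c, v in zip(centroids[best_l], x)]
            centroids[k] = [(sizes[k] * c - v) / (sizes[k] - 1)
                            for c, v in zip(centroids[k], x)]
            sizes[best_l] += 1
            sizes[k] -= 1
            labels[i] = best_l
            improved = True
    return improved
```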

Running time and space considerations. The space complexity increases only linearly, since only the objects themselves are stored. The running time of the algorithm is O(cnK), with c being an iteration count that depends on the number of objects, the norm and the structure of the data. Each iteration requires O(n + K) steps to compute the centroids and O(nK) steps to compute a new minimum distance partition. The exchange procedure also requires O(nK) steps [18].

Based on time and space complexity considerations, therefore, the centroid algorithm is appropriate for handling large data sets.

4. The clustering process

The clustering results are to be used to carry out simulations of future customer behaviour. The aim is to identify groups of customer accounts that behave similarly within a specific period based on known data from a real loan bank “Bausparkasse”. This should enable the forecasting of customer behaviour for a future period.

Let $A = (x_{ij})_{i \in \{1,\ldots,n\},\, j \in \{1,\ldots,m\}}$, as suggested above, be an object matrix of type $n \times m$ with n objects, each with m attributes. In our specific case of a loan bank the objects represent accounts. In general, n ranges from 500 000 to 3 000 000.

The relevant account attributes are separated into two groups — the nominal and the rational ones.12 The entire set of accounts is first of all filtered into subgroups based on the following nominal attributes: account owner natural or legal entity13, tarif14, tax discounts15, loan advance16, major account17 and phase18.

12Vannahme [18], p. 80. If data is on a nominal scale, then one can only determine whether two values are the same or different. On the other hand, if data is rational, then one can even measure distances between objects.

13Legal entities are allowed to own accounts at the loan bank.

14See Vannahme, [18], p. 133: A tarif describes the conditions underlying each class of accounts, e.g., loan interest etc.

15The German government grants incentives to some tarifs.

16There is the possibility of obtaining an early loan at an interest rate higher than that guaranteed by the contract at maturity.

17Here we mean accounts that at the beginning make a huge one-off saving and thereafter continue saving in very small bits.

18The phase considered here is the saving phase. In this phase the customer is, to a great extent, free in his decisions.

The data is then partitioned using either the centroid or the single-link method. We dwell on the following five attributes: nominal amount, savings, saving coefficient, eligibility coefficient and age of customer. Since the attributes are not all in the same units, they have to be scaled and normalized in order to be comparable.19

The following attributes are worthy of explanation:

• Nominal amount: amount to be saved by customer plus loan to be obtained from building society (loan bank) as specified in the contract upon opening of account.

• Savings coefficient: savings in an account as a fraction of total nominal amount of accounts not yet approved for loan.

• Eligibility coefficient: an assessment of the intensity of saving over time, $\int_0^T \mathrm{savings}(t)\, dt$, t being time.

The dissimilarity measure used is the weighted Euclidean metric

\[ d(x_i, x_j) = \sqrt{\sum_{k=1}^{m} \lambda_k (x_{ik} - x_{jk})^2}, \]

whereby $\lambda_k > 0$, $\sum_{k=1}^{m} \lambda_k = 1$.
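As a small illustration (the weights $\lambda_k$ and attribute values below are hypothetical, not the values used in [18]), the weighted metric can be computed as follows:

```python
import math

def weighted_distance(x_i, x_j, weights):
    """Weighted Euclidean metric with weights lambda_k > 0 summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9 and all(w > 0 for w in weights)
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x_i, x_j)))

# equal weights over the five attributes mentioned above (illustrative values only)
lam = [0.2] * 5
print(weighted_distance([10, 5, 0.3, 2.0, 45], [12, 4, 0.5, 1.5, 40], lam))
```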

From Section 3, it is clear that the single-link method is not suited for huge (more than 5 000 objects) data sets. Therefore, we concentrate more efforts on the centroid method.

Centroid method. Clustering by the centroid method is carried out on data, initially filtered into subgroups using nominal criteria as indicated above.

The number of objects in each subgroup is greater than 100 on the average (see Table 1). This is necessary for later forecasting if meaningful probability distributions are to be achieved.

Table 1. Distribution of accounts in clusters, clustered by nominal amount, saving, saving coefficient [18]

Cluster   Number   Nominal amount   Saving   Saving coefficient
   1        182           7          27.1          116
   2         92          11          11.1          366
   3         28          59           8.4          152
   4        108          15           8.1          209
   5        151          14          10.6          154
   6        185          15           5.4           24
   7        158          19           8.7           95
   8         34          59           5.9           35
   9        257          10          14.9           55
  10        127           7          40.4          189

This method is implemented as follows: keep applying the minimum distance method until the error sum of squares norm no longer improves. Then apply the exchange procedure. If the exchange procedure causes an improvement, apply the minimum distance method again.

For 50 000 accounts the number of minimum distance iterations recorded lies between 200 and 300. The number of exchange iterations recorded lies between 5 and 10.

19For a more detailed discussion of scaling and normalization of variables see Vannahme [18], p. 83 and 110.

Carrying out the centroid method on a SUN SPARCserver 1000 with a SPARC processor to group 50 000 objects into 100 clusters took 225.75 minutes [18].

Single-link. The single-link algorithm is particularly suited for identifying geometrically non-elliptical structures in a set of objects. The forecast obtained by applying the single-link algorithm to cluster loan banking accounts is not very meaningful. Almost all combinations of attributes will contain an account that saves at the regular rate.20

5. Clustering assessment techniques

Due to their diverse mathematical representation, it is extremely difficult to compare clustering algorithms from a purely mathematical perspective. Besides, the suitability of an algorithm will usually depend on the dissimilarity measure and type of data21, as well as the relevance to the investigation under study. For the purposes of this paper, the relevance is associated with good forecasting.

The following indices22 have been used in this paper to measure the relative suitability of one clustering method as compared to another.

Hubert's Γ-index. The natural structure of the data is represented by the symmetric dissimilarity matrix

\[ D = (d_{ij})_{i,j \in \{1,\ldots,n\}} \quad \text{with} \quad d_{ij} = \| x_i - x_j \|_2 \ \text{for two objects } x_i, x_j. \]

The structure obtained after clustering is also represented as a symmetric $n \times n$ matrix, defined as follows:

\[ Y = (y_{ij})_{i,j \in \{1,\ldots,n\}}, \qquad y_{ij} := d(z_{cl(x_i)}, z_{cl(x_j)}), \]

whereby $d(z_k, z_l)$ is the distance between the centroids,

\[ z_k = \frac{1}{n_k} \sum_{x_i \in C_k} x_i \quad \text{and} \quad cl(x_i) = k, \ \text{if object } x_i \ \text{lies in cluster } k. \]

That is, the distances between the cluster centroids associated with the respective objects make up the elements of the matrix. The simple Γ-index takes the following form:

\[ \Gamma = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i,j)\, Y(i,j) := \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d_{ij}\, y_{ij}. \]

The larger the value of the Γ-index, the closer the clustering is to the structure represented by the matrix D.

The normalized Γ-index takes on values between −1 and 1 and is of the following form:

\[ \bar{\Gamma} = \frac{\displaystyle \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (X(i,j) - m_x)(Y(i,j) - m_y)}{s_x s_y}, \]

whereby

\[ M = \frac{n(n-1)}{2}; \qquad m_x = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i,j); \qquad m_y = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Y(i,j); \]

\[ s_x^2 = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X^2(i,j) - m_x^2; \qquad s_y^2 = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Y^2(i,j) - m_y^2. \]

A disadvantage of the Γ-index is that it requires quadratic running time, owing to the fact that the dissimilarity matrix is used.
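A direct, quadratic-time computation of the simple and of the normalized Γ-index could look as follows; this is an illustrative sketch following the formulas above, not code from [8, 9]. Here D is the dissimilarity matrix of the objects and Y the matrix of centroid distances $y_{ij}$ defined above.

```python
import math

def hubert_gamma(D, Y):
    """Simple and normalized Hubert Gamma statistics for symmetric n x n matrices D and Y."""
    n = len(D)
    pairs = [(i, j) for i in range(n - 1) for j in range(i + 1, n)]
    M = len(pairs)                               # n(n-1)/2 upper-triangle pairs
    gamma = sum(D[i][j] * Y[i][j] for i, j in pairs)
    m_x = sum(D[i][j] for i, j in pairs) / M
    m_y = sum(Y[i][j] for i, j in pairs) / M
    s_x = math.sqrt(sum(D[i][j] ** 2 for i, j in pairs) / M - m_x ** 2)
    s_y = math.sqrt(sum(Y[i][j] ** 2 for i, j in pairs) / M - m_y ** 2)
    # assumes a non-degenerate clustering, i.e. s_x > 0 and s_y > 0
    gamma_norm = sum((D[i][j] - m_x) * (Y[i][j] - m_y) for i, j in pairs) / (M * s_x * s_y)
    return gamma, gamma_norm
```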

Davies-Bouldin index. This index relates the looseness and the compactness of a cluster. For two clusters $C_j$ and $C_k$ it is defined as follows:

\[ R_{j,k} = \frac{e_j + e_k}{\mathrm{dist}(C_j, C_k)} . \]

Here, $e_j$ is the mean distance of the objects of cluster j from its centroid, and $\mathrm{dist}(C_j, C_k)$ is the distance between the centroids of cluster j and cluster k.

The index of the k-th cluster is

\[ R_k = \max_{j \ne k} R_{j,k} . \]

The Davies-Bouldin index for the entire clustering is

\[ DB(K) = \frac{1}{K} \sum_{k=1}^{K} R_k, \qquad K > 1. \]

The smaller the index, the better the clustering with regards to its compactness. This is because in this case the cluster diameter is small as compared to the cluster distances from one another. The index value is zero in case the number of clusters equals the number of objects.
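An illustrative computation of DB(K) from a labelled data set, following the definitions above (a sketch with hypothetical helper names, assuming at least two non-empty clusters):

```python
import math

def db_index(objects, labels, K):
    """Davies-Bouldin index: mean over clusters of the worst looseness/separation ratio."""
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    # assumes every cluster k = 0, ..., K-1 contains at least one object
    clusters = [[x for x, l in zip(objects, labels) if l == k] for k in range(K)]
    centroids = [[sum(col) / len(c) for col in zip(*c)] for c in clusters]
    e = [sum(dist(x, centroids[k]) for x in c) / len(c) for k, c in enumerate(clusters)]
    R = [max((e[k] + e[j]) / dist(centroids[k], centroids[j])
             for j in range(K) if j != k)
         for k in range(K)]
    return sum(R) / K
```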

Sum of squares norm. The objective function $f_E$ of the centroid method offers a possibility of comparing clusterings. The sum-of-squares norm measures the compactness of the clusters: the smaller its value, the more compact the clusters are with respect to the attributes.

Component index. The component index $F_i$ measures the contribution of each object component (attribute) to the clustering. The sum-of-squares norm is computed for each attribute i:

\[ E_i = \sum_{k=1}^{K} \sum_{x_j \in C_k} (x_{ji} - z_{ki})^2 \qquad \text{for } i = 1, \ldots, m. \]

Now, $E_i$ is compared, for each attribute i, with an overall index $b_i$ involving all K clusters. For $i = 1, \ldots, m$ we get

\[ b_i = \sum_{k=1}^{K} n_k \left( (z_{ki})^2 - \bar{z}_i^{\,2} \right) = \left( \sum_{k=1}^{K} n_k (z_{ki})^2 \right) - n \bar{z}_i^{\,2}, \]

whereby $\bar{z} = (\bar{z}_1, \ldots, \bar{z}_m)$ is the mean vector of all objects. The weighted quotient

\[ F_i = \frac{b_i / (K-1)}{E_i / (n-K)} \qquad \text{for } i = 1, \ldots, m \]

is an index for the looseness of the individual clusters with regard to an attribute.

The larger the value of $F_i$, the more similar attribute i is within the individual clusters. It is always advisable to assess with as many indices as possible, since the various indices consider differing aspects of the clustering.

6. Results and discussion

The single-link algorithm is particularly suited for identifying non-elliptical structures in a set of objects. The forecast obtained by applying this algorithm to cluster loan banking accounts is not very meaningful. Almost all combinations of attributes will contain an account that saves at the regular rate23. The single-link algorithm always isolates these, since it fuses any two clusters having the nearest neighbours. This means that the algorithm detects customers with very irregular behaviour and places each of them in a single cluster, while the rest of the accounts are placed in one common cluster.

Upon grouping 150 accounts into 30 clusters one gets about 20 clusters each containing just one or two customer accounts. The rest are in the remaining 10 clusters. Even changing the clustering criteria does not change anything.

The fact that a few clusters are heavily loaded whereas most are almost empty, does not guarantee meaningful statistical results, for which the clustering is actually intended. Therefore, the single-link method cannot be used to cluster customer accounts of loan banking. This is because the clusters would subsequently have to be used to do forecasts based on probability distributions.

Table 2 presents results obtained from the centroid method for the first 10 clusters.

Table 2. Centroids of the first 10 clusters [18]

Cluster   Nominal amount   Saving   Saving coefficient
   1          29.75          4.31         29.46
   2          97.40          0.05         12.77
   3           9.34         16.04         51.06
   4          99.20          0.06         25.75
   5          99.55         28.20         39.80
   6          24.42          5.04         58.03
   7          99.11          1.66          4.82
   8          25.37         23.56         40.22
   9          99.65          1.77         15.86
  10          47.09         30.89         40.48

On the other hand, the single-link algorithm makes use of the dissimilarity matrix, or it has to compute object distances at each iteration. Its running time increases quadratically as a function of the data size. Therefore, this method is not very useful for the data set currently under investigation.

Results obtained by using these two algorithms can be compared with the help of the indices discussed in Section 5.

Both methods correspond to two different objective functions of the general clustering optimization problem. For comparison purposes 3 686 customer accounts are considered, each of which is in the saving phase24 for one year. The clustering criteria were nominal amount, saving and saving coefficient for each contract.

The Davies-Bouldin index for the single-link method is lower. This is because the Davies-Bouldin index gives higher values for clusters with more objects. The single-link method sorts the "outliers" into individual clusters, resulting in small numbers of objects in many clusters.

23Each tarif fixes a percentage of the nominal amount that should be saved in each agreed period.

24In loan banking, a contract evolves in three main phases: saving phase, assignment phase and loan phase [13].

On the other hand, the Davies-Bouldin index measures the looseness of the cluster and obviously obtains low values for the single-link method.

The centroid method distributes the objects relatively uniformly amongst the clusters.

The Γ-index compares the distances of the objects amongst one another to the distances between the cluster centroids. The Γ-index of the single-link method differs greatly from that of the centroid method. This index assesses the single-link method to be worse than the centroid method. This is due to the fact that the single-link method builds rings which do not stand out very clearly on this data set.

The error sum of squares norm for the single-link method is higher than that of the centroid method.

A comparison of the component index of the attributes being used reveals that the centroid method is far better than the single-link method. The component indices of the single-link method are a lot worse than those of the centroid method.

7. Conclusion and outlook

Based on the various indices, one can conclude that the centroid method is suitable for the forecasting. The single-link method can be used to sort out contracts with peculiar properties (outliers) amongst the contracts.

The hierarchical single-link clustering algorithm is not suitable for forecasting for two main reasons.

Firstly, it requires too much space (storage). An attempt to minimize it leads to an equivalent increase in running time. This is caused by the fact that this method uses the dissimilarity matrix as a basis of its computations.

Secondly, the single-link method cannot be used to identify clusters in the basis year against the year for which forecasting is sought. This is because the clustering is strongly related to the distances between contracts.

The single-link method could, however, be useful for other problem sizes and other objectives. Even the running times for large problems seem to be within acceptable limits.

As two further approaches to evaluating discrete data we mention automatic classification (see [4]) and formal concept analysis (see [7]).

The authors intend to carry out future research on the following topics:

• application of cluster methods to portfolio optimization (stocks and bonds), life insurances and pensions,


• comparison of exact and heuristic algorithms under the criteria of both quality of the solutions (error bounds) and complexity,

• investigation of traffic on highways by means of cluster methods: finding “typical drivers”. This project of ZAIK and momatec company (Aachen, Germany) aims at traffic regulation and, finally, navigation systems.

Acknowledgements: We want to express our gratitude to the Faculty of Mathematics of Chemnitz University of Technology for giving us the opportunity of scientific collaboration, especially Prof. Dr. K. Beer and Prof. Dr. B. Luderer. Furthermore, we thank the Center for Applied Computer Science, Cologne, for making the second author familiar with loan banking, and Prof. Dr. Yu. Shokin and Dr. L. Chubarov for encouragement and support.

References

[1] Ahuja K., Magnanti L., Orlin J. Network Flows. N.J.: Prentice-Hall, 1993.

[2] Aldenderfer S., Blashfield K. Cluster Analysis. SAGE Publ., 1985.

[3] Anderberg M. Cluster Analysis for Applications. N.Y.: Acad. Press, 1973.

[4] Bock H. H. Automatische Klassifikation. Vandenhoeck & Ruprecht, 1974.

[5] Brucker P. On the complexity of clustering problems // Optimization and Operations Research. Springer, 1974. P. 45-54.

[6] Duran S., Odell L. Cluster Analysis — a Survey. Berlin, Heidelberg, N.Y.: Springer-Verlag, 1974.

[7] Ganter B., Wille R. Formale Begriffsanalyse: Mathematische Grundlagen. Berlin, Heidelberg, N.Y.: Springer-Verlag, 1996.

[8] Jobson J. D. Applied Multivariate Data Analysis. Vol. I: Regression and Experimental Design. Springer, 1992.

[9] Jobson J. D. Applied Multivariate Data Analysis. Vol. II: Categorical and Multivariate Methods. Springer, 1992.

[10] Garey M., Johnson D. Computers and Intractability: A Guide to the Theory of NP-Completeness. N.Y.: Freeman and Company, 1979.

[11] Jain A., Dubes R. Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice-Hall, 1988.

[12] Knab B., Schrader R., Weber I. et al. Mesoskopisches Simulationsmodell zur Kollektivfortschreibung. Center for Applied Computer Science, Cologne, Report 97.295, 1997.

[13] Lehmann W. Die Bausparkassen. Frankfurt am Main: Fritz Knapp Verlag, 1965.

[14] Megiddo N., Supowit K. On the complexity of some common geometric location problems // SIAM J. on Comp. 1984. Vol. 13(1). P. 182-196.

[15] Mirkin B. Mathematical Classification and Clustering. N.Y.: Kluwer Acad. Publ., 1996.

[16] Prim R. Shortest connection networks and some generalizations // Bell System Technical J. 1957. Vol. 36. P. 1389-1401.

[17] Spaeth H. Cluster-Formation und Analyse. Oldenbourg Verlag, 1983.

[18] Vannahme I. M. Clusteralgorithmen zur mathematischen Simulation von Bausparkollektiven: Doctoral thesis. Cologne: Univ. of Cologne, 1996.

[19] Weber G.-W. Mathematische Optimierung in Finanzwirtschaft und Risikomanagement — diskrete, stetige und stochastische Optimierung bei Lebensversicherungen, Bausparverträgen und Portfolios: Lecture held at Chemnitz Univ. of Technology, summer semester 2001.

[20] Wu C., Horng S., Tsai H. Efficient parallel algorithms for hierarchical clustering on arrays with reconfigurable optical buses // J. Parallel and Distributed Computing. 2000. Vol. 60. P. 1137-1153.

Received for publication July 25, 2001
