Integrating Fuzzy c-Means Clustering with PostgreSQL*
R. M. Miniakhmetov, tavein@gmail.com, South Ural State University
Abstract. Many data sets to be clustered are stored in relational databases. Having a clustering algorithm implemented in SQL makes it easier to cluster data inside a relational DBMS than outside of it with alternative tools. In this paper we propose an adaptation of the Fuzzy c-Means clustering algorithm for the open-source relational DBMS PostgreSQL.
Keywords: fuzzy c-means; fcm; fuzzy clustering; postgresql; integrating clustering; relational dbms.
1. Introduction
Integrating clustering algorithms into a DBMS is a topical issue for database programmers [1]. Such an approach, on the one hand, hides DBMS internal details from the application programmer. On the other hand, it avoids the overhead of exporting data outside a relational DBMS. The Fuzzy c-Means (FCM) [2], [3], [4] clustering algorithm provides fuzzy clustering of data. Currently this algorithm has many implementations in high-level programming languages [5], [6]. To implement the FCM algorithm in SQL we choose the open-source PostgreSQL DBMS [7].
The paper is organized as follows. Section 2 introduces basic definitions and an overview of the FCM algorithm. Section 3 proposes an implementation of FCM in SQL called pgFCM. Section 4 briefly discusses related work. Section 5 contains concluding remarks and directions for future work.
* This paper is supported by the Russian Foundation for Basic Research (grant No. 09-07-00241-a).
2. The Fuzzy c-Means Algorithm
K-Means [8] is one of the most popular clustering algorithms; it is simple and fairly fast [9]. The FCM algorithm generalizes K-Means to provide fuzzy clustering, where data vectors can belong to several partitions (clusters) at the same time with a given weight (membership degree). To describe FCM we use the following notation:
• $d \in \mathbb{N}$ — dimensionality of the data vector (data item) space;
• $l \in \mathbb{N}$, $1 \le l \le d$ — subscript of a vector's coordinate;
• $n \in \mathbb{N}$ — cardinal number (size) of the training set;
• $X \subset \mathbb{R}^{n \times d}$ — training set of data vectors;
• $i \in \mathbb{N}$, $1 \le i \le n$ — vector subscript in the training set;
• $x_i \in X$ — the $i$-th vector in the sample;
• $k \in \mathbb{N}$ — number of clusters;
• $j \in \mathbb{N}$, $1 \le j \le k$ — cluster number;
• $C \subset \mathbb{R}^{k \times d}$ — matrix of cluster centers (centroids);
• $c_j \in \mathbb{R}^{d}$ — center of cluster $j$, a $d$-dimensional vector;
• $x_{il}, c_{jl} \in \mathbb{R}$ — the $l$-th coordinates of vectors $x_i$ and $c_j$ respectively;
• $U \subset \mathbb{R}^{n \times k}$ — matrix of membership degrees, where $u_{ij} \in \mathbb{R}$, $0 \le u_{ij} \le 1$, is the membership degree between vector $x_i$ and cluster $j$;
• $\rho(x_i, c_j)$ — distance function, defining the distance between vector $x_i$ and centroid $c_j$;
• $m \in \mathbb{R}$, $m > 1$ — fuzzification degree of the objective function;
• $J_{FCM}$ — objective function.
The FCM is based on minimization of the objective function $J_{FCM}$:

$$J_{FCM}(X, C, m) = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{m}\, \rho^{2}(x_i, c_j) \qquad (1)$$
Fuzzy clustering is carried out through an iterative optimization of the objective function (1). The membership matrix $U$ and the centroids $c_j$ are updated using the following formulas:

$$u_{ij} = \left( \sum_{t=1}^{k} \left( \frac{\rho(x_i, c_j)}{\rho(x_i, c_t)} \right)^{\frac{2}{m-1}} \right)^{-1} \qquad (2)$$

$$c_{jl} = \frac{\sum_{i=1}^{n} u_{ij}^{m}\, x_{il}}{\sum_{i=1}^{n} u_{ij}^{m}} \qquad (3)$$
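As a small illustration of formula (2) (the numbers are assumed purely for the example), take $m = 2$, $k = 2$, and distances $\rho(x_i, c_1) = 1$, $\rho(x_i, c_2) = 2$. Then

$$u_{i1} = \left(1^{2} + \left(\tfrac{1}{2}\right)^{2}\right)^{-1} = 0.8, \qquad u_{i2} = \left(2^{2} + 1^{2}\right)^{-1} = 0.2,$$

so the membership degrees of $x_i$ sum to one and the closer centroid receives the larger weight.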
Let $s$ be the iteration number, let $u_{ij}^{(s)}$ and $u_{ij}^{(s+1)}$ be the elements of matrix $U$ on steps $s$ and $s+1$ respectively, and let $\varepsilon \in (0,1) \subset \mathbb{R}$ be a termination criterion. Then the termination condition can be written as follows:

$$\max_{i,j} \left\{ \left| u_{ij}^{(s+1)} - u_{ij}^{(s)} \right| \right\} < \varepsilon \qquad (4)$$
The iterative process converges to a local minimum (or a saddle point) of the objective function (1) [10].
Input: $X, k, m, \varepsilon$
Output: $U$
1. $s := 0$, $U^{(0)} := (u_{ij})$ {initialization}
2. repeat
   {computation of new centroids' coordinates}
3.   Compute $C^{(s)} := (c_{jl})$ using formula (3), where $u_{ij} \in U^{(s)}$
   {update of matrix values}
4.   Compute $U^{(s+1)} := (u_{ij}^{(s+1)})$ using formula (2)
5.   $s := s + 1$
6. until $\max_{i,j} \{ | u_{ij}^{(s)} - u_{ij}^{(s-1)} | \} < \varepsilon$

Fig. 1. The Fuzzy c-Means Algorithm.
The algorithm shown in Fig. 1 presents the basic FCM. The algorithm receives as input a set of data vectors $X = (x_1, x_2, \ldots, x_n)$, the number of clusters $k$, the fuzzification degree $m$, and the termination criterion $\varepsilon$. The output is the matrix of membership degrees $U$.
3. Implementation of Fuzzy c-Means Algorithm using PostgreSQL
In this section we suggest the pgFCM algorithm as a way to integrate the FCM algorithm with the PostgreSQL DBMS.
3.1 General Definitions
To integrate the FCM algorithm with a relational DBMS it is necessary to represent the matrices $U$ and $X$ as relational tables. The subscripts used to identify elements of the relational tables are presented in Table 1 (the numbers $n$, $k$, $d$ are defined above in Section 2).
Table 1. Data Element Subscripts.

Subscript | Range     | Semantics
i         | 1, ..., n | vector subscript
j         | 1, ..., k | cluster subscript
l         | 1, ..., d | vector's coordinate subscript
As the distance function $\rho(x_i, c_j)$, without loss of generality, we use the Euclidean metric:

$$\rho(x_i, c_j) = \sqrt{\sum_{l=1}^{d} (x_{il} - c_{jl})^{2}} \qquad (5)$$
To compute the termination criterion (4) we introduce the function $\delta$ as follows:

$$\delta = \max_{i,j} \left\{ \left| u_{ij}^{(s+1)} - u_{ij}^{(s)} \right| \right\} \qquad (6)$$
3.2 Database Schema
Table 2 summarizes the database schema of the pgFCM algorithm (primary key columns are marked with an asterisk).

Table 2. Relational Tables of the pgFCM Algorithm.

No. | Table | Semantics                                                  | Columns             | Number of rows
1   | SH    | training set of data vectors (horizontal form)             | i*, x1, x2, ..., xd | n
2   | SV    | training set of data vectors (vertical form)               | i*, l*, val         | n·d
3   | C     | centroids' coordinates                                     | j*, l*, val         | k·d
4   | SD    | distances between x_i and c_j                              | i*, j*, dist        | n·k
5   | U     | membership degree of vector x_i in cluster c_j on step s   | i*, j*, val         | n·k
6   | UT    | membership degree of vector x_i in cluster c_j on step s+1 | i*, j*, val         | n·k
7   | P     | result of computing function δ (6) on step s               | d, k, n, s, delta   | number of iterations
In order to store the sample of data vectors from set X we define table SH(i, x1, x2, ..., xd). Each row stores a data vector of dimension d with subscript i. Table SH has n rows and column i as a primary key. The FCM steps demand aggregation of vector coordinates (sum, maximum, etc.) from set X. However, because of its definition, table SH does not allow using SQL aggregation functions over coordinates. To avoid this obstacle we define table SV(i, l, val), which contains n·d rows and has the composite primary key (i, l). Table SV represents the data sample from table SH and supports the SQL aggregation functions max and sum. To store the coordinates of cluster centroids, temporary table C(j, l, val) is defined. Table C has k·d rows and the composite primary key (j, l). Like table SV, the structure of table C allows using aggregation functions. In order to store the distances ρ(x_i, c_j), table SD(i, j, dist) is used. This table has n·k rows and the composite primary key (i, j). Table U(i, j, val) stores the membership degrees calculated on the s-th step. To store the membership degrees on step s+1, a similar table UT(i, j, val) is used. Both tables have n·k rows and the composite primary key (i, j). Finally, table P(d, k, n, s, delta) stores the iteration number s and the result delta of computing function (6) for this iteration. The number of rows in table P depends on the number of iterations.
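The column types are not specified in the paper; below is a minimal sketch of the corresponding table definitions for the case d = 2, assuming integer subscripts and double precision values (inside the stored function, the working tables C, SD, U, UT, and P can be created as temporary tables).

-- Sketch of the pgFCM schema for d = 2; all column types are assumptions.
CREATE TABLE SH (i integer PRIMARY KEY,
                 x1 double precision,
                 x2 double precision);                 -- training set, horizontal form
CREATE TABLE SV (i integer, l integer, val double precision,
                 PRIMARY KEY (i, l));                  -- training set, vertical form
CREATE TABLE C  (j integer, l integer, val double precision,
                 PRIMARY KEY (j, l));                  -- centroid coordinates
CREATE TABLE SD (i integer, j integer, dist double precision,
                 PRIMARY KEY (i, j));                  -- distances rho(x_i, c_j)
CREATE TABLE U  (i integer, j integer, val double precision,
                 PRIMARY KEY (i, j));                  -- membership degrees on step s
CREATE TABLE UT (i integer, j integer, val double precision,
                 PRIMARY KEY (i, j));                  -- membership degrees on step s + 1
CREATE TABLE P  (d integer, k integer, n integer, s integer,
                 delta double precision);              -- delta of function (6) per iteration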
3.3 The pgFCM Algorithm
The pgFCM algorithm is implemented as a stored function in the PL/pgSQL language. Fig. 2 shows the main steps of the pgFCM algorithm.
Input: SH, k, m, eps
Output: U
{initialization}
1. Create and initialize temporary tables (U, P, SV, etc.)
2. repeat {computations}
3.   Compute centroid coordinates. Update table C.
4.   Compute distances ρ(x_i, c_j). Update table SD.
5.   Compute membership degrees UT = (u_ij).
   {update}
6.   Update tables P and U.
   {check for termination}
7. until P.delta < eps

Fig. 2. The pgFCM Algorithm.
The input set of data vectors X is stored in table SH. The fuzzification degree m, the termination criterion ε (eps), and the number of clusters k are function parameters. Table U contains the result of the pgFCM run.
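A possible skeleton of such a stored function is sketched below. The function signature, variable names, and loop structure are illustrative assumptions rather than the paper's exact code; the SQL statements I1-I3, C1-C3, U1, U2, and CH1 from Sections 3.4-3.7 are assumed to be inlined where the comments indicate, and the function is not operational until they are.

-- Illustrative skeleton of the pgFCM stored function (not the paper's exact code).
CREATE OR REPLACE FUNCTION pgFCM(d integer, k integer, n integer,
                                 m double precision, eps double precision)
RETURNS void AS $$
DECLARE
  tmp double precision;
BEGIN
  -- initialization: create temporary tables and run I1, I2, I3
  LOOP
    -- empty C, SD, UT before refilling them (assumed housekeeping, not shown in the paper)
    -- C1: compute centroid coordinates into C
    -- C2: compute distances into SD
    -- C3: compute membership degrees into UT
    -- U1: append the new iteration number and delta to P
    -- U2: TRUNCATE U; copy UT into U
    -- CH1: select the delta of the latest iteration into tmp
    IF tmp < eps THEN   -- CH2: termination condition (4)
      RETURN;
    END IF;
  END LOOP;
END;
$$ LANGUAGE plpgsql;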
3.4 Initialization
Initialization of tables SV, U, and P is provided by SQL code I1, I2, and I3 respectively. Table SV is formed by sampling records from table SH.
I1: INSERT INTO SV
    SELECT SH.i, 1, x1 FROM SH;
    ...
    INSERT INTO SV
    SELECT SH.i, d, xd FROM SH;
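For arbitrary d, these d INSERT statements can be generated inside the stored function with dynamic SQL. A minimal sketch, assuming the coordinate columns of SH are named x1, ..., xd and d is a function parameter:

-- Hypothetical loop producing the I1 statements for every coordinate l = 1..d.
FOR l IN 1..d LOOP
  EXECUTE 'INSERT INTO SV SELECT SH.i, ' || l || ', x' || l || ' FROM SH';
END LOOP;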
For table U the membership degree between data vector x_i and cluster j is set to 1 for all i = j and to 0 otherwise.
I2: INSERT INTO U (i, j, val)
    VALUES (1, 1, 1);
    ...
    INSERT INTO U (i, j, val)
    VALUES (j, j, 1);
    ...
    INSERT INTO U (i, j, val)
    VALUES (n, k, 0);
In other words, the first k data vectors from the sample are used as the starting coordinates of the centroids:

$$\forall\, i = j: \; u_{ij} = 1 \;\Rightarrow\; c_j = x_j$$
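Equivalently, the whole n × k matrix U can be initialized with a single statement; a sketch assuming PostgreSQL's generate_series() and the values n and k available as function parameters:

-- Hypothetical one-statement initialization of U: 1 on the diagonal, 0 elsewhere.
INSERT INTO U (i, j, val)
SELECT i, j, CASE WHEN i = j THEN 1.0 ELSE 0.0 END
FROM generate_series(1, n) AS i,
     generate_series(1, k) AS j;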
When initializing table P, the number of clusters k is taken from a parameter of the pgFCM function. The dimensionality d of the data vector space and the cardinal number n of the training set are also provided by pgFCM function parameters. The iteration number s and delta are initialized with zeros.
13: INSERT INTO P(d, k, n, s, delta)
VALUES (d, k, n, 0, 0.0);
3.5 Computations
According to Fig. 2, the computation stage is split into the following three substeps: computation of centroid coordinates, computation of distances, and computation of membership degrees, marked as C1, C2, and C3 respectively.
C1: INSERT INTO C
    SELECT R1.j, R1.l, R1.s1 / R2.s2 AS val
    FROM (SELECT j, l, sum(U.val ^ m * SV.val) AS s1
          FROM U, SV
          WHERE U.i = SV.i
          GROUP BY j, l) AS R1,
         (SELECT j, sum(U.val ^ m) AS s2
          FROM U
          GROUP BY j) AS R2
    WHERE R1.j = R2.j;
C2: INSERT INTO SD
    SELECT i, j, sqrt(sum((SV.val - C.val) ^ 2)) AS dist
    FROM SV, C
    WHERE SV.l = C.l
    GROUP BY i, j;
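Since the Euclidean metric was chosen without loss of generality, changing the distance function affects only C2. For example, an illustrative variant using the Manhattan (L1) distance, which is not used in the paper, would be:

-- Illustrative alternative to C2 using the Manhattan distance.
INSERT INTO SD
SELECT i, j, sum(abs(SV.val - C.val)) AS dist
FROM SV, C
WHERE SV.l = C.l
GROUP BY i, j;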
In FCM, the membership degrees are computed by formula (2). The numerator of the fraction in formula (2) does not depend on t, so we can rewrite this formula as follows:
$$u_{ij} = \rho^{\frac{2}{1-m}}(x_i, c_j) \cdot \left( \sum_{t=1}^{k} \rho^{\frac{2}{1-m}}(x_i, c_t) \right)^{-1} \qquad (7)$$
Thus, the computation of membership degrees can be written as follows:
C3: INSERT INTO UT
    SELECT SD.i, SD.j, SD.dist ^ (2.0 / (1.0 - m)) * SD1.den AS val
    FROM (SELECT i, 1.0 / sum(dist ^ (2.0 / (1.0 - m))) AS den
          FROM SD
          GROUP BY i) AS SD1, SD
    WHERE SD.i = SD1.i;
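A simple way to validate this step (an illustrative check, not part of the paper) is to verify that the membership degrees of every vector sum to one:

-- For every vector i the membership degrees in UT should sum to 1 (up to rounding).
SELECT i, sum(val) AS total
FROM UT
GROUP BY i
ORDER BY i;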
3.6 Update
The update stage of pgFCM modifies tables P and U as shown below in U1 and U2 respectively.
U1: INSERT INTO P
    SELECT L.d, L.k, L.n, L.s + 1 AS s, E.delta
    FROM (SELECT max(abs(UT.val - U.val)) AS delta
          FROM U, UT
          WHERE U.i = UT.i AND U.j = UT.j) AS E,
         (SELECT d, k, n, max(s) AS s
          FROM P
          GROUP BY d, k, n) AS L;
Table UT stores the temporary membership degrees to be inserted into table U. To rapidly remove all rows of table U obtained at the previous iteration, we use the TRUNCATE operator.
U2: TRUNCATE U;
INSERT INTO U SELECT * FROM UT;
3.7 Check
This is the final stage of the pgFCM algorithm. On each iteration the termination condition (4) must be checked.
To implement the check, the result delta of the function (6) from table P is stored in the temporary variable tmp.
CH1: SELECT delta INTO tmp
     FROM P, (SELECT d, k, n, max(s) AS max_s
              FROM P
              GROUP BY d, k, n) AS L
     WHERE P.s = L.max_s AND P.d = L.d AND P.k = L.k AND P.n = L.n;
After selecting the delta, we need to check the condition δ < ε. If this condition is true, we stop; otherwise, the work continues.
CH2: IF (tmp < eps) THEN
RETURN;
END IF;
The final result of the algorithm pgFCM will be stored in table U .
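A hypothetical end-to-end invocation, assuming the sketched function signature above and a two-dimensional sample with two clusters (the data values are made up for illustration):

-- Load a toy sample into SH, run pgFCM, and read the resulting membership degrees.
INSERT INTO SH VALUES (1, 0.0, 0.1), (2, 0.2, 0.0), (3, 5.0, 5.1), (4, 5.2, 4.9);
SELECT pgFCM(2, 2, 4, 2.0, 0.001);   -- d = 2, k = 2, n = 4, m = 2, eps = 0.001
SELECT i, j, val FROM U ORDER BY i, j;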
4. Related Work
Research on integrating data mining algorithms with relational DBMS includes the following. Association rules mining is explored in [11]. General data mining primitives are suggested in [12]. Primitives for decision trees mining are proposed in [13].
Our research was inspired by papers [1], [14], where the K-Means clustering algorithm was integrated with a relational DBMS. The approach we exploit is similar to the one mentioned above. The main contribution of the paper is an extension of the results presented in [1], [14] to the case where data vectors may belong to several clusters. Such a case is very important in problems connected with medical data analysis [15], [16]. To the best of our knowledge there are no papers devoted to implementing fuzzy clustering within a relational DBMS.
5. Conclusion
In this paper we have proposed the pgFCM algorithm. pgFCM implements the Fuzzy c-Means clustering algorithm and processes data stored in relational tables using the open-source DBMS PostgreSQL. There are the following directions in which to continue our research. First, we plan to investigate pgFCM scalability using both synthetic and real data sets. Second, we plan to develop a parallel version of pgFCM for distributed-memory multiprocessors.
References
[1] C. Ordonez. Programming the K-means clustering algorithm in SQL. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 823-828.
[2] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 1999, Vol. 31, iss. 3, pp. 264-323.
[3] J. C. Dunn. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics, 1973, Vol. 3, Iss. 3, pp. 32-57.
[4] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell, USA, 1981, p. 256.
[5] E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel. Machine Learning Open-Source Package 'r-cran-e1071', 2010. http://cran.r-project.org/web/packages/e1071/index.html. Reference date: 13.06.2011.
[6] I. Drost, T. Dunning, J. Eastman, O. Gospodnetic, G. Ingersoll, J. Mannix, S. Owen, and K. Wettin. Apache Mahout, 2010.
https://cwiki.apache.org/confluence/display/MAHOUT/Fuzzy+K-Means. Reference date: 13.06.2011.
[7] M. Stonebraker, L. A. Rowe, and M. Hirohama. The Implementation of POSTGRES. IEEE Transactions on Knowledge and Data Engineering, March 1990, Vol. 2, Iss. 1, pp. 125-142.
[8] J. B. MacQueen. Some Methods for Classification and Analysis of MultiVariate Observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, Vol. 1, pp. 281-297.
[9] P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling Clustering Algorithms to Large Databases. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, 1998, pp. 9-15.
[10] J. Bezdek, R. Hathaway, M. Sabin, and W. Tucker. Convergence Theory for Fuzzy c-Means: Counterexamples and Repairs. IEEE Transactions on Systems, Man and Cybernetics, 1987, Vol. 17, Iss. 5, pp. 873-877.
[11] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: alternatives and implications. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 1998, pp. 343-354.
[12] J. Clear, D. Dunn, B. Harvey, M. Heytens, P. Lohman, A. Mehta, M. Melton, L. Rohrberg, A. Savasere, R. Wehrmeister, and M. Xu. NonStop SQL/MX primitives for knowledge discovery. Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 425-429.
[13] G. Graefe, U. M. Fayyad, and S. Chaudhuri. On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, 1998, pp. 204-208.
[14] C. Ordonez. Integrating K-Means Clustering with a Relational DBMS Using SQL. IEEE Transactions on Knowledge and Data Engineering, 2006, Vol. 18, Iss. 2, pp. 188-201.
[15] A. I. Shihab. Fuzzy Clustering Algorithms and their Applications to Medical Image Analysis. PhD thesis, University of London, 2000.
[16] D. Zhang and S. Chen. A Novel Kernelized Fuzzy c-Means Algorithm with Application in Medical Image Segmentation. Artificial Intelligence in Medicine, 2004, Vol. 32, pp. 37-50.