Svetlana V. Popova Ivan A. Khodyrev
UDC 004.91+519.237
NARROW-DOMAIN SHORT TEXTS CLUSTERING ALGORITHM
Abstract
In this paper, we describe an algorithm for clustering narrow-domain short texts, based on term selection and a modification of the k-means algorithm. Our approach was tested on two collections: CICling-2002 and SEPLN-CICling. Test results and conclusions are presented.
Keywords: information retrieval, texts clustering, narrow-domain short texts clustering, k-means, genetic algorithm.
1. INTRODUCTION
Our focus is the task of clustering narrow-domain short texts (below we use the shorter terms «N-Dst clustering» or «N-Dst»). This research topic is relevant now, especially in the field of automated text processing, for three reasons: practical necessity, the difficulty of the task, and the small number of research papers. Today the most important role in this field is played by Paolo Rosso, Alexander Gelbukh, David Pinto, Mikhail Alexandrov, Marcelo Errecalde, Diego Ingaramo, Leticia C. Cagnina, Fernando Perez-Tellez, John Cardiff and others. Most of these authors conclude that the N-Dst clustering problem is difficult and not well researched, and that much work remains to be done [2, 7, 9].
Results of N-Dst could be used in different ways: searching scientific abstracts, analysis of news articles and of any other kind of media, such as blogs. Clustering abstracts is important for reducing the time spent searching for useful articles in a particular domain. Clustering news about the same topic, when new information about the same event is sought, is also a promising task. For example, in a news flow the same information about an event is usually repeated in many sources, and new knowledge appears alongside the old, so retrieving really new information from the flow can be a challenging task. The proposed technique could also be used for monitoring reactions and behavioral correlations in social media (blogs, forums, tweets etc.) in response to some influence factor. The factor could be a concrete event or a timed trend of some indicator. One more area where N-Dst could be used is the processing of sociological research results (responses, recommendations, and essays on a given topic).
© Svetlana V. Popova, Ivan A. Khodyrev, 2011
Clustering narrow-domain short texts differs from clustering large texts, because frequency analysis, a common technique for large texts, is not applicable to small ones due to data sparsity.
2. ALGORITHM DESCRIPTION
The goal of the algorithm is to obtain a set of clusters with texts assigned to them, which reflects the topic structure of a source narrow-domain short text collection. At present a basic version of the algorithm has been developed and implemented. Our approach has two steps:
1) Terms selection and building a set of significant words, which will be used to characterize texts (dimension reduction, collection vocabulary reduction)
2) Clustering, using keyword set from 1.
2.1. TERMS SELECTION
Let $T = \{t_i\}_{i=\overline{1,n}}$ be the set of all words in a collection and $D = \{d_j\}_{j=\overline{1,|D|}}$ the set of all texts in a collection. Clustering in an $n$-dimensional vector space was taken as a basis. For narrow-domain collections, words with the highest occurrence frequency are less significant for clustering, so they should be filtered out. Another important task is to find words that reflect the specific features of text groups inside the collection. We call such words «significant». Using significant words reduces the dimension to a size $z < n$, where $z$ is the number of significant words: $S = \{t^s_1, t^s_2, \dots, t^s_z\}$, $t^s_i \in T$, $z < |T|$. After dimension reduction each text $d_j$ is represented by a binary vector $(v_1, v_2, \dots, v_{|S|})$, where

$$v_i = \begin{cases} 1, & t^s_i \in d_j \\ 0, & t^s_i \notin d_j \end{cases}$$

The clustering algorithm is a modification of k-means. To calculate distance in k-means we use Euclidean distance.
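As an illustration of this representation, the following minimal Python sketch builds binary vectors over a set of significant words and computes the Euclidean distance between two texts. The function names (`to_binary_vector`, `euclidean`), the sample texts and the three significant words are hypothetical and chosen only for the example.

```python
import math


def to_binary_vector(text_words, significant_words):
    """Return v with v[i] = 1 if the i-th significant word occurs in the text, else 0."""
    present = set(text_words)
    return [1 if term in present else 0 for term in significant_words]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical significant words and two toy texts.
significant = ["corpus", "lexicon", "translation"]
d1 = "a lexicon built from a parallel corpus".split()
d2 = "machine translation with a bilingual lexicon".split()

v1 = to_binary_vector(d1, significant)
v2 = to_binary_vector(d2, significant)
print(v1, v2)             # [1, 1, 0] [0, 1, 1]
print(euclidean(v1, v2))  # square root of 2, since the vectors differ in two positions
```

With binary vectors the squared Euclidean distance is simply the number of significant words in which two texts differ, which is one reason a small vocabulary keeps the distances within a narrow range.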
The choice of significant words is based on three hypotheses.
1. For narrow-domain collections we assume that the significant words of thematic document groups are not typical for the whole collection, but are placed in texts near words that can be found in most documents of the collection. This assumption is based on the idea that words with a high DF (Document Frequency) value determine the context of the whole collection, and the words placed near them determine the nuances of their usage.
2. For short texts we assume that a significant word $t_1$ is often placed together with a word $t_2$ and rarely far from it. Word $t_1$ relates to the usage nuances of $t_2$. This assumption reflects the idea that context-significant groups of words in short texts are often placed together and rarely separately.
3. We assume that the semantics of a text group is determined by sets of words that occur together in the group's texts.
Based on these assumptions, an algorithm that finds significant words was developed. It is divided into two stages, described below. The first stage begins with the choice of the words with the highest value of $DF(t_i) = \sum_{d_j \in D} \mathrm{boolean}(t_i \in d_j)$. Then a threshold $J$ is selected, where $J_{max} = \max_i DF(t_i)$ is its maximum possible value. Words $t_{freq}$ with $DF(t_{freq}) > J$ are selected from the collection's vocabulary. It is better that the number of these words is less than 5. Then information about word pairs is used to create a set of candidate words (word applicants). By «word pair» we mean a pair of words that occur together in at least one text within a window of three words. From the set of all word pairs we choose only the «good» ones, defined as follows.
Good pairs are chosen as follows. Let a pair consist of two words $t_i$ and $t_j$, and let $m$ be the number of pairs $(t_i, t_j)$ inside collection $D$. Take the smaller of the two document frequencies, $DF_{min} = \min(DF(t_i), DF(t_j))$. If

$$m > DF_{min} - \frac{DF_{min}}{a}$$

then the pair $(t_i, t_j)$ is «good». Parameter $a$ is set manually.
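The good-pair test can be sketched in a few lines of Python. This is only an illustration under two assumptions that the text does not fix: a «window of three» is read as three consecutive words, and $m$ is counted as the number of texts in which the pair occurs inside such a window. All function names and the toy collection are hypothetical.

```python
from collections import Counter
from itertools import combinations


def document_frequency(texts):
    """DF(t): number of texts that contain the word t."""
    df = Counter()
    for words in texts:
        df.update(set(words))
    return df


def window_pairs(words, window=3):
    """Unordered word pairs that occur inside some run of `window` consecutive words."""
    pairs = set()
    for start in range(len(words)):
        chunk = set(words[start:start + window])
        pairs.update(combinations(sorted(chunk), 2))
    return pairs


def good_pairs(texts, a):
    """Pairs with m > DF_min - DF_min / a (m assumed to be a per-text count)."""
    df = document_frequency(texts)
    m = Counter()
    for words in texts:
        m.update(window_pairs(words))
    good = set()
    for (t_i, t_j), count in m.items():
        df_min = min(df[t_i], df[t_j])
        if count > df_min - df_min / a:
            good.add((t_i, t_j))
    return good


# Toy collection of tokenized texts (hypothetical data).
texts = [
    ["word", "sense", "ambiguity"],
    ["word", "sense", "lexicon"],
    ["word", "corpus", "sense"],
    ["word", "lexicon", "size", "sense"],
]
print(("sense", "word") in good_pairs(texts, a=2))  # True:  m = 3 > 4 - 4/2
print(("sense", "word") in good_pairs(texts, a=8))  # False: m = 3 is not > 4 - 4/8
```

With $a = 1$ the threshold is zero, so every observed pair is good; larger values of $a$ require the two words to appear together in almost every text that contains the rarer of them.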
We have tested three models of choosing the set of candidate words (word applicants) $P$.
1. Create the set of words that occur near each $t_{freq}$ within a window of three. For the created set «good» pairs are found. The set $P$ consists of the words contained in at least one «good» pair.
2. For the frequent words $t_{freq}$ the «good» pairs are found ($t_{freq}$ is one of the two words in such pairs). The set $P$ consists of the words contained in at least one «good» pair.
3. Create the set of words that occur near each $t_{freq}$ within a window of three. All these words are considered candidate words and are included in $P$.
For the first tests we used the CICling-2002 collection (http://sinai.uiaen.es/timm/wiki/index.php/CICLing-2002 Clustering Corpus): the worst results were obtained with the third model and the best with the first. We therefore use the first model in the algorithm, and all subsequent experiments are made with it.
The output of the first stage is a subset $P$ of the collection's words. From these candidate words we choose the significant words at the second stage. Words of the collection that are not in $P$ are not used further.
The second stage selects from the set $P$ groups of words of size $b$ such that all words of a group occur together in several texts (two or more). A word that belongs to at least one such group is considered significant; all other words are not used further. The value of $b$ is set manually with respect to the size of $P$. We use a genetic algorithm for this task: it finds terms that occur together in documents. The input parameter $b$ sets the minimal size of an individual in the population, where each individual is a group of terms $gr_k$. For example, if $b = 4$, then the result set $GR$ consists of word groups $gr_k$ with $|gr_k| = 4$ that appear together in $k$ texts. We define $GR = \{gr_k\}$, $k \in \{2, \dots, |D|\}$, so word groups with different $k$ can be selected. This was done for future research on texts of different sizes; for short texts, however, $GR$ is taken as a whole.
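The search for co-occurring word groups can be illustrated with a deliberately simplified genetic search. The sketch below is not the authors' implementation: it keeps every group at exactly $b$ words, takes the fitness of a group to be the number of texts that contain all of its words, and uses a hypothetical population size, generation count and random seed. Mutation here swaps one word for a word from the whole candidate set, and crossover swaps one word for a word from another surviving group.

```python
import random


def fitness(group, text_sets):
    """Number of texts that contain every word of the group."""
    return sum(1 for words in text_sets if group <= words)


def find_significant_words(candidates, texts, b=2, generations=30,
                           population_size=12, seed=0):
    """Union of all size-b groups met during the search that occur in >= 2 texts."""
    rng = random.Random(seed)
    vocab = sorted(candidates)
    text_sets = [set(words) for words in texts]
    population = [frozenset(rng.sample(vocab, b)) for _ in range(population_size)]
    significant = set()
    for _ in range(generations):
        ranked = sorted(population, key=lambda g: fitness(g, text_sets), reverse=True)
        for group in ranked:
            if fitness(group, text_sets) >= 2:
                significant |= group
        survivors = ranked[: population_size // 2]   # selection: keep the fitter half
        offspring = []
        while len(survivors) + len(offspring) < population_size:
            parent = sorted(rng.choice(survivors))
            if rng.random() < 0.5:
                donor = vocab                          # mutation: any candidate word
            else:
                donor = sorted(rng.choice(survivors))  # crossover: word from another group
            child = set(parent)
            child.discard(rng.choice(parent))
            pool = [w for w in donor if w not in child]
            child.add(rng.choice(pool))                # pool is never empty when len(vocab) >= b
            offspring.append(frozenset(child))
        population = survivors + offspring
    return significant


# Toy data (hypothetical): "noise" and "corpus" each occur in a single text.
texts = [
    ["word", "sense", "ambiguity", "noise"],
    ["word", "sense", "lexicon"],
    ["lexicon", "corpus", "word"],
]
candidates = {"word", "sense", "lexicon", "noise", "corpus"}
print(sorted(find_significant_words(candidates, texts, b=2)))
```

Whatever the random choices are, a word that never shares two texts with another candidate cannot enter a group with fitness of two or more, so it is never reported as significant.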
The basic algorithm is a classical realization of a genetic algorithm. It can be presented as $Gen(W, f_{sel}, m_{mut}, f_{kr}, F_{fit}, F^T_{fit})$, where $W$ is the dimension of hypotheses, $f_{sel}$ is the selection function, $m_{mut}$ is the mutation function, $f_{kr}$ is the crossover function, $F_{fit}$ is the fitness function, and $F^T_{fit}$ is the target value of the fitness function. $W$ is defined by the different combinations of words $t_i \in T$: $W = \{\mathrm{boolean}\}^{|T|}$, $|W| = 2^{|T|}$. In the first step a set of individuals is generated: $W_p \subset W$, $W_p = \{w_p\}$, $|w_p| = b$. Every individual $w_p$ contains $b$ terms: $w_p = \{t_i\}_{i=\overline{1,b}}$. Then, for each $w_p$, the fitness value $F_{fit}$ is calculated. In the current realization $F_{fit}$ is calculated as the number of texts in which all the terms that form the individual occur. The mutation function randomly adds one or two terms to an individual with a high value of $F_{fit}$. The crossover function randomly chooses $m$ elements of an individual with a high $F_{fit}$ value and replaces them with $m$ or $m + 1$ elements from another individual with a high $F_{fit}$. The algorithm is iterative. In each iteration the $F_{fit}$ measure is calculated for all individuals of the current population. The selection function $f_{sel}$ keeps only the individuals with the best $F_{fit}$ for the new population. The other individuals are replaced with individuals obtained by mutation and crossover. Using the second stage improves the results of the method overall.
2.2. MODIFICATION OF K-MEANS ALGORITHM
The k-means algorithm with some modifications was chosen as the basis for clustering. The optimal number of clusters is usually unknown during clustering. We therefore use a variable number of clusters, which changes during the clustering process. This change is regulated by a number of rules.
1. In the first step the algorithm defines one seed $c_1$ randomly. Then the distances $\rho$ from all clustered texts to $c_1$ are calculated, $R = \{\rho_{d_j}\}_{d_j \in D}$. We take the biggest distance $\rho_{max}$ and determine the parameter $l$ so that $l \in [\rho_{max} - 3, \rho_{max})$ and only a few texts $d_j \in D$ have $\rho_{d_j} \in [l, \rho_{max}]$.
2. After $l$ is found, a new seed $c_1$ is defined randomly and the distances $\rho_{d_j}$ to this seed are calculated. While iterating over the texts, if we find a text $d_g$, $g \in \{1, \dots, |D|\}$, with $\rho_{d_g} > l$, then $d_g$ becomes a new seed $c_2$.
3. Then a new set of distances $\rho_{d_j}$ is calculated, where $\rho_{d_j}$ is the distance from text $d_j$ to the closest seed.
4. Each text with $\rho_{d_j} \le l$ is placed into the cluster formed by its closest seed.
5. If there are texts $d_q$ with $\rho_{d_q} > l$, they are put into a temporary set $N$. The next seed is defined as the text from $N$ with the biggest distance $\rho_{d_q}$: $c_{next} = (d_q \mid \rho_{d_q} = \max(N))$. After that the algorithm goes to step 3.
6. If there are no texts $d_q$, $q \in \{1, \dots, |D|\}$, with $\rho_{d_q} > l$, then the iterative k-means seed optimization is initiated. It stops when the seeds no longer change or when it finds a text $d_q$ with $\rho_{d_q} > l$. In the latter case the algorithm goes to step 5.
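The seed-growing idea of the modified k-means can be sketched as follows. This is a simplified illustration, not the authors' code: the radius `l` is passed in by hand instead of being derived from the largest distances, every additional seed (including the second one) is taken as the farthest text, and the names `grow_seeds` and `cluster`, the iteration cap and the toy vectors are hypothetical.

```python
import math
import random


def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(vec, seeds):
    """Index of the closest seed and the distance to it."""
    distances = [dist(vec, s) for s in seeds]
    best = min(range(len(seeds)), key=distances.__getitem__)
    return best, distances[best]


def grow_seeds(vectors, seeds, l):
    """Add the farthest text as a new seed until every text is within l of a seed."""
    while True:
        d_max, i_max = max((nearest(v, seeds)[1], i) for i, v in enumerate(vectors))
        if d_max <= l:
            return seeds
        seeds.append(list(vectors[i_max]))


def cluster(vectors, l, seed=0, max_iter=100):
    """Alternate seed growing with ordinary k-means centroid updates."""
    rng = random.Random(seed)
    seeds = [list(rng.choice(vectors))]           # random first seed
    for _ in range(max_iter):
        seeds = grow_seeds(vectors, seeds, l)
        labels = [nearest(v, seeds)[0] for v in vectors]
        new_seeds = []
        for k, old in enumerate(seeds):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                new_seeds.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_seeds.append(old)             # keep the seed of an empty cluster
        if new_seeds == seeds:                    # seeds stopped changing
            break
        seeds = new_seeds
    labels = [nearest(v, seeds)[0] for v in vectors]
    return labels, seeds


# Two well separated toy groups (hypothetical data).
vectors = [[0, 0], [0, 1], [5, 5], [5, 6]]
labels, seeds = cluster(vectors, l=2)
print(len(seeds), labels[0] == labels[1], labels[2] == labels[3])  # 2 True True
```

The number of clusters is never given in advance: it emerges from how many texts lie farther than `l` from every existing seed.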
3. RESULTS OF TESTS
3.1. TEST COLLECTIONS
To test algorithms based on the analysis of narrow-domain short texts there exist a number of collections, such as CICling-2002, SEPLN-CICling, Hep-ex, KnCr corpus [10], EasyAbstracts and others. Most of them can be found on the Internet (see footnote 1). To test the quality of clustering we use the FM-measure, based on the F-measure:

$$FM = \sum_i \frac{|G_i|}{|D|} \max_j F_{ij}, \quad \text{where} \quad F_{ij} = \frac{2 \cdot P_{ij} \cdot R_{ij}}{P_{ij} + R_{ij}}, \quad P_{ij} = \frac{|G_i \cap C_j|}{|G_i|}, \quad R_{ij} = \frac{|G_i \cap C_j|}{|C_j|},$$

$G = \{G_i\}$ is the obtained set of clusters and $C = \{C_j\}_{j=\overline{1,n}}$ is the set of classes defined by experts. All test results in this paper are calculated using the FM-measure. Clustering results for all the mentioned collections, expressed in the FM-measure, are published in [2, 4, 7]. For our experiments we used the collections CICling-2002 and SEPLN-CICling. CICling-2002 contains 48 short texts in the field of linguistics. Its «golden standard» contains 4 groups of texts: Linguistic, Ambiguity, Lexicon and Text Processing. SEPLN-CICling contains 48 short texts; its «golden standard» contains 4 groups of texts: Morphological-syntactic analysis, Categorization of Documents, Corpus linguistics, Machine translation.
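The FM-measure is straightforward to compute once clusters and expert classes are given as sets of text identifiers. The sketch below assumes that the clusters form a partition of the collection, so $|D|$ is taken as the sum of the cluster sizes; the function name `fm_measure` and the toy data are hypothetical.

```python
def fm_measure(clusters, classes):
    """FM = sum_i |G_i| / |D| * max_j F_ij for clusters G_i and expert classes C_j."""
    total = sum(len(g) for g in clusters)      # |D|, assuming clusters partition D
    score = 0.0
    for g in clusters:
        best_f = 0.0
        for c in classes:
            overlap = len(g & c)
            if overlap == 0:
                continue                       # F_ij = 0 when there is no overlap
            p = overlap / len(g)               # precision P_ij
            r = overlap / len(c)               # recall R_ij
            best_f = max(best_f, 2 * p * r / (p + r))
        score += len(g) / total * best_f
    return score


# Six texts, two expert classes, one imperfect clustering (toy data).
classes = [{1, 2, 3}, {4, 5, 6}]
clusters = [{1, 2}, {3, 4, 5, 6}]
print(fm_measure(classes, classes))            # 1.0 for a perfect clustering
print(round(fm_measure(clusters, classes), 3)) # 0.838
```

In the second call the first cluster reaches $F = 0.8$ against the first class and the second cluster reaches $F = 6/7$ against the second class, which gives $FM = \frac{2}{6} \cdot 0.8 + \frac{4}{6} \cdot \frac{6}{7} \approx 0.838$.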
3.2. PARAMETERIZATION AND RESULTS
We have conducted a series of experiments whose goals were:
1) to define the relation between parameters $a$ and $b$;
2) to evaluate the necessity of the second stage of the first part of the algorithm, which precedes clustering.
We used the CICling-2002 and SEPLN-CICling collections for these experiments. In each experiment the algorithm was run a thousand times. The clustering result of each run was evaluated with the FM-measure. Based on the FM values we find three indicators: $FM_{max}$, the best FM-measure value of the experiment; $FM_{min}$, the worst value; and $FM_{avg}$, the mean value.
For the CICling-2002 collection we used the parameter $J = 29$ ($J_{max} = 30$ is the maximum possible value for this collection). The other parameters were defined using the observation, obtained by testing, that the number of significant words has the biggest impact on the clustering result. If the number of significant words is about 1 % of the collection's whole vocabulary, the best results are not reached, but the clustering quality is still reasonable. If we increase the number of significant words to more than 2.5 %, the best FM-measure results can be obtained, but the average results become worse. We therefore defined parameters $a$ and $b$ so that the number of significant words lies between 1 % and 2.5 % of the initial collection's vocabulary. Increasing $a$ and $b$ reduces the number of significant words found. It also increases the quality of the significant words as long as their number is not less than 1-1.5 % of the vocabulary size. If the value of $a$ is big, the importance of parameter $b$ decreases. With big values of both $a$ and $b$, no significant words are found. We used $b = 2$ and $b = 3$ for the CICling-2002 collection. Results of testing with different values of parameter $a$ are presented in table 1 (* is the number of significant words). It also shows the effect of using the genetic algorithm. For SEPLN-CICling we use $b = 2$ and $b = 3$, $a = 5$ and $a = 4$, $J = 26$ and $J = 18$ ($J_{max} = 27$ is the maximum possible value for this collection). We compare our test results with the results presented in [2] for CICling-2002 in table 2 and for SEPLN-CICling in table 3 (K-Means [3], MajorClust [13], DBSCAN [6], CLUDIPSO [2]). We also compared the results with the case where a precise method was used instead of the genetic algorithm. It chooses all pairs and triplets of words that occur together in more than one text. The set of significant words is built as the union of all obtained pairs and triplets. The result of the algorithm with the precise method is shown in table 4.
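The precise method is a plain exhaustive count and can be sketched directly. In the snippet below the function name `precise_significant_words`, its parameters and the toy data are hypothetical; it enumerates every pair and triplet of candidate words in each text and keeps the words of groups seen in at least two texts.

```python
from itertools import combinations


def precise_significant_words(candidates, texts, sizes=(2, 3), min_texts=2):
    """Union of all candidate-word pairs and triplets that co-occur in >= min_texts texts."""
    candidates = set(candidates)
    counts = {}
    for words in texts:
        present = sorted(set(words) & candidates)
        for size in sizes:
            for group in combinations(present, size):
                counts[group] = counts.get(group, 0) + 1
    significant = set()
    for group, n in counts.items():
        if n >= min_texts:
            significant.update(group)
    return significant


# Toy data (hypothetical): "noise" and "corpus" each occur in a single text.
texts = [
    ["word", "sense", "ambiguity", "noise"],
    ["word", "sense", "lexicon"],
    ["lexicon", "corpus", "word"],
]
candidates = {"word", "sense", "lexicon", "noise", "corpus"}
print(sorted(precise_significant_words(candidates, texts)))  # ['lexicon', 'sense', 'word']
```

The cost grows with the cube of the number of candidate words per text, which stays manageable only because the texts are short and the candidate set is already reduced.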
Using the modified k-means algorithm is possible because the number of words that represent the text vectors is small. As a result, during the first step of clustering the distances from the seed to the most distant texts vary little (for example, from 9 to 11), and the value of parameter $l$ can be selected automatically. We recommend defining $l$ as one of the highest distance values obtained during the first step of clustering; the number of texts with distances higher than $l$ should not exceed 7 (for these collections). A small variation between the highest distances is possible if a small number of words is used to represent the text vectors. The binary vectors that represent texts should not contain many «0» values («0» in a text's vector means that the text does not contain the corresponding word). The first part of the algorithm (term selection) filters out words with a low DF. The words selected in the first part are context words, not just common ones. This is a result of the good pairs selection. However, during this selection some words with a single appearance in some of the collection's texts can be added to the resulting set of words. This problem is solved by GA, which filters them out. Thus, words that have a specific context and a high DF are selected. This provides the result of the algorithm.
1 http://users.dsic.upv.es/grupos/nle/?file=kop4.php
Table 1. Best FM-measures for different values of parameter a (* is the number of significant words, listed in the order given for each group of columns)
                    a = 1                   a = 4                   a = 5
              FMavg  FMmin  FMmax     FMavg  FMmin  FMmax     FMavg  FMmin  FMmax
No GA (*: 110, 67, 36)
  avg         0.51   0.39   0.54      0.49   0.34   0.57      0.47   0.4    0.59
  min         0.45   0.36   0.59      0.47   0.39   0.55      0.47   0.4    0.59
  max         0.45   0.36   0.59      0.45   0.34   0.6       0.47   0.4    0.59
With GA (*: 13-20, 12-18)
  avg         0.51   0.39   0.54      0.5    0.4    0.62      0.5    0.4    0.58
  min         0.51   0.39   0.54      0.5    0.4    0.62      0.49   0.4    0.63
  max         0.45   0.33   0.65      0.49   0.36   0.66      0.46   0.38   0.66
                    a = 6                   a = 7                   a = 8
              FMavg  FMmin  FMmax     FMavg  FMmin  FMmax     FMavg  FMmin  FMmax
No GA (*: 26, 15, 10)
  avg         0.48   0.38   0.61      0.49   0.42   0.6       0.49   0.42   0.6
  min         0.48   0.38   0.54      0.47   0.43   0.61      0.49   0.42   0.6
  max         0.48   0.38   0.61      0.47   0.43   0.61      0.49   0.42   0.6
With GA (*: 10-14, 8)
  avg         0.51   0.4    0.6       0.48   0.44   0.51      -      -      -
  min         0.48   0.41   0.56      0.48   0.44   0.51      -      -      -
  max         0.51   0.4    0.61      0.44   0.35   0.56      -      -      -
Table 2 (part 1). CICling-2002: best FM-measures for different algorithms
              FMavg  FMmin  FMmax
K-Means       0.45   0.35   0.6
MajorClust    0.43   0.37   0.58
DBSCAN        0.47   0.42   0.56
CLUDIPSO      0.63   0.42   0.74
Table 2 (part 2). CICling-2002: best FM-measures for the algorithm described here
              FMavg  FMmin  FMmax
FMavg         0.51   0.4    0.6
FMmin         0.49   0.42   0.6
FMmax         0.49   0.36   0.66
Table 3 (part 1). SEPLN-CICling: best FM-measures for different algorithms
              FMavg  FMmin  FMmax
K-Means       0.49   0.36   0.69
MajorClust    0.59   0.4    0.77
DBSCAN        0.63   0.4    0.77
CLUDIPSO      0.72   0.58   0.85
Table 3 (part 2). SEPLN-CICling: best FM-measures for the algorithm described here
              FMavg  FMmin  FMmax
FMavg         0.6    0.45   0.71
FMmin         0.58   0.52   0.67
FMmax         0.6    0.45   0.71
Table 4. CICling-2002: best FM-measures for the algorithm with the precise method
              FMavg  FMmin  FMmax
FMavg         0.49   0.4    0.61
FMmin         0.49   0.4    0.61
FMmax         0.49   0.4    0.61
4. CONCLUSION
We believe that the initially adopted hypotheses move us in the right direction, because the set of significant words built by the first stage of the algorithm is informative and contains terms that reflect the nuances of the texts. Examples of significant words for CICling-2002: base, corpu, lexic, select, paper, evalu, languag, word, larg, document, approach, differ, linguist, inform, kind, knowledg, mean, automat, system; and for SEPLN-CICling: clustering, based, linguistic, language, corpus, order, translation, important, computational, part, results, machine.
The genetic algorithm chooses about 50-70 % of the words found by the precise method. Despite this, the clustering algorithm gives better results with the output of the genetic algorithm. We assume that the better results are obtained because GA uses a random choice of objects to process. The probability that the genetic algorithm creates a pair or triplet with some word rises with the probability that this word occurs together with another word in more than one text. In other words, GA filters out random words that occur in different texts without dependency on other terms; thus terms without a specific context of usage are filtered out by GA.
The proposed modification of k-means gives better results than the non-modified version, but we believe that the clustering algorithm needs improvement. This is due to the specific features of narrow-domain collections. When clustering narrow-domain collections, most documents in the clustering area are placed very close to each other, and even if there are 2 or 3 defined seeds with a large relative distance between them, the border between clusters, which divides a dense area of texts, is more or less illusory. We assume that solving this problem requires more complex algorithms that measure the increase and decrease of object densities in the clustering area.
Words with a low value of DF do not have enough significance for clustering and can be neglected, because the algorithm works with words of high DF value. We also assume that context-significant words are characterized by a high DF value in such collections. We believe this is an interesting observation that requires further research and could lead to simplification and improvement of the term selection procedure.
References
1. Alexandrov M., Gelbukh A., Rosso P. An approach to clustering abstracts / In Proc. of the 10th International NLDB-05 Conference. Vol. 3513 of Lecture Notes in Computer Science. Springer-Verlag, 2005. P. 8-13.
2. Cagnina L., Errecalde M., Ingaramo D., Rosso P. A discrete particle swarm optimizer for clustering short-text corpora / In: BIOMA08, 2008. P. 93-103.
3. Manning C. D., Raghavan P., Schütze H. Introduction to Information Retrieval. Cambridge University Press, 2008 // URL: http://nlp.stanford.edu/IR-book/information-retrieval-book.html.
4. Errecalde M., Ingaramo D., Rosso P. A new AntTree-based algorithm for clustering short-text corpora. 2010 // URL: http://users.dsic.upv.es/~prosso/resources/ErrecaldeEtAl JCST10.pdf.
5. Errecalde M., Ingaramo D. Short-text corpora for clustering evaluation / Technical report, LIDIC, 2008 // URL: http://www.dirinfo.unsl.edu.ar/~ia/resources/shortexts.pdf.
6. Ester M., Kriegel H., Sander J., Xu X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise / In Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996. P. 226-231 // URL: http://ifsc.ualr.edu/xwxu/publications/kdd-96.pdf.
7. Ingaramo D., Pinto D., Rosso P., Errecalde M. Evaluation of internal validity measures in short-text corpora / In Proc. of the CICLing 2008 Conference. Vol. 4919 of Lecture Notes in Computer Science. Springer-Verlag, 2008. P. 555-567.
8. Makagonov P., Alexandrov M., Gelbukh A. Clustering abstracts instead of full texts / In Proc. of the Text, Speech and Dialogue 2004 Conference - TSD04. Vol. 3206 of Lecture Notes in Artificial Intelligence. Springer-Verlag, 2004. P. 129-135.
9. Pinto D. Analysis of narrow-domain short texts clustering. Research report for «Diploma de Estudios Avanzados (DEA)», Department of Information Systems and Computation, UPV, 2007.
10. Pinto D., Jimenez-Salazar H., Rosso P. Clustering abstracts of scientific texts using the transition point technique / In Proc. of the CICLing 2006 Conference. Vol. 3878 of Lecture Notes in Computer Science. Springer-Verlag, 2006. P. 536-546.
11. Pinto D., Rosso P. On the relative hardness of clustering corpora / In Proc. of the Text, Speech and Dialogue 2007 Conference - TSD07. Vol. 4629 of Lecture Notes in Artificial Intelligence. Springer-Verlag, 2007. P. 155-161.
12. Pinto D., Rosso P. KnCr: A short-text narrow-domain sub-corpus of Medline / In Proc. of TLH 2006 Conference, Advances in Computer Science, 2006. P. 266-269.
13. Stein B., Niggemann O. On the Nature of Structure and its Identification / In: Graph Theoretic Concepts in Computer Science. LNCS, No. 1665, Springer, 1999. P. 122-134.
CLUSTERING ALGORITHM FOR NARROW-DOMAIN COLLECTIONS OF SHORT DOCUMENTS
Abstract
The paper describes an algorithm for clustering narrow-domain collections of short texts, based on a modification of the k-means algorithm and a preliminary reduction of the clustering space. The proposed approach was tested on the collections CICling-2002 and SEPLN-CICling. The results obtained are presented in this paper.
Keywords: information retrieval, clustering of text collections, narrow-domain collections, short texts, k-means algorithm, genetic algorithms.
Svetlana V. Popova,
Saint-Petersburg State University,
Saint-Petersburg State Polytechnic University,
spbu@bk.ru,
Ivan A. Khodyrev,
Saint-Petersburg State Electrotechnical University,
kivan.mih@gmail.com
© Our authors, 2011.