UDC 519.213
Vestnik SPbSU. Applied Mathematics... 2018. Vol. 14. Iss. 1
A. A. Rogov1, A. G. Varfolomeyev1, A. O. Timonin1, K. A. Proenca2
A PROBABILISTIC APPROACH TO COMPARING THE DISTANCES BETWEEN PARTITIONS OF A SET*
1 Petrozavodsk State University, 33, Lenin pr., Petrozavodsk, 185910, Russian Federation
2 Feedzai, Avenida D. Joao II, Lote 1.16.01 Piso 11, Lisbon, 1990-083, Portugal
This article describes and compares a number of classical measures of similarity between partitions of a given set, such as the Rand index and the Larsen and Aone coefficient, among others. We develop a probabilistic framework for comparing these measures and a unified representation that expresses them through a common set of parameters. To do so, we grade the range of possible values of a similarity measurement between partitions using quantiles of its distribution function. Let $\lambda_\alpha$ be the quantile of level $\alpha$ for the distribution function $F_\rho(t) = P(\rho < t)$. Then, if the proximity measurement $\rho$ is not less than $\lambda_\alpha$, we can conclude that $\alpha \cdot 100\%$ of randomly chosen pairs of partitions have a proximity measurement less than $\rho$. This means that these partitions can neither be considered close nor similar. The paper derives the general form of the distribution function of the similarity measurements, with a special focus on the case where an element is uniformly distributed over the groups of a partition. Quantiles of the distribution functions, obtained by computer simulation, are tabulated for a selected set of similarity measures. Refs 9. Table 1.
Keywords: distance between partitions of a set, probabilistic approach, comparing the distances.
Rogov Aleksandr Aleksandrovich, doctor of technical sciences, professor, head of department; rogov@psu.karelia.ru
Varfolomeyev Aleksey Gennadievich, PhD of physical and mathematical sciences, associate professor; avarf@petrsu.ru
Timonin Artem Olegovich, postgraduate student; timonin.artem@gmail.com
Proenca Kseniya Aleksandrovna, data scientist; kseniya.proenca@gmail.com
* The work was supported by the Program of Strategic Development of Petrozavodsk State University within the framework of the implementation of a set of activities for the development of research activities for 2012-2016.
© Saint Petersburg State University, 2018
Introduction. The numerical comparison of partitions of a set into disjoint subsets, which we call clusters, is a well-studied subject in the literature [1-5]. Following Meila [1], we consider three types of similarity measurements between partitions:
1) by checking whether or not a given object belongs to each known cluster [2, 3];
2) by comparing clusters regarded as sets [4, 5];
3) by calculating the change produced by moving an object between two partitions [1].
There is, however, no approach for precisely comparing these similarity measurements, since each of them has its own advantages and disadvantages [6]. This paper takes a step in this direction by comparing and relating different similarity measurements.
Similarity measurements of set partitions. Assume we have a set of $n$ elements and two partitions of this set into disjoint non-empty subsets (clusters). Let $m_{jl}$ denote the number of elements that belong to cluster $j$ of the first partition and to cluster $l$ of the second partition. The paper [1] proposes to express all measurements that compare such partitions through these frequencies $m_{jl}$. We call $m_{j*}$ and $m_{*l}$ the marginal frequencies; they are equal to the numbers of elements in cluster $j$ of the first partition and in cluster $l$ of the second partition, respectively. The following relations hold for these frequencies:
$$\sum_{j,l} m_{jl} = n, \quad \sum_{l} m_{jl} = m_{j*}, \quad \sum_{j} m_{jl} = m_{*l}.$$
Let $M$ be the matrix with elements $m_{jl}$. Then, for two identical partitions, each row $j$ and each column $l$ of the matrix $M$ contains exactly one non-zero element, which lies on the main diagonal. Using the additional values
$$T = \sum_{j,l} m_{jl}^2 - n, \quad S = \sum_{j} m_{j*}^2 - n, \quad Q = \sum_{l} m_{*l}^2 - n,$$
the partition similarity index proposed by Rand [2] is equal to
$$R = 1 - \frac{S + Q - 2T}{n(n - 1)}, \tag{1}$$
and the proximity factor introduced in [3] is
$$F = \frac{T}{\sqrt{SQ}}.$$
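The quantities $T$, $S$, $Q$ and the two indices built from them can be computed directly from two label vectors. The following is a minimal sketch, assuming each partition is given as a list of cluster labels (one label per element); the function names `contingency` and `rand_and_fm` are hypothetical, and returning 0 for $F$ when $SQ = 0$ is a convention chosen here, not something stated in the paper.

```python
from collections import Counter
from math import sqrt


def contingency(x, y):
    """Count m[(j, l)]: elements lying in cluster j of x and cluster l of y."""
    return Counter(zip(x, y))


def rand_and_fm(x, y):
    """Return (R, F): the Rand index (1) and the proximity factor T / sqrt(S * Q)."""
    n = len(x)
    m = contingency(x, y)
    row = Counter(x)  # marginal frequencies m_{j*}
    col = Counter(y)  # marginal frequencies m_{*l}
    T = sum(v * v for v in m.values()) - n
    S = sum(v * v for v in row.values()) - n
    Q = sum(v * v for v in col.values()) - n
    rand = 1 - (S + Q - 2 * T) / (n * (n - 1))
    # Assumption: F is reported as 0.0 when S * Q == 0 (the ratio is undefined).
    fm = T / sqrt(S * Q) if S * Q > 0 else 0.0
    return rand, fm


# Two partitions of a 5-element set into 3 clusters each.
x = [1, 1, 1, 2, 3]
y = [1, 2, 3, 1, 1]
print(rand_and_fm(x, y))
```

For this pair every $m_{jl}$ is 0 or 1, so $T = 0$, $S = Q = 6$, and the Rand index equals $1 - 12/20 = 0.4$.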
Let $\mu_i$ be some proximity measurement between the two clusters from different partitions that contain the $i$-th element of the set. We call $\rho_k$ the mean of the proximity measurements between two partitions of a set split into $k$ non-empty subsets, which is calculated by the formula
$$\rho_k = \frac{1}{n} \sum_{i=1}^{n} \mu_i. \tag{2}$$
In the coefficient $\rho_k$, each pair of clusters $j$ and $l$ from the first and second partitions is compared as many times as the pair has common elements.
Introducing the notation $\mu(j, l)$ for the proximity coefficient between clusters $j$ and $l$ allows us to write the proximity coefficient (2) as
$$\rho_k = \frac{1}{n} \sum_{j,l} m_{jl}\, \mu(j, l).$$
Any proximity coefficient between two sets [7] or, what is the same, between two (0,1) vectors [8] can be used as the measurement $\mu(j, l)$. For example:
$$\mu_1(j, l) = \frac{m_{jl}}{m_{j*} + m_{*l} - m_{jl}}, \quad \mu_2(j, l) = \frac{m_{jl}}{\max(m_{j*}, m_{*l})}, \quad \mu_3(j, l) = \frac{2 m_{jl}}{m_{j*} + m_{*l}}. \tag{3}$$
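Obviously, $\mu(j, l)$ takes values from 0 to 1. The more matching elements the two sets have, the closer these coefficients are to 1. Substituting the coefficients (3) into the expression for $\rho_k$ gives
$$\rho_1 = \frac{1}{n} \sum_{j,l} \frac{m_{jl}^2}{m_{j*} + m_{*l} - m_{jl}}, \tag{4}$$
$$\rho_2 = \frac{1}{n} \sum_{j,l} \frac{m_{jl}^2}{\max(m_{j*}, m_{*l})}, \tag{5}$$
$$\rho_3 = \frac{1}{n} \sum_{j,l} \frac{2 m_{jl}^2}{m_{j*} + m_{*l}}. \tag{6}$$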
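As an illustration of formulas (4)-(6), the sketch below evaluates the three weighted coefficients from two label vectors. It assumes the partitions are encoded as lists of cluster labels; the function name `rho_coefficients` is hypothetical. Exact fractions are used so that the results can be checked by hand.

```python
from collections import Counter
from fractions import Fraction


def rho_coefficients(x, y):
    """Return (rho_1, rho_2, rho_3) of formulas (4)-(6) as exact fractions."""
    n = len(x)
    m = Counter(zip(x, y))           # frequencies m_{jl}
    row, col = Counter(x), Counter(y)  # marginals m_{j*} and m_{*l}
    rho1 = rho2 = rho3 = Fraction(0)
    for (j, l), m_jl in m.items():
        a, b = row[j], col[l]
        # Each term is m_{jl} times the cluster coefficient from (3).
        rho1 += Fraction(m_jl * m_jl, a + b - m_jl)
        rho2 += Fraction(m_jl * m_jl, max(a, b))
        rho3 += Fraction(2 * m_jl * m_jl, a + b)
    return rho1 / n, rho2 / n, rho3 / n


x = [1, 1, 1, 2, 3]
y = [1, 2, 3, 1, 1]
print(rho_coefficients(x, y))
```

For this pair the coefficients are $23/75$, $1/3$ and $7/15$, and for two identical partitions all three equal 1.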
The proposed similarity coefficient is a weighted sum of the proximity measurements between all pairs of clusters from the first and second partitions, with the cardinalities of the corresponding intersections used as weights. The coefficients from the papers [4, 5] are the most similar to the proposed measurement. They are also calculated using a pairwise comparison of clusters, but the summation is unweighted and is not performed over all pairs of clusters. For example, the Larsen-Aone coefficient [5] is
$$L = \frac{1}{k} \sum_{j} \max_{l} \mu_3(j, l).$$
Proximity measurements comparison. Regardless of which proximity measurement is used, the problem arises of determining which of its values can be considered large (close to 1) and which should be considered small. Solving this problem answers the question of whether the difference between the partitions is significant or appears to be random. This article develops an approach that relies on a probabilistic model for generating partitions and builds on the work described in [7, 9]. The approach allows us to set a specific threshold value that separates "big" and "small" values of the similarity measurement. If a value of the proximity measurement, or a higher one, occurs rarely, it is considered "large". Conversely, if such values occur often, the value is considered "small". The paper proposes a method for constructing quantitative estimates for the concepts "rarely" and "often" based on the probability distribution of the proximity measurement values.
We perform a random experiment that generates a pair of partitions and introduce a probability measure on the set of outcomes of the experiment. In this way we obtain a probability distribution of the proximity measurement values. This lets us calibrate the range of possible values using quantiles of the distribution function of the proximity measurement. Let $\lambda_\alpha$ be the quantile of level $\alpha$ for the distribution function $F_\rho(t) = P(\rho < t)$. Then, if the proximity measurement $\rho$ is not less than $\lambda_\alpha$, we can conclude that $\alpha \cdot 100\%$ of randomly chosen pairs of partitions have a proximity measurement less than $\rho$. A similar approach was considered in the paper [9] for comparing distances between subsets, and in the paper [7] for comparing dendrograms.
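The calibration step can be illustrated on any sample of proximity values obtained from the random experiment. The sketch below is one possible reading, not the authors' program: it assumes the sample is a plain list of numbers, uses the inverse empirical distribution function as the quantile convention (other conventions exist), and the names `empirical_cdf`, `empirical_quantile` and `is_large` are hypothetical.

```python
import math


def empirical_cdf(sample, t):
    """Estimate F(t) = P(rho < t) as the fraction of sample values strictly below t."""
    return sum(v < t for v in sample) / len(sample)


def empirical_quantile(sample, alpha):
    """Quantile of level alpha: the ceil(alpha * N)-th smallest value (assumed convention)."""
    ordered = sorted(sample)
    index = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[index]


def is_large(sample, rho, alpha):
    """True when the observed value rho is not less than the level-alpha quantile."""
    return rho >= empirical_quantile(sample, alpha)


# Toy sample of proximity values (illustrative numbers, not simulation output).
sample = [0.1, 0.2, 0.3, 0.4, 0.5]
print(empirical_quantile(sample, 0.5), empirical_cdf(sample, 0.45), is_large(sample, 0.45, 0.8))
```

In practice the sample would come from many simulated pairs of partitions, and the observed value would be compared with the quantile of the chosen level.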
The distance probability distribution in a general case. Let $U = \{u_1, u_2, \ldots, u_n\}$ be a set of $n$ elements, and let $X$ and $Y$ be two of its partitions, both consisting of $k$ groups. We represent the partitions $X$ and $Y$ as vectors $x$ and $y$ of dimension $n$. The vector $x$ is constructed according to the rule: $x_i = j$ if and only if $u_i$ belongs to the $j$-th group of $X$. The vector $y$ is built in the same way. We denote by $p_i^j$, $i = 1, \ldots, n$, $j = 1, \ldots, k$, the probability that the element $u_i$ appears in the $j$-th group. Then we can consider a random experiment that consists of $n$ independent tests, in each of which the element $u_i$ can appear in any group of each partition. Each test has $k^2$ possible outcomes $A_{ijl} = \{x_i = j, y_i = l\}$, where $i$ is the test number. Let $I(A)$ be the indicator of the event $A$. Then
$$m_{jl} = \sum_{i=1}^{n} I(A_{ijl}), \quad m_{j*} = \sum_{i=1}^{n} \sum_{l=1}^{k} I(A_{ijl}), \quad m_{*l} = \sum_{i=1}^{n} \sum_{j=1}^{k} I(A_{ijl}).$$
We construct the set $B$ of outcomes $(A_{111}, A_{112}, \ldots, A_{nkk})$, taking into account the condition that empty groups are not allowed. Each element belongs to exactly one group in each of the two partitions. Therefore, for each $i \in \{1, \ldots, n\}$ the condition $\sum_{j=1}^{k} \sum_{l=1}^{k} I(A_{ijl}) = 1$ holds, and so does the condition $\sum_{i=1}^{n} \sum_{j=1}^{k} \sum_{l=1}^{k} I(A_{ijl}) = n$.
If $I(A_{irt}) = 1$, then $I(A_{ijl}) = 0$ for any $j \neq r$, $l \neq t$. There are $(k - 1)^2$ such pairs of $j$ and $l$; therefore, for all $i \in \{1, \ldots, n\}$, $\sum_{j=1}^{k} \sum_{l=1}^{k} (1 - I(A_{ijl})) = (k - 1)^2$.
To guarantee the absence of empty groups, we introduce the conditions $\forall j: \sum_{i=1}^{n} \sum_{l=1}^{k} I(A_{ijl}) \geq 1$ and $\forall l: \sum_{i=1}^{n} \sum_{j=1}^{k} I(A_{ijl}) \geq 1$. Thus, the set $B$ of outcomes is defined by the conditions
$$B = \left\{ (A_{111}, A_{112}, \ldots, A_{nkk}) : \begin{array}{l} \forall i: \ \sum_{j=1}^{k} \sum_{l=1}^{k} I(A_{ijl}) = 1, \\ \forall i: \ \sum_{j=1}^{k} \sum_{l=1}^{k} (1 - I(A_{ijl})) = (k - 1)^2, \\ \forall j: \ \sum_{i=1}^{n} \sum_{l=1}^{k} I(A_{ijl}) \geq 1, \quad \forall l: \ \sum_{i=1}^{n} \sum_{j=1}^{k} I(A_{ijl}) \geq 1 \end{array} \right\}.$$
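To make the indicator notation concrete, the sketch below builds the array of indicators $I(A_{ijl})$ from two label vectors, recovers the frequencies $m_{jl}$ by summing over $i$, and checks the "exactly one outcome per test" and "no empty group" conditions. It is an illustration under stated assumptions: labels are integers from 1 to $k$, and the names `indicators`, `frequencies` and `has_no_empty_groups` are hypothetical.

```python
def indicators(x, y, k):
    """I[i][j][l] = 1 when element i lies in group j+1 of X and group l+1 of Y."""
    n = len(x)
    I = [[[0] * k for _ in range(k)] for _ in range(n)]
    for i in range(n):
        I[i][x[i] - 1][y[i] - 1] = 1
    return I


def frequencies(I):
    """m[j][l] is the sum of I[i][j][l] over all tests i."""
    n, k = len(I), len(I[0])
    return [[sum(I[i][j][l] for i in range(n)) for l in range(k)] for j in range(k)]


def has_no_empty_groups(I):
    """Check one outcome per test and at least one element in every group of X and Y."""
    n, k = len(I), len(I[0])
    one_per_test = all(sum(map(sum, I[i])) == 1 for i in range(n))
    x_groups = all(sum(I[i][j][l] for i in range(n) for l in range(k)) >= 1 for j in range(k))
    y_groups = all(sum(I[i][j][l] for i in range(n) for j in range(k)) >= 1 for l in range(k))
    return one_per_test and x_groups and y_groups


I = indicators([1, 1, 1, 2, 3], [1, 2, 3, 1, 1], k=3)
print(frequencies(I), has_no_empty_groups(I))
```

A pair such as `x = [1, 1, 1]`, `y = [1, 2, 3]` with `k = 3` fails the check, because groups 2 and 3 of the first partition are empty.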
As shown in the previous sections, different similarity coefficients between two partitions of a set are functions of $m_{jl}$, i.e. $\rho(X, Y) = h(m_{jl})$. To calculate the distribution function of the random variable $\rho(X, Y)$, we can use the formula for conditional probabilities. Let $H$ be the event that the partitions do not contain empty groups. Then
$$P(\rho(X, Y) < t \mid H) = \frac{P(\{\rho(X, Y) < t\} \cdot H)}{P(H)},$$
$$P(\{\rho(X, Y) < t\} \cdot H) = \sum_{m_{jl} \in C} \ \sum_{(A_{111}, \ldots, A_{nkk}) \in B} \ \prod_{j=1}^{k} \prod_{l=1}^{k} \prod_{i=1}^{n} \left(p_i^j p_i^l\right)^{I(A_{ijl})},$$
where
$$C = \left\{ m_{jl} \in \mathbb{Z} : m_{jl} \geq 0, \ \sum_{j=1}^{k} \sum_{l=1}^{k} m_{jl} = n, \ h(m_{jl}) < t \right\},$$
$$P(H) = P(\{\rho(X, Y) < \infty\} \cdot H).$$
Then the distribution function of the random variable $\rho(X, Y)$ can be written in the following form:
$$F(t) = \frac{P(\{\rho(X, Y) < t\} \cdot H)}{P(H)} = \frac{1}{P(H)} \sum_{m_{jl} \in C} \ \sum_{(A_{111}, \ldots, A_{nkk}) \in B} \ \prod_{j=1}^{k} \prod_{l=1}^{k} \prod_{i=1}^{n} \left(p_i^j p_i^l\right)^{I(A_{ijl})}. \tag{7}$$
Let us consider a special case.
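A special case. Uniform distribution. Let $p_i^j = \frac{1}{k}$. Then the formula (7) takes the form
$$F(t) = \frac{1}{P(H)} \sum_{m_{jl} \in C} \ \sum_{(A_{111}, \ldots, A_{nkk}) \in B} \ \prod_{i=1}^{n} \left(\frac{1}{k}\right)^{2} = \frac{1}{P(H)} \sum_{m_{jl} \in C} \ \sum_{(A_{111}, \ldots, A_{nkk}) \in B} \left(\frac{1}{k}\right)^{2n},$$
$$P(H) = \sum_{m_{jl} \in C_1} \ \sum_{(A_{111}, \ldots, A_{nkk}) \in B} \left(\frac{1}{k}\right)^{2n},$$
where
$$C_1 = \left\{ m_{jl} \in \mathbb{Z} : m_{jl} \geq 0, \ \sum_{j=1}^{k} \sum_{l=1}^{k} m_{jl} = n \right\}.$$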
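In the uniform case every outcome has the same weight $(1/k)^{2n}$, so $P(H)$ reduces to counting outcomes without empty groups. The sketch below does this counting under the assumption, implied by the product form of (7), that the two partitions are drawn independently: the number of admissible label vectors for one partition is the number of surjections from $n$ elements onto $k$ groups, obtained here by inclusion-exclusion. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def surjections(n, k):
    """Number of ways to assign n elements to k labelled groups with no group empty."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))


def prob_no_empty_groups(n, k):
    """P(H) for two independent uniform assignments: (surjections / k^n) squared."""
    single = Fraction(surjections(n, k), k ** n)
    return single * single


print(prob_no_empty_groups(5, 3), prob_no_empty_groups(5, 7))
```

For $n = 5$ and $k = 3$ there are 150 admissible label vectors out of $3^5 = 243$, so $P(H) = (150/243)^2$; when $k > n$ no admissible outcome exists and $P(H) = 0$.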
We developed a program to calculate the quantiles of the proximity measurement distributions using the formulas (3)-(5). The calculations were done using simulation modelling. The table presents the calculated quantile values for some values of $n$ and $k$ with different $\alpha$. In each rectangle corresponding to the same values of $n$ and $k$, the measurements were calculated with formulas (1), (4)-(6): the upper-left value was calculated using the formula (4), the upper-right value using the formula (5), the lower-left value using the formula (6), and the lower-right value using the formula (1). The number of experiments was 10 000.
Table. Proximity measurement quantiles for different α

α      k     n = 5            n = 10           n = 30
0.2    3     0.307  0.333     0.229  0.313     0.208  0.299
             0.467  0.400     0.370  0.422     0.343  0.508
       5     0.0    0.0       0.252  0.298     0.146  0.215
             0.0    0.0       0.397  0.600     0.252  0.623
       7     0.0    0.0       0.442  0.450     0.146  0.205
             0.0    0.0       0.582  0.778     0.250  0.703
0.1    3     0.307  0.333     0.229  0.313     0.209  0.300
             0.467  0.400     0.370  0.422     0.343  0.508
       5     0.0    0.0       0.254  0.292     0.147  0.216
             0.0    0.0       0.402  0.600     0.253  0.621
       7     0.0    0.0       0.442  0.450     0.146  0.206
             0.0    0.0       0.583  0.778     0.252  0.706
0.05   3     0.307  0.333     0.229  0.313     0.209  0.300
             0.467  0.400     0.370  0.422     0.344  0.508
       5     0.0    0.0       0.257  0.292     0.148  0.216
             0.0    0.0       0.399  0.600     0.252  0.623
       7     0.0    0.0       0.433  0.450     0.146  0.206
             0.0    0.0       0.582  0.778     0.252  0.706
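A simulation of this kind might be organised as in the sketch below. It is not the authors' program and is not claimed to reproduce the tabulated values: it assumes uniform, independent assignment of elements to groups with rejection of draws that leave a group empty, evaluates the coefficient (6) element by element as in formula (2), and takes the order statistic at position $\lfloor \alpha N \rfloor$ as the quantile. The names `random_partition`, `rho3` and `simulate_quantile`, the seed, and the quantile convention are all assumptions.

```python
import random


def random_partition(n, k, rng):
    """Assign n elements to k groups uniformly; redraw while some group is empty."""
    if k > n:
        raise ValueError("k non-empty groups need at least k elements")
    while True:
        labels = [rng.randrange(k) for _ in range(n)]
        if len(set(labels)) == k:
            return labels


def rho3(x, y):
    """Coefficient (6): mean over elements of 2*m_jl / (m_j* + m_*l) for their clusters."""
    n = len(x)
    size_x = {j: x.count(j) for j in set(x)}
    size_y = {l: y.count(l) for l in set(y)}
    common = {}
    for pair in zip(x, y):
        common[pair] = common.get(pair, 0) + 1
    return sum(2 * common[a, b] / (size_x[a] + size_y[b]) for a, b in zip(x, y)) / n


def simulate_quantile(n, k, alpha, trials=10_000, seed=0):
    """Estimate the level-alpha quantile of rho3 over random pairs of partitions."""
    rng = random.Random(seed)
    values = sorted(
        rho3(random_partition(n, k, rng), random_partition(n, k, rng))
        for _ in range(trials)
    )
    return values[min(trials - 1, int(alpha * trials))]


print(simulate_quantile(n=10, k=3, alpha=0.05, trials=2000))
```

Because the two partitions are drawn independently, rejecting each one separately is equivalent to rejecting the pair jointly on the event that no group is empty.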
Conclusions. The introduced estimation of proximity measurements allows us to compare values obtained on different sets and with different proximity measurements. For example, we say that the similarity measurements $\rho_1(X, Y)$ and $\rho_2(A, B)$ are statistically close with precision $\varepsilon > 0$ if the inequality $|F_1(\rho_1(X, Y)) - F_2(\rho_2(A, B))| < \varepsilon$ holds.
References
1. Meila M. Comparing clusterings by the variation of information. Learning Theory and Kernel Machines. Lecture Notes in Computer Science (Springer), 2003, vol. 2777, pp. 173-187.
2. Rand W. M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 1971, vol. 66, pp. 846-850.
3. Fowlkes E. B., Mallows C. L. A Method for comparing two Hierarchical Clusterings. Journal of the American Statistical Association, 1983, vol. 78, pp. 553-569.
4. Meila M., Heckerman D. An experimental comparison of model-based clustering methods. Machine Learning, 2001, vol. 42, pp. 9-29.
5. Larsen B., Aone C. Fast and effective text mining using linear time Document Clustering. Proceedings of the Conference on Knowledge Discovery and Data Mining, 1999, pp. 16-22.
6. Steel M. A., Penny D. Distributions of tree comparison metrics — Some new results. Systematic Biology, 1993, vol. 42, pp. 126-141.
7. Sidorov Y. V., Kirikov P. V., Rogov A. A. Sravnenie dendrogramm s ravnyim chislom vershin [Dendrograms comparison with an equal vertices number]. Scientific notes of Petrozavodsk State University. Series Natural and Technical Sciences. Petrozavodsk, Petrozavodsk State University Publ., 2011, no. 8, pp. 108-110. (In Russian)
8. Warrens M. J. On Robinsonian dissimilarities, the consecutive ones property and latent variable models. Advances in Data Analysis and Classification, 2009, vol. 3, pp. 169-184.
9. Varfolomeyev A. A., Kirikov P. V., Rogov A. A. Veroyatnostnyiy podhod k sravneniyu rasstoyaniy mezhdu podmnozhestvami konechnogo mnozhestva [Probabilistic approach to distances comparison between subsets of a finite set]. Scientific notes of Petrozavodsk State University. Petrozavodsk, Petrozavodsk State University Publ., 2010, no. 8, pp. 83-88. (In Russian)
For citation: Rogov A. A., Varfolomeyev A. G., Timonin A. O., Proenca K. A. A probabilistic approach to comparing the distances between partitions of a set. Vestnik of Saint Petersburg University. Applied Mathematics. Computer Science. Control Processes, 2018, vol. 14, iss. 1, pp. 14-19. https://doi.org/10.21638/11701/spbu10.2018.102
The article was recommended for publication by Prof. A. P. Zhabko. Received October 7, 2017. Accepted for publication January 11, 2018.