From similarity to distance: axiom set, monotonic transformations and metric determinacy

Research article in the field of Mathematics

CC BY
Keywords
METRIC SPACE / SIMILARITY AXIOMS / SIMILARITY NORMALIZATION / METRIC DETERMINACY / LONGEST COMMON SUBSEQUENCE

Abstract of the research article in mathematics; author: Znamenskij Sergej V.

How can a similarity metric be normalized into a metric space for clustering? A new system of axioms describes the known generalizations of distance metrics and similarity metrics, the Pearson correlation coefficient and the cosine metric. Equivalent definitions of order-preserving transformations of metrics (both monotonic and pivot-monotonic) are given in various terms. The metric determinacy of convex metric subspaces of R^n and Z among the pivot-monotonic transformations is proved. Faster formulae for the monotonic normalization of metrics are discussed.


From similarity to metric: axiom system, monotonic transformations and metric determinacy

We study order preservation under transformations of an arbitrary metric (similarity or distance) into a metric or semimetric space. A system of axioms is introduced that unifies in a new way the known generalizations of distance metrics and similarity metrics, the Pearson correlation coefficient and the cosine of the angle between vectors. Order-preserving (both monotonic and pivot-monotonic) transformations of metrics are equivalently defined in various terms. Metric determinacy among pivot-monotonic transformations of convex metric subspaces of R^n and Z is proved under the condition that the distance metric is convex. Formulae for accelerated monotonic normalization of similarity metrics are discussed.


UDC 004.412

From Similarity to Distance: Axiom Set, Monotonic Transformations and Metric Determinacy

Sergej V. Znamenskij*

Ailamazyan Program Systems Institute of RAS, Peter the First Street 4, Veskovo village, Pereslavl area, Yaroslavl region, 152021

Russia

Received 18.11.2017, received in revised form 22.12.2017, accepted 20.02.2018

Keywords: metric space, similarity axioms, similarity normalization, metric determinacy, longest common subsequence.

DOI: 10.17516/1997-1397-2018-11-3-331-341.

Following [9], we will call the "distance metric" a function that satisfies the axioms of a metric space, leaving the term "metric" free for use in a broad sense as a real-valued function of two variables.

The length of the longest common subsequence (LCS) of two text strings is a commonly used similarity metric. For example, it is only natural that the string b = "BRITAIN" appears to be more similar to gb = "GREAT BRITAIN" than to i = "IRAN" (l(b,gb) = 7 > 3 = l(b,i)) and that the string r = "RUSSIA" is more similar to rf = "RUSSIAN FEDERATION" than to u = "USA" (l(r,rf) = 6 > 3 = l(r,u)).
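The LCS values quoted above can be reproduced with the textbook dynamic-programming recurrence. The following is a minimal sketch; the function name `lcs_length` is chosen here for illustration and does not come from the paper.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)          # DP row for the already processed prefix of x
    for cx in x:
        cur = [0]                      # column 0: empty prefix of y
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


for a, b in [("BRITAIN", "GREAT BRITAIN"), ("BRITAIN", "IRAN"),
             ("RUSSIA", "RUSSIAN FEDERATION"), ("RUSSIA", "USA")]:
    print(a, "|", b, "->", lcs_length(a, b))
```

The self-similarity l(x,x) is simply the string length, which is the quantity written |x| in the normalization formulae.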

Data clustering algorithms usually work in metric spaces. The formulae known from [13, 26, 14, 17, 5, 35],

$$d_1(x,y) = 1 - \frac{2\,l(x,y)}{|x|+|y|}, \qquad d_2(x,y) = 1 - \frac{l(x,y)}{|x|+|y|-l(x,y)},$$

$$d_3(x,y) = 1 - \frac{l(x,y)^2}{|x|\,|y|}, \qquad d_4(x,y) = 1 - \frac{l(x,y)}{|x|\ldots}, \qquad (1)$$

$$d_5(x,y) = 1 - \frac{l(x,y)}{\min(|x|,|y|)}, \qquad (2)$$

where $|x| = l(x,x)$ and $|y| = l(y,y)$, normalize LCS for use in clustering algorithms, either through conversion to a metric or directly. Each of the formulae (1) turns (reverses) the similarity order, so that "BRITAIN" becomes closer to "IRAN", and "RUSSIA" to "USA". The turn prevents qualitative clustering.

Though a very old study of similarity metrization in psychology [29] showed the hardness of the problem, for formally specified similarity metrics including LCS the problem came into consideration only in the last decade. Though a subsequent study [14] detected some LCS turns with d2 from (1) on some data set, it did not answer the emerging questions:

- Why is no right formula known for this purpose?
- What is the reason for such a systematic clustering error?
- What are the practically effective formulae with minimal turn?

* [email protected] © Siberian Federal University. All rights reserved

Some clarification was provided in [31] noting in particular that "the trivial transformations of semimetric spaces into metric ones are not suitable for efficient similarity search" and discussing known approaches, without answers to the questions above.

The construction of a metric space avoiding the turn in similarity has been used to analyze experimental data for more than half a century [28]. The success of the multidimensional scaling technology based on this construction [4] caused essential progress in the development of the ordinal embedding theory [19, 2]. These recent studies have theoretically confirmed the metric determinacy noted in applied research: the ordinary distance metric for a domain in $\mathbb{R}^n$ is uniquely determined, up to a constant multiplier, by order comparisons.

Our situation differs: the original is no longer the metric of a domain in $\mathbb{R}^n$ but an infinite-dimensional similarity metric, and it is not the set of compared objects that should be transformed to the distance metric [9], but the similarity metric itself. We need to understand the relations between similarity metrics and distance metrics.

1. Similarity as a partial metric with minus sign

The non-negativity of distance is generally accepted. Similarity is usually evaluated with a non-negative number, so that zero means a complete lack of similarity. It is often convenient, though, to consider the lack of any special similarity as zero similarity and to use negative values for apparent opposites. The papers [20, 23] describe the use of the Pearson correlation coefficient r as a similarity metric and its transformation to a distance metric. The cosine of the angle between vectors in Euclidean space and the distance with the minus sign, s = −d, are also used as similarity metrics s.

Denote $\overline{\mathbb{R}} = \mathbb{R} \cup \{-\infty\}$, $\mathbb{R}_{+} = \{x \in \mathbb{R} : x > 0\}$ and $\mathbb{R}_{0+} = \{x \in \mathbb{R} : x \ge 0\}$.

Attempts to construct an axiom set for similarity metrics and dissimilarity metrics [32] showed that the triangle inequality from the metric space definition is not suitable for similarity metrics [33]. A new form of the triangle inequality for the similarity metric in [9] reflects similarity as a measure of coinciding content (i.e., the cardinality of a set of common characteristics, the length of the longest common subsequence, the amount of shared information, etc.):

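Definition 2 (Similarity Metric). Given a set X, a real-valued function s(x, y) on the Cartesian product X × X is a similarity metric if, for any x, y, z ∈ X, it satisfies the following conditions:
1. s(x, y) = s(y, x);
2. s(x, x) ≥ 0;
3. s(x, x) ≥ s(x, y);
4. s(x, y) + s(y, z) ≤ s(x, z) + s(y, y);
5. s(x, x) = s(y, y) = s(x, y) if and only if x = y.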

Compare this system of axioms with the system of axioms of partial metrics [22]. A partial metric space is a generalization of a metric space in which the elements can have non-zero dimensions.

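DEFINITION 3.1: A partial metric or pmetric [9] (pronounced "p-metric") is a function p : U × U → R such that for all x, y, z ∈ U
(P1) x = y ⟺ p(x, x) = p(x, y) = p(y, y),
(P2) p(x, x) ≤ p(x, y),
(P3) p(x, y) = p(y, x),
(P4) p(x, z) ≤ p(x, y) + p(y, z) − p(y, y).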

Adding the axiom $\forall x \in U\ \ p(x,x) = 0$ makes this system of axioms equivalent to the usual system of axioms of a metric space.

If we set $s(x,y) = -p(x,y)$, then we see that axioms (P1), (P2), (P3), and (P4) are exactly axioms 5, 3, 1, and 4. The remaining axiom 2 and similar later additions to the partial metric definition (e.g. [6]) reflect just the natural desire to avoid negative numbers. It would be better to move this requirement from the axiom set to the choice of the set of metric values, usually either $V = \mathbb{R}_{0+}$ or $V = [0,1]$. The second letter of its name can be uniquely associated with each of the suitable axioms:

$\forall x \in U\ \ p(x,x) = 0$: shortness, thinness;
3 (P2): direction, small self-distances [6], self-similarity [14];
5 (P1): coincidence [25], nondegenerate [15], identity of indiscernibles [9], $T_0$ separation [11], strict positiveness [8];
4 (P4): triangle inequality;
1 (P3): symmetry.

Let $U$ be an arbitrary set and $V \subset \mathbb{R}$. For a distance $f = d : U \times U \to V$ or for a similarity $f = s : U \times U \to V$, when the axiom of symmetry is dropped, the axiom of direction becomes more complicated and the full list of axioms takes the form (similarity version on the left of the bar, distance version on the right):

(h) $\forall x \in U\quad f(x,x) = 0$,
(i) $\forall x,y \in U\quad s(x,y) \le \min(s(x,x), s(y,y))\ \mid\ d(x,y) \ge \max(d(x,x), d(y,y))$,
(o) $\forall x,y \in U\quad f(x,y) = f(x,x) = f(y,y) \Rightarrow y = x$,
(r) $\forall x,y,z \in U\quad s(x,z) + s(y,y) \ge s(x,y) + s(y,z)\ \mid\ d(x,z) + d(y,y) \le d(x,y) + d(y,z)$,
(y) $\forall x,y \in U\quad f(x,y) = f(y,x)$.
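On a finite set the axioms (h), (i), (o), (r), (y) can be checked by brute force. The sketch below is only an illustration: the helper names `satisfied_axioms` and `indel_distance` are hypothetical, the distance form is checked through the reflection s = −d, and the insertion/deletion distance |x| + |y| − 2l(x,y) is used here merely as a convenient example of a distance.

```python
from itertools import product


def lcs_length(x, y):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def satisfied_axioms(U, f, kind="sim"):
    """Return the letters among h, i, o, r, y whose axioms f satisfies on finite U.

    kind="sim" checks the similarity form of (i) and (r);
    kind="dist" checks the distance form via the reflection s = -d.
    """
    sign = 1 if kind == "sim" else -1

    def s(x, y):
        return sign * f(x, y)

    pairs = list(product(U, repeat=2))
    triples = list(product(U, repeat=3))
    checks = {
        "h": all(f(x, x) == 0 for x in U),
        "i": all(s(x, y) <= min(s(x, x), s(y, y)) for x, y in pairs),
        "o": all(x == y or not (f(x, y) == f(x, x) == f(y, y)) for x, y in pairs),
        "r": all(s(x, z) + s(y, y) >= s(x, y) + s(y, z) for x, y, z in triples),
        "y": all(f(x, y) == f(y, x) for x, y in pairs),
    }
    return {letter for letter, ok in checks.items() if ok}


def indel_distance(x, y):
    """Insertion/deletion edit distance expressed through the LCS length."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


names = ["BRITAIN", "GREAT BRITAIN", "IRAN", "RUSSIA", "RUSSIAN FEDERATION", "USA"]
print(sorted(satisfied_axioms(names, lcs_length, kind="sim")))
print(sorted(satisfied_axioms(names, indel_distance, kind="dist")))
```

A brute-force check of this kind costs O(|U|^3) evaluations because of axiom (r), so it is only practical for small samples.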

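Definition 1. Let $U$ be an arbitrary set, $V \subset \mathbb{R}$, $\mathcal{U} = V^{U \times U}$ the set of all functions of two variables on $U$ with values in $V$, and $A = \{h, i, o, r, y\}$. For any $B \subset A$ denote by $\mathrm{Sim{:}}B(U,V) \subset \mathcal{U}$ the subset consisting of all functions $s \in \mathcal{U}$ that satisfy all the axioms of $B$ in the similarity form, and by $\mathrm{Dist{:}}B(U,V) \subset \mathcal{U}$ the subset consisting of all functions $d \in \mathcal{U}$ that satisfy all the axioms of $B$ in the distance form.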

Corollary 1.

1. $p$ is a partial metric on $U$ in the sense of [22] if and only if $p \in \mathrm{Dist{:}iory}(U, \mathbb{R})$.

2. $p$ is a partial metric on $U$ in the sense of [6] if and only if $p \in \mathrm{Dist{:}iory}(U, \mathbb{R}_{0+})$.

3. $d$ is a prametric [3] on $U$ if and only if $d \in \mathrm{Dist{:}hi}(U, \mathbb{R})$.

4. $d$ is a semi-metric [1] on $U$ if and only if $d \in \mathrm{Dist{:}hioy}(U, \mathbb{R})$.

5. $(U, d)$ is a metric space if and only if $d \in \mathrm{Dist{:}hiory}(U, \mathbb{R})$.

6. $d$ is a quasi-metric [36, 18] if and only if $d \in \mathrm{Dist{:}hior}(U, \mathbb{R})$.

7. $d$ is a pseudo-metric [18] if and only if $d \in \mathrm{Dist{:}hiry}(U, \mathbb{R})$.

8. $d$ is a pseudo-quasi-metric (p-q-metric) [18] if and only if $d \in \mathrm{Dist{:}hir}(U, \mathbb{R})$.

9. $s$ is a similarity metric on $U$ if and only if $s \in \mathrm{Sim{:}hiory}(U, \mathbb{R}_{0+})$.

10. $\mathrm{Sim{:}i}(U,V) \cap \mathrm{Dist{:}i}(U,V)$ consists of constants, and $\mathrm{Sim{:}io}(U,V) \cap \mathrm{Dist{:}io}(U,V) = \emptyset$ for nontrivial $U$ and $V$.

11. The Pearson correlation coefficient $r$ and the cosine of the angle between vectors belong to $\mathrm{Sim{:}iory}(\mathbb{R}^n, \mathbb{R})$.

12. $\forall B \subset A\quad s \in \mathrm{Sim{:}}B(U,V) \Rightarrow (-s) \in \mathrm{Dist{:}}B(U,V)$.

13. $\forall B \subset C \subset A\quad \mathrm{Sim{:}}C(U,V) \subset \mathrm{Sim{:}}B(U,V)$ and $\mathrm{Dist{:}}C(U,V) \subset \mathrm{Dist{:}}B(U,V)$.

We see that the cone of partial metrics on U with values in R is a mirror reflection of the cone of similarity metrics, and that the extents of these concepts are reflected by the scheme in Fig. 1, in which the metric reversion d = −s looks like a central symmetry.

Definition 2. We call a metric $s \in \mathrm{Sim{:}i}(U,V)$ an LCS-like similarity metric if it satisfies the align-base axiom on the existence of a common part:

(l) $s(x,y) = \sup\{s(z,z) : s(z,z) = s(x,z) = s(z,y),\ z \in U\}$,

and a Tversky similarity metric if the common part is always unique:

(v) $\forall x,y \in U\ \exists z \in U\quad s(x,y) = s(z,z) = s(x,z) = s(z,y)$.

[Figure 1: scheme with regions labelled "positive similarities", "similarity metrics", "negative similarities", "negative distances" and "constants".]

Fig. 1. Relations of distance and similarity metrics satisfying the direction axiom (i)

The possibility of visually representing the hierarchy of proximity with a tree (phylogenetic tree or evolutionary tree) [12], which is important for data analysis, is provided by replacing the triangle inequality with a stronger one, the additive inequality (the same as the four-point inequality)

(d) $\forall x,y,u,v \in U\quad s(x,y) + s(u,v) \ge \min(s(x,u) + s(y,v),\ s(x,v) + s(u,y))$
$\mid\ d(x,y) + d(u,v) \le \max(d(x,u) + d(y,v),\ d(x,v) + d(u,y))$,

or by the even stronger ultrametric inequality

(u) $\forall x,y,z \in U\quad s(x,z) \ge \min(s(x,y), s(y,z))\ \mid\ d(x,z) \le \max(d(x,y), d(y,z))$.

All the definitions above remain valid for the extension A = {d,i,h,o,l,r,u,v,y}. Further investigation of this axiom system appears in [37].

2. Monotonic transformations

The axiom of direction allows clustering algorithms to use closed and open balls with center at a fixed point $a$:

$$\overline{B}(a,r) = \{x \in U : d(a,x) - d(a,a) \le r\}, \qquad B(a,r) = \{x \in U : d(a,x) - d(a,a) < r\}$$

for distance and, respectively,

$$\overline{B}(a,r) = \{x \in U : s(a,a) - s(a,x) \le r\}, \qquad B(a,r) = \{x \in U : s(a,a) - s(a,x) < r\}$$

for similarity. In contrast to the usual formulae for balls, these ones, proposed in [27] (cited in [16]), guarantee that balls with negative radii are empty and that the center belongs to every ball with a positive radius.
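A small sketch of balls defined through the gap to the self-value may help. It assumes the similarity reading "s(a,a) − s(a,x)" and the distance reading "d(a,x) − d(a,a)"; the function names and the toy similarity (number of common features of two sets) are illustrative choices, not taken from the paper.

```python
def gap(f, a, x, kind="sim"):
    """How far x is from the center a, measured against the self-value f(a, a)."""
    return f(a, a) - f(a, x) if kind == "sim" else f(a, x) - f(a, a)


def closed_ball(U, f, a, r, kind="sim"):
    """Closed ball: all x whose gap to the center does not exceed r."""
    return {x for x in U if gap(f, a, x, kind) <= r}


def open_ball(U, f, a, r, kind="sim"):
    """Open ball: all x whose gap to the center is strictly below r."""
    return {x for x in U if gap(f, a, x, kind) < r}


# Toy similarity: the number of common features of two labelled feature sets.
features = {"a": {1, 2, 3}, "b": {1, 2, 3, 4, 5}, "c": {3}}


def common(x, y):
    return len(features[x] & features[y])


U = list(features)
print(closed_ball(U, common, "a", 0))   # objects sharing every feature of "a"
print(open_ball(U, common, "a", 0))     # zero radius: empty open ball
print(closed_ball(U, common, "a", -1))  # negative radius: empty ball
```

Under the direction axiom (i) the gap is never negative, which is exactly why negative radii give empty balls and any positive radius captures the center.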

Proposition 1 (equivalence of left-monotonic relatedness definitions). The following conditions on $s_1, s_2 \in \mathrm{Sim{:}i}(U,V)$ are equivalent:

1. The (non)strict inequalities between distances to the third point coincide:

$$\forall x,y,z \in U\quad s_1(x,z) < s_1(y,z) \iff s_2(x,z) < s_2(y,z); \qquad (3)$$

$$\forall x,y,z \in U\quad s_1(x,z) \ge s_1(y,z) \iff s_2(x,z) \ge s_2(y,z). \qquad (4)$$

2. Open or closed balls with any given center differ only in radii:

$$\forall \{i,j\} = \{1,2\}\ \forall a \in X\ \forall r > 0\ \exists r' > 0\quad B_{s_i}(a,r) = B_{s_j}(a,r'); \qquad (5)$$

$$\forall \{i,j\} = \{1,2\}\ \forall a \in X\ \forall r > 0\ \exists r' > 0\quad \overline{B}_{s_i}(a,r) = \overline{B}_{s_j}(a,r'). \qquad (6)$$

3. Open or closed half-spaces separating arbitrary $a, b \in U$ coincide:

$$\{x \in X : s_1(a,x) < s_1(b,x)\} = \{x \in X : s_2(a,x) < s_2(b,x)\}; \qquad (7)$$

$$\{x \in X : s_1(a,x) \le s_1(b,x)\} = \{x \in X : s_2(a,x) \le s_2(b,x)\}. \qquad (8)$$

The symmetric statement can be obtained by permuting the arguments of each $s_i$ in Proposition 1. We call $s_1$ and $s_2$ pivot-monotonically related if they are both left-monotonically related and right-monotonically related. Pivot-monotonic relatedness means exactly the preservation of the SimOrder (pre)order relation considered in [30] and subsequent studies.

We call a transformation of metrics pivot-monotonic if the image of any metric is pivot-monotonically related to its preimage (SimOrder preserving).

Proposition 2 (equivalence of monotonic relatedness [13] definitions). The following conditions on $s_1, s_2 \in \mathrm{Sim{:}i}(U,V)$ are equivalent:

1. The (non)strict inequalities between distances coincide:

$$\forall x,y,u,v \in U\quad s_1(x,y) < s_1(u,v) \iff s_2(x,y) < s_2(u,v); \qquad (9)$$

$$\forall x,y,u,v \in U\quad s_1(x,y) \ge s_1(u,v) \iff s_2(x,y) \ge s_2(u,v). \qquad (10)$$

2. Open or closed balls differ only in radii, and the signs of the inequalities between radii always coincide:

$$\forall \{i,j\} = \{1,2\}\ \forall a_1,a_2 \in U\ \forall r_1 > r_2 > 0\ \exists r'_1 > r'_2 > 0\ \forall k = 1,2\quad B_{s_i}(a_k,r_k) = B_{s_j}(a_k,r'_k); \qquad (11)$$

$$\forall \{i,j\} = \{1,2\}\ \forall a_1,a_2 \in U\ \forall r_1 > r_2 > 0\ \exists r'_1 > r'_2 > 0\ \forall k = 1,2\quad \overline{B}_{s_i}(a_k,r_k) = \overline{B}_{s_j}(a_k,r'_k). \qquad (12)$$

A transformation of metrics is called monotonic [13] if the image of any metric is monotonically related to its preimage.
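Conditions (3) and (9) are easy to test on a finite sample. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the toy similarity counts common elements of two sets, and the Jaccard index is used only as a familiar example of a length-normalized similarity, not as one of the formulae discussed in the paper.

```python
from itertools import product


def monotonically_related(U, s1, s2):
    """Condition (9): strict inequalities between any two values coincide."""
    pairs = list(product(U, repeat=2))
    return all((s1(x, y) < s1(u, v)) == (s2(x, y) < s2(u, v))
               for (x, y), (u, v) in product(pairs, repeat=2))


def left_monotonically_related(U, s1, s2):
    """Condition (3): strict inequalities between values to a third point coincide."""
    return all((s1(x, z) < s1(y, z)) == (s2(x, z) < s2(y, z))
               for x, y, z in product(U, repeat=3))


def common(x, y):
    """Similarity as the number of common elements."""
    return len(x & y)


def common_squared(x, y):
    """A strictly increasing function of `common` on non-negative values."""
    return common(x, y) ** 2


def jaccard(x, y):
    """Common elements normalized by the size of the union."""
    return len(x & y) / len(x | y)


A = frozenset(range(6))      # plays the role of a short name
B = frozenset(range(18))     # a long name containing A
C = frozenset({1, 2, 5})     # a short name sharing three elements with A
U = [A, B, C]

print(monotonically_related(U, common, common_squared))
print(monotonically_related(U, common, jaccard))
print(left_monotonically_related(U, common, jaccard))
```

Here A shares six elements with B and only three with C, yet the normalized value for (A, C) is the larger one, which is the kind of order turn the text describes.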

Proof of Propositions 1 and 2. (9) $\iff$ (10) is trivial.

(9) $\iff$ (11). Let $r'_k = \inf\limits_{x \notin B_{s_i}(a_k,r_k)} s_j(a_k,x)$. Then

$$x \in B_{s_j}(a_k,r'_k) \iff \forall y \notin B_{s_i}(a_k,r_k)\quad s_j(a_k,x) < s_j(a_k,y)$$

$$\iff \bigl(s_i(a_k,y) > r_k \Rightarrow s_j(a_k,x) < s_j(a_k,y)\bigr)$$

$$\iff \bigl(s_i(a_k,y) > r_k \Rightarrow s_i(a_k,x) < s_i(a_k,y)\bigr).$$

The last condition obviously holds under $s_i(a,x) < r_k$ and fails under $s_i(a,x) > r_k$.

(9) $\iff$ (12). Let $r'_k = \sup\limits_{x \in \overline{B}_{s_i}(a_k,r_k)} s_j(a_k,x)$. Then

$$x \notin \overline{B}_{s_j}(a_k,r'_k) \iff \forall y \in \overline{B}_{s_i}(a_k,r_k)\quad s_j(a_k,x) > s_j(a_k,y)$$

$$\iff \bigl(s_i(a_k,y) < r_k \Rightarrow s_j(a_k,x) > s_j(a_k,y)\bigr)$$

$$\iff \bigl(s_i(a_k,y) < r_k \Rightarrow s_i(a_k,x) > s_i(a_k,y)\bigr).$$

The last condition obviously holds under $s_i(a,x) > r_k$ and fails under $s_i(a,x) < r_k$.

The proofs of (3) $\iff$ (5) and (3) $\iff$ (6) are the same as above, just without the indices $k$. The equivalences (3) $\iff$ (4), (3) $\iff$ (7), and (4) $\iff$ (8) are trivial. □

Corollary 2. Left-monotonic relatedness of $s, t \in \mathrm{Sim{:}i}(U,V)$ implies

$$\forall x,y,z \in U\quad s(x,z) = s(y,z) \iff t(x,z) = t(y,z), \qquad (13)$$

and monotonic relatedness implies

$$\forall x,y,u,v \in U\quad s(x,y) = s(u,v) \iff t(x,y) = t(u,v). \qquad (14)$$

3. Convexity of metric and monotonic determinacy

Corollary 3. For the metrics $f_1, f_2 \in \mathrm{Sim{:}i}(U,V)$ to be monotonically related, it is necessary and sufficient that there exists a strictly increasing function $\Phi$, defined on the image of $f_1$, such that

$$f_2(x,y) = \Phi(f_1(x,y))\quad \forall x,y \in U. \qquad (15)$$

Denote by $f(U \times U)$ the set of all values of $f$.

Corollary 4. For the metrics $f_1, f_2 \in \mathrm{Sim{:}i}(U,V)$ to be pivot-monotonically related, it is necessary and sufficient that there exists a function $\varphi : U \times f_1(U \times U) \to V$, strictly increasing in the second argument, such that

$$f_2(x,y) = \varphi(x, f_1(x,y))\quad \forall x,y \in U. \qquad (16)$$

We call a metric transformation monotone if the image of any metric is monotonically related to its preimage.

The monotone transformation of metrics by formula (15) is called an SP-modification in [30], and the function $\Phi$ in (15) an SP-modifier. We restrict our discussion to the case $V = \mathbb{R}$, since mainly $V \subset \mathbb{R}$ is used, from which $\Phi$ can be extended to all of $\mathbb{R}$. The set of all SP-modifiers forms a partially ordered group (po-group) $G = \mathrm{Aut}(\mathbb{R})$ of automorphisms of the linearly ordered set $\mathbb{R}$. The group operation in it is the composition of functions $\Phi$, the unit is the function $\Phi(x) = x$, and the positive cone $P$ consists of all concave functions on $\mathbb{R}$ that are different from linear functions. The cone $P$ accurately characterizes those monotonic metric transformations that always preserve the triangle inequality [10] but differ from multiplication by a constant.

The partially ordered group $G$ defines on $\mathcal{U}$ a strict partial order $\succ_P$ by the rule $t \succ_P s \iff \exists \Phi \in P\ \forall x,y \in U\ \ t(x,y) = \Phi(s(x,y))$. If $t$ and $s$ here are distance metrics, then $t$ is called in [30] a triangle-generating modification, or TG-modification, of the metric $s$.

The strict partial order induces the preorder $t \succeq s \iff \forall u\ (s \succ u \Rightarrow t \succ u)$ and the equivalence relation $t \sim s \iff (t \succeq s)\ \&\ (s \succeq t)$, which means coincidence up to an affine transformation $t(x,y) = c\,s(x,y) + b$ with appropriate constants $c > 0$ and $b$. The constant term $b$ disappears if the axiom of shortness (h) is satisfied.

Note that d2 in (1) is monotonically related to the Levenshtein distance, and that the examples of geographical names cited at the beginning of the article clearly indicate the particularity of the Levenshtein formula, commonly used in data cleansing applications [21, 34] in comparison with the LCS similarity metric or other similarity metric [38].

A clear statement of our problem requires a criterion for the optimality of the distance metric. For this purpose [7] suggests using the intrinsic dimensionality, calculated from the mathematical expectation $\mu_d$ and the variance $\sigma_d^2$ of the prametric $d$ by the formula

$$\mathrm{IDim}_M(d) = \frac{\mu_d^2}{2\sigma_d^2}.$$

A comparative review of other definitions of intrinsic dimensionality, with computer evaluation, can be found in [24].
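The ratio of the squared mean to twice the variance is straightforward to estimate on a sample. The sketch below assumes the uniform measure on unordered pairs of distinct sample points, which is one possible choice and is not prescribed by the text; the function name is hypothetical.

```python
from itertools import combinations
from statistics import fmean, pvariance


def intrinsic_dimension(points, d):
    """Estimate mu^2 / (2 sigma^2) from the distances between distinct sample points."""
    values = [d(x, y) for x, y in combinations(points, 2)]
    mu = fmean(values)
    var = pvariance(values)
    if var == 0:
        return float("inf")   # all distances equal: no spread at all
    return mu * mu / (2 * var)


def segment_distance(m, n):
    """The metric |n - m| on an integer segment."""
    return abs(n - m)


print(intrinsic_dimension(range(4), segment_distance))
print(intrinsic_dimension(range(4), lambda m, n: 3 * abs(n - m)))
```

Multiplying the metric by a positive constant scales the mean and the standard deviation by the same factor, so the estimate does not change.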

Let $N_k = \{0, \ldots, k+1\}$ be an integer segment with the metric $d_{N_k}(m,n) = |n - m|$. It seems intuitively plausible that the metric $d_{N_k}$ has the smallest intrinsic dimension among all pivot-monotonically equivalent metrics.

None of the approaches to determining the intrinsic dimension considered in [24] has made it possible to prove or disprove this assertion. Most of them assume that a probability measure $\mu$ on $U \times U$ is given, which is quite natural for a data set. On $N_k$ one can use the uniform probability measure.

As an alternative, consider the notion of convexity that distinguishes this space. The convexity of a metric space is intuitively associated with the presence of a segment with arbitrary ends. Usually the definition of convexity requires the existence of the midpoint of a segment, and the consistent application of this definition gives a set of points that is dense on the segment. Such a definition is inapplicable to discrete metric spaces, and an alternative is required.

Definition 3. We call a metric $d \in \mathrm{Dist{:}i}(U,V)$, or $s = -d$, convex if for any $x, z \in U$ and any $t \in V$ inside the interval $(d(x,x), d(x,z))$ there exists $y \in U$ for which $d(x,y) = t$ and the triangle inequality for $x, y, z$ turns into an equality.

Proposition 3. Let $d \in \mathrm{Dist{:}iy}(U, N_k)$ be convex and $d(x,y) = k$. Then there exists an isometric inclusion $\psi : N_k \to U$ with the ends $\psi(0) = x$ and $\psi(k) = y$.

Proof. We use induction on $k$. For $k = 1$, by assumption, there exist $x, y \in U$ such that $d(x,y) = 1$. Setting $\psi(0) = x$ and $\psi(1) = y$, we obtain the desired isometry.

Suppose that the assertion is proved for $k = n$. By hypothesis, for $k = n + 1$ there exist $x, y \in U$ such that $d(x,y) = k$. Using the convexity of the metric with $t = n$, we obtain $m \in U$ for which $d(x,m) = t$ and $d(m,y) = d(x,y) - t$. Applying the induction hypothesis to the points $x, m$ of the open ball of radius $k$ with center at $x$, we obtain a map that remains to be extended by the equality $\psi(k) = y$. In this case, the equalities $d(\psi(i), y) = k - i$ follow from the triangle inequalities $d(\psi(0), \psi(i)) + d(\psi(i), y) \ge d(x,y) = k$ and $d(\psi(i), \psi(n)) + d(\psi(n), y) = n - i + 1 \ge d(\psi(i), y)$. □

Theorem 1. Suppose that two convex pseudometrics are monotonically related and the set of values of one of them is an arithmetic progression, or is closed with respect to addition, or is closed with respect to taking the half-sum. Then they differ by multiplication by a constant.

Note that the multiplication of a metric by a constant does not change its intrinsic dimension and convexity.

Proof. The case of an arithmetic progression follows directly from Proposition 3. Let $0 < x_1 = f(a_1,b_1) < x_2 = f(a_2,b_2)$. It is necessary to show that

$$\frac{x_1}{x_2} = \frac{\varphi(x_1)}{\varphi(x_2)}.$$

For any $k \in \mathbb{N}$ let $n_k \le \dfrac{k\,\varphi(x_1)}{\varphi(x_2)} < n_k + 1$. Let us consider the equivalent inequality:

$$n_k\,\varphi(x_2) \le k\,\varphi(x_1) < (n_k + 1)\,\varphi(x_2). \qquad (17)$$

1. Closedness with respect to the operation of addition allows us to apply Proposition 3 with an arbitrarily large $k$ to the functions $\dfrac{f(x,y)}{x_1}$, $g(x,y)$ as well as to the functions $\dfrac{f(x,y)}{x_2}$, $g(x,y)$ and obtain

$$\varphi(n_k x_2) \le \varphi(k x_1) < \varphi((n_k + 1) x_2).$$

Applying the monotonicity of $\varphi$, we get $n_k x_2 \le k x_1 < (n_k + 1) x_2$, and it remains to pass to the limit in

$$\frac{n_k}{k} \le \frac{x_1}{x_2} < \frac{n_k + 1}{k}.$$

2. Closedness with respect to the mean calculation also allows us to combine two sufficiently long arithmetic progressions of values, constructing them by midpoint selections.

Fix $2^m > r + n_k + k$ and, using midpoint selections, construct in $f(X \times X)$ two grids of size $2^m$ with the endpoints $x_{10} = 2^{-m} x_1$ and $x_{20} = 2^{-m} x_2$ respectively. Then (17) gives

$$2^{-m} n_k\,\varphi(x_2) \le 2^{-m} k\,\varphi(x_1) < 2^{-m} (n_k + 1)\,\varphi(x_2).$$

In contrast to the proof of the previous case, here we have to apply Proposition 3 to each part also in the opposite direction, with a $2^m$ grid, to get

$$\varphi(2^{-m} n_k x_2) \le \varphi(2^{-m} k x_1) < \varphi(2^{-m} (n_k + 1) x_2).$$

This and the monotonicity of $\varphi$ imply the inequality $n_k x_2 \le k x_1 < (n_k + 1) x_2$, and it remains to pass to the limit in

$$\frac{n_k}{k} \le \frac{x_1}{x_2} < \frac{n_k + 1}{k}. \qquad \square$$

Corollary 5 (monotonic determinacy of metrics). The statement "If a monotonic transformation of the distance metric is convex, then it is multiplication by a suitable positive constant" is true for each of the following metric spaces: $\mathbb{Z}^n$, $\mathbb{N}$, $N_k$, an arbitrary convex subset of $\mathbb{R}^n$.

4. Monotonic normalization

Proposition 4. Let the similarity metric $s \in \mathrm{Sim{:}iy}(U, \mathbb{R})$ satisfy $\exists x,y,z \in U\ \ s(x,y) > s(z,z)$. Then no metric $d \in \mathrm{Dist{:}hi}(U, \mathbb{R}_{0+})$ can be monotonically related to $s$.

Proof. $0 = d(z,z) > d(x,y)$. □

If we change the zero values, the situation changes.

Proposition 5. Let the similarity metric $s \in \mathrm{Sim{:}iy}(U, \mathbb{R}_{0+})$. Then the formula

$$d(x,y) = \frac{1}{2} + \frac{1}{2 + s(x,y)}, \quad x \ne y \qquad (18)$$

defines the distance metric $d \in \mathrm{Dist{:}hiory}(U, [0,1])$, which satisfies the condition (9) of monotonic relatedness to $s$ for all $x \ne y$, $u \ne v \in U$.

Unfortunately, Theorem 3 in [8] states that no current acceleration technique for nearest neighbor search will be effective for this metric, since all non-zero values are between 1/2 and 1. The formulae (1)-(2) for LCS also narrow the range of non-zero values, which usually leads to a greater intrinsic dimension and consequently to lower efficiency. The formula

$$d(x,y) = 1 - \frac{s(x,y)}{M}, \quad x \ne y \qquad (19)$$

also monotonically transforms to $\mathrm{Dist{:}hiory}(U, [0,1])$ and defines the distance metric if $M$ is large enough. It may be possible to decrease the intrinsic dimension by selecting a smaller $M$.
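The two normalizations can be tried on the geographical names from the introduction. The sketch below rests on explicit assumptions: formula (18) is read as 1/2 + 1/(2 + s(x,y)) and formula (19) as 1 − s(x,y)/M for x ≠ y, with zero on the diagonal, and M is taken as twice the largest off-diagonal similarity, which is one sufficient (not necessarily minimal) choice that keeps all non-zero values at or above 1/2. All function names are illustrative.

```python
from itertools import product


def lcs_length(x, y):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def d18(s, x, y):
    """Assumed reading of (18): values in (1/2, 1] off the diagonal."""
    return 0.0 if x == y else 0.5 + 1.0 / (2.0 + s(x, y))


def d19(s, x, y, M):
    """Assumed reading of (19): linear in s, with a scale M chosen by the caller."""
    return 0.0 if x == y else 1.0 - s(x, y) / M


def is_metric_triangle(U, d):
    """Check d(x,z) <= d(x,y) + d(y,z) on every triple of the finite set U."""
    return all(d(x, z) <= d(x, y) + d(y, z) for x, y, z in product(U, repeat=3))


def order_reversed(U, s, d):
    """Larger similarity must mean smaller distance for all off-diagonal pairs."""
    pairs = [(x, y) for x, y in product(U, repeat=2) if x != y]
    return all((s(*p) < s(*q)) == (d(*p) > d(*q)) for p in pairs for q in pairs)


names = ["BRITAIN", "GREAT BRITAIN", "IRAN", "RUSSIA", "RUSSIAN FEDERATION", "USA"]
M = 2 * max(lcs_length(x, y) for x, y in product(names, repeat=2) if x != y)

dist18 = lambda x, y: d18(lcs_length, x, y)
dist19 = lambda x, y: d19(lcs_length, x, y, M)

print(is_metric_triangle(names, dist18), order_reversed(names, lcs_length, dist18))
print(is_metric_triangle(names, dist19), order_reversed(names, lcs_length, dist19))
print(dist18("RUSSIA", "RUSSIAN FEDERATION") < dist18("RUSSIA", "USA"))
```

Both variants keep "RUSSIA" closer to "RUSSIAN FEDERATION" than to "USA"; the price, as noted in the text, is the narrow range of non-zero values.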

Proposition 6. Let the similarity metric $s \in \mathrm{Sim{:}iry}(U, \mathbb{R}_{0+})$. Then the formula $d(x,y) = s(x,x) - s(x,y)$ defines a p-q-metric $d \in \mathrm{Dist{:}hir}(U, \mathbb{R}_{0+})$ that is left-monotonically related to $s$.

However, to make it at least a pseudometric, we need some symmetrized function $\Phi(s,t)$. This function should be convex if $s$ is convex, in order to preserve the triangle inequality, and should be close to linear to avoid a high intrinsic dimension.

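Using a segment $[\varepsilon, \lambda] \subset \mathbb{R}_{0+}$ containing all possible positive values of $s$ on the data set, it is possible to limit the values of $d$ to the segment $[0,1]$:

$$d(x,y) = \lambda^{-1}\bigl(\Phi(s(x,x), s(y,y)) - s(x,y)\bigr). \qquad (20)$$

Among the admissible functions, there are the largest

$$d(x,y) = \frac{s(x,x) + s(y,y) - 2s(x,y)}{2\lambda}, \qquad (21)$$

intermediate

$$d_p(x,y) = \lambda^{-1}\Bigl(\sqrt[p]{\tfrac{(s(x,x))^p + (s(y,y))^p}{2}} - s(x,y)\Bigr), \qquad (22)$$

$$d(x,y) = \lambda^{-1}\bigl(\sqrt{s(x,x)\,s(y,y)} - s(x,y)\bigr), \qquad (23)$$

and the smallest

$$d(x,y) = \frac{\varepsilon + \min(s(x,x), s(y,y)) - s(x,y)}{\lambda + \varepsilon}, \quad x \ne y. \qquad (24)$$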
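The family (20) can be explored numerically by plugging different symmetrizing functions into the same template. The sketch below is an illustration under assumptions: it reads (20) as the difference between a mean of the two self-similarities and s(x,y), divided by an upper bound of the similarity values, and it compares the arithmetic mean, the geometric mean and the minimum (the ε-shift of the last formula is omitted). The names `normalized_distance` and `lam` are hypothetical.

```python
import math
from itertools import product


def lcs_length(x, y):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def normalized_distance(s, phi, lam):
    """Template of (20): (phi(s(x,x), s(y,y)) - s(x,y)) / lam."""
    def d(x, y):
        return (phi(s(x, x), s(y, y)) - s(x, y)) / lam
    return d


def arithmetic(a, b):
    return (a + b) / 2


def geometric(a, b):
    return math.sqrt(a * b)


names = ["BRITAIN", "GREAT BRITAIN", "IRAN", "RUSSIA", "RUSSIAN FEDERATION", "USA"]
lam = max(lcs_length(x, x) for x in names)   # upper bound of all similarity values

d_arith = normalized_distance(lcs_length, arithmetic, lam)
d_geom = normalized_distance(lcs_length, geometric, lam)
d_min = normalized_distance(lcs_length, min, lam)

for x, y in [("RUSSIA", "RUSSIAN FEDERATION"), ("RUSSIA", "USA")]:
    print(x, "|", y, round(d_arith(x, y), 3), round(d_geom(x, y), 3), round(d_min(x, y), 3))
```

On this sample the arithmetic-mean variant places "RUSSIA" farther from "RUSSIAN FEDERATION" than from "USA", while the minimum variant keeps the similarity order for this pair.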

To localize the turns in an arbitrary triplet $x, y, z \in U$, we rename the vertices so that $s(x,x) \le s(y,y) \le s(z,z)$. The pivot-monotonicity is violated if $\Phi(s(y,y), s(z,z)) - s(y,z) < \Phi(s(x,x), s(y,y)) - s(x,y)$ for $s(x,y) < s(y,z)$. For $d_8$ this means $2(s(y,z) - s(x,y)) > s(y,y) - s(x,x)$, and the Levenshtein metric $d_6$ turns more often: $2(s(y,z) - s(x,y)) > s(z,z) - s(x,x)$.

In each of these cases, rather natural restrictions on $s$ provide the convexity of the metric, so that Theorem 1 confirms the quality of the metric. The possible normalization formulae with some generic restrictions on usage are shown in Tab. 1.

Table 1. Expected performance of normalization formulae for convex similarities

Formula    | Turns     | Speed  | Recommended applications
(18)       | never     | slow   | none
(2)        | rare      | slow   | none
(1)        | less rare | slow   | none
(19)       | never     | ?      | cluster analysis of medium size data
(21)       | rare      | faster | focused only on objects of high similarity (such as correcting typographical errors)
(22), (23) | more rare | faster | ?
(24)       | most rare | faster | common use

This work was performed under financial support from the Government, represented by the Ministry of Education and Science of the Russian Federation (Project ID RFMEFI60716X0153).

References

[1] C.Alexander, Semi-developable space and quotient images of metric spaces, Pacific J. Math, 37(1971), 277-293.

[2] E.Arias-Castro, Some theory for ordinal embedding, arXiv:1501.02861 [math.ST], 2015.

[3] A.V.Arkhangel'skii, L.S.Pontryagin, General Topology I: Basic Concepts and Constructions Dimension Theory, Springer, 1990.

[4] I.Borg, P.J.Groenen, Modern multidimensional scaling: Theory and applications, Springer, 2005.

[5] S.Budalakoti, R.Akella, A.N.Srivastava, E.Turkov, Anomaly Detection in Large Sets of High-Dimensional Symbol Sequences, NASA/TM-2006-214553, September, 2006.

[6] M.Bukatin, R.Kopperman, S.Matthews, H.Pajoohesh, Partial metric spaces, Amer. Math. Monthly, 116(2009), no. 8, 708-718.

[7] E.Chavez, G.Navarro, A Probabilistic Spell for the Curse of Dimensionality. ALENEX'01, LNCS 2153, Springer, (2001), 147-160.

[8] E.Chavez, G.Navarro, R.Baeza-Yates, J.L.Marroquin, Searching in metric spaces, ACM Computing Surveys, 33(2001), no. 3, 273-321.

[9] S.Chen, B.Ma, K.Zhang, On the similarity metric and the distance metric, Theoretical Computer Science, 410(2009), no. 24-25, 2365-2376.

[10] P.Corazza, Introduction to metric-preserving functions, American Mathematical Monthly, 104(1999), no. 4, 309-323.

[11] M.M.Deza, E.Deza, Encyclopedia of Distances, Springer, 2009.

[12] A.J.Dobson, Unrooted Trees for Numerical Taxonomy, Journal of Applied Probability, 11(1974), no. 1, 32-42.

[13] N. J. P. vanEck, L.Waltman, How to Normalize Co-Occurrence Data? An Analysis of Some Well-Known Similarity Measures (No. ERS-2009-001-LIS), ERIM report series research in management Erasmus Research Institute of Management, Erasmus Research Institute of Management, 2009. Retrieved from http://hdl.handle.net/1765/14528.

[14] C.H.Elzinga, M.Studer, Normalization of Distance and Similarity in Sequence Analysis, LaCOSA II, Lausanne, June 8-10, 2016, 445-468.

[15] D.J.Greenhoe, Properties of distance spaces with power triangle inequalities, 2016. https://doi.org/10.7287/peerj.preprints.2055v1.

[16] R.Heckmann, Approximation of metric spaces by partial metric spaces, InformatikBerichte 96-04, Technische Universitat Braunschweig, 1996, Workshop Domains II.

[17] A.Islam, D.Inkpen, Semantic text similarity using corpus-based word similarity and string similarity ACM Transactions on Knowledge Discovery from Data, 2(2008), no. 2, 1-25.

[18] J.C.Kelly, Bitopological spaces, Proc. London Math. Soc., 13(1963), no. 3, 71-89.

[19] M.Kleindessner, U. von Luxburg, Uniqueness of Ordinal Embedding JMLR: Workshop and Conference Proceedings, vol. 35, 1-28, 2014.

[20] L.Leydesdorff, L.Vaughan, Co-occurrence matrices and their applications in information science: extending ACA to the web environment, Journal of the American Society for Information Science and Technology, 57(2006), no. 12, 1616-1628.

[21] S.Lim, Cleansing Noisy City Names in Spatial Data Mining. 2010 International Conference on Information Science and Applications (ICISA), 2010.

[22] S.G.Matthews, Partial metric topology, in: Proc. 8th Summer Conference on General Topology and Applications, Ann. New York Acad. Sci., 728(1994), 183-197.

[23] E.Megnigbeto, Controversies arising from which similarity measures can be used in co-citation analysis, Malaysian Journal of Library & Information Science, 18(2013), no. 2, 25-31.

[24] G.Navarro, R.Paredes, N.Reyes, C.Bustos, An empirical evaluation of intrinsic dimension estimators, Information Systems, 64(2017), 206-218.

[25] V.W.Niemytzki, On the "third axiom of metric space", Trans. Amer. Math. Soc., 29(1927), 507-513.

[26] K.Nyirarugira, T.Kim, Stratified gesture recognition using the normalized longest common subsequence with rough sets, In Signal Processing: Image Communication, Vol. 30, 2015, 178-189, ISSN 0923-5965. https://doi.org/10.1016/j.image.2014.10.008.

[27] S.J. O'Neill, Two topologies are better than one, Technical report, University of Warwick, April 1995.

[28] R.N.Shepard, The analysis of proximities: Multidimensional scaling with an unknown distance function. I, Psychometrika, 27(1962), 125-140.

[29] R.N.Shepard, Representation of structure in similarity data: Problems and prospects. Psychometrika, 39(1974), no. 4, 373-422.

[30] T.Skopal, On fast non-metric similarity search by metric access methods, In Proc. 10th International Conference on Extending Database Technology (EDBT'06), LNCS 3896, Springer, 2006, 718-736.

[31] T.Skopal, B.Bustos, On nonmetric similarity search problems in complex domains, ACM Computing Surveys, 43(2011), no. 4, Article 34.

[32] A.Tversky, Features of similarity, Psychological Review, 84(1977), 327-352.

[33] A.Tversky, I.Gati, Similarity, separability and the triangle inequality, Psychological Review, 89(1982), 123-154.

[34] A.Ugon, T.Nicolas, M.Richard, P.Guerin, P.Chansard, C.Demoor, L.Toubiana, (2015). A new approach for cleansing geographical dataset using Levenshtein distance, prior knowledge and contextual information, Studies in health technology and informatics, Vol. 210, 227-229. 10.3233/978-1-61499-512-8-227.

[35] M.Vlachos, G.Kollios, D.Gunopulos, Discovering similar multidimensional trajectories, In Proceedings of the International Conference on Data Engineering, ICDE '02, San Jose, CA, USA, IEEE Computer Society Press, 2002, 673-684.

[36] W.A.Wilson, On quasi-metric spaces, Am. J. Math., 53(1931), 675-684.

[37] S.V.Znamenskij, Models and axioms for similarity metrics Programmnye systemy: Theoriya i prilozheniya, 8(2017), no. 4, 247-357 (in Russian).

[38] S.Znamenskii, V.Dyachenko, An Alternative Model of the Strings Similarity, Selected Papers of the XIX International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2017), CEUR Workshop Proceedings (CEUR-WS.org), 177-183 (in Russian).
