
CC BY

ISSN 2079-3316 PROGRAM SYSTEMS: THEORY AND APPLICATIONS vol. 9, No. 4(39), pp. 561-578

UDC 004.416

S. V. Znamenskij

Stable assessment of the quality of similarity algorithms of character strings and their normalizations

Abstract. Choosing tools to search for hidden commonality in data of a new nature requires stable and reproducible comparative assessments of the quality of abstract algorithms for the proximity of symbol strings. Conventional estimates based on artificially generated or manually labeled tests vary significantly, evaluating the method of artificial generation rather than the similarity algorithms, while estimates based on user data cannot be accurately reproduced.

A simple, transparent, objective and reproducible numerical quality assessment of a string metric is proposed. Parallel texts of book translations in different languages are used. The quality of a measure is estimated by the percentage of errors over all possible attempts to identify the translation of a given paragraph among two paragraphs of a book in another language, one of which is actually the translation. The stability of the assessments is verified by their independence from the choice of book and language pair.

The numerical experiment stably ranked abstract character string comparison algorithms by quality and showed a strong dependence on the choice of normalization.

Key words and phrases: string similarity, data analysis, similarity metric, distance metric, numeric evaluation, quality assessment.

2010 Mathematics Subject Classification: 97P20; 91C05, 91C20.

Introduction

The task of comparing character strings arises when processing large data of a new, uncharted nature. Methods that routinely use syntax and semantics stop working. General algorithms for the similarity of symbolic sequences are tried and adapted based on new knowledge of the application area. It is therefore important to understand the effectiveness of well-known general algorithms, and of the techniques for applying them, in comparison with each other.

© S. V. Znamenskij, 2018

© Ailamazyan Program Systems Institute of RAS, 2018

© Program Systems: Theory and Applications (design), 2018

Comparing the models and algorithms requires arrays of similar strings of various origins [1], which usually come either from unpublished personal data arrays [2-5], from hand-labeled linguistic corpora or thesauri, as in [6], or from artificially generated data [7]. The public unavailability of some sources rules out reproducing the experiments and independently assessing the quality of the initial data, while the high labor cost of others limits their volume and availability. The inaccessibility, small volume and unclear origin of the initial data make the experiments unconvincing.

Parallel texts in different languages, kindly selected and provided to researchers by the Hungarian programmer and translator Andras Farkas at http://www.farkastranslations.com/bilingual_books.php, offer a remarkable opportunity to evaluate the quality of proximity metrics on freely usable data.

1. Purpose and rating scale, data sources

How do the model, the algorithm and the metric normalization affect the efficiency of an abstract metric (or similarity measure) on character strings, that is, one that does not use a specific alphabet, language or data? In searching for a transparent answer to this question, one can confine oneself to well-known algorithms with widely used, well-debugged executable code and a clearly described model that does not require empirical selection of parameters.

Evaluations usually rely (for example, in Figures 3-6 of [8]) on the completeness and quality of search results, which are monotonically related through the organization of queries. However, a scalar characteristic is more convenient than a vector of two dependent characteristics. A simple and clear scalar measure of the (in)efficiency of a proximity metric is the percentage of mistakenly selected translations, defined as the average proportion of translation fragments that the metric under test places closer than the correct translation fragment.

Stable quality assessment of strings similarity algorithms

Table 1. Parallel texts used

Author            Title and languages                                                Number of paragraphs   Paragraph size
Edgar Allan Poe   The Fall of the House of Usher (en, hu, es, it, fr, de, eo)        7 × 269                158 ± 211
Mark Twain        Tom Sawyer (en, de, hu, nl, ca)                                    5 × 4140               102 ± 135
Lewis Carroll     Alice in Wonderland (en, hu, es, it, pt, fr, de, eo, fi)           9 × 805                174 ± 245

This error, written $E(m)$ for a metric $m$ as in formula (1), satisfies $0 \le E(m) \le 100$: the ideal value is 0, the value 50 corresponds to random guessing, and $E(m) > 50$ indicates an inadequate metric.

The study used the three books described in Table 1, in English (en), Hungarian (hu), Spanish (es), Italian (it), Catalan (ca), German (de), Portuguese (pt), Finnish (fi), French (fr) and Esperanto (eo).

2. Compared metrics

The tests covered the well-known metrics included in the widely used R package stringdist. For clarity in discussing the results, we briefly recall the compared metrics.

lcs(x, y) is the minimum total number of deletions and insertions needed to transform one string into the other. It is the metric normalization of the length LCS(x, y) of the longest common subsequence, given by the formula lcs(x, y) = l(x) + l(y) − 2 LCS(x, y), where l is the length of a string.

lv(x, y) is the classical Levenshtein metric, which counts the total number of substitutions, deletions and insertions needed to transform one string into the other.

dl(x, y) is the Damerau-Levenshtein metric, which additionally counts transpositions of adjacent characters.

osa(x, y) (optimal string alignment) is a restricted variant of the Damerau-Levenshtein metric in which each substring may be edited only once.

jw(x, y) (Jaro metric) is not a metric in the strict mathematical sense of a distance between strings; it takes the matches, transpositions and positions of characters into account in a more sophisticated way.

jwp(x, y) (Jaro-Winkler metric) is Winkler's correction of the Jaro metric with the correction parameter p = 0.1.

qgram1(x, y) is the number of differing characters, counting repetitions, that is, the sum over all letters $s_j \in \{s_1, \dots, s_n\}$ of the alphabet of the expression $|X_j - Y_j|$, where $X$ and $Y$ are the vectors of the numbers of occurrences of all characters of the alphabet in each of the compared strings.

cosine1(x, y) is calculated using the formula $1 - \frac{X \cdot Y}{\|X\|\,\|Y\|}$.

qgram2(x, y) is the number of differing digrams (pairs of consecutive characters), counting repetitions.

cosine2(x, y) is calculated by the same formula as cosine1, applied to digrams.

qgram3(x, y) is the number of differing trigrams (triples of consecutive characters), counting repetitions.

cosine3(x, y) is calculated using the analogous formula for trigrams.

A detailed description of these metrics is provided in [9] with links to sources.
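The definitions of lcs, qgram and cosine above can be illustrated with a minimal Python sketch. It is not the stringdist implementation used in the paper; the function names are hypothetical, and the value returned by the cosine distance for an empty q-gram profile is an assumption made only to keep the example well defined.

```python
from collections import Counter
from math import sqrt


def lcs_length(x: str, y: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_distance(x: str, y: str) -> int:
    """lcs(x, y) = l(x) + l(y) - 2 LCS(x, y)."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


def qgram_profile(s: str, q: int) -> Counter:
    """Occurrence counts of all substrings of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x: str, y: str, q: int) -> int:
    """Sum of |X_j - Y_j| over all q-grams occurring in either string."""
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    return sum(abs(px[g] - py[g]) for g in px.keys() | py.keys())


def cosine_distance(x: str, y: str, q: int) -> float:
    """1 - (X . Y) / (|X| |Y|) on q-gram count vectors."""
    px, py = qgram_profile(x, q), qgram_profile(y, q)
    dot = sum(px[g] * py[g] for g in px.keys() & py.keys())
    norm = sqrt(sum(v * v for v in px.values())) * sqrt(sum(v * v for v in py.values()))
    # Assumption: strings too short to contain any q-gram are treated as maximally distant.
    return 1.0 - dot / norm if norm else 1.0
```

With these helpers, `qgram_distance(x, y, 1)`, `qgram_distance(x, y, 2)` and `qgram_distance(x, y, 3)` correspond to qgram1, qgram2 and qgram3, and likewise for the cosine variants.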

Additionally, experimentally selected normalizations of the NCS/OCS similarity metrics were considered; these metrics, proposed and investigated in [10-12], are promoted by the author as a more effective alternative to LCS. Briefly, NCS is the maximum possible number of different common substrings in a common subsequence of symbols. It is bounded by the value $\varphi(n) = \frac{n(n+1)}{2}$, attained for a string and its substring of length $n$, and

$$\mathrm{OCS}(x, y) = \varphi^{-1}(\mathrm{NCS}(x, y)) = \frac{\sqrt{8\,\mathrm{NCS}(x, y) + 1} - 1}{2}$$

is an LCS-like normalization of NCS. Similarity metrics are directed opposite to distance metrics [13, 14], so differently defined normalizations are used to turn them into distance metrics. During the experiments, simple and efficient functions were singled out for using these similarity metrics as distance metrics that determine the order of the pairs:

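$$\mathrm{NCS1}(x, y) = \frac{l(x) + l(y) - 3\,\mathrm{NCS}(x, y)}{[\ldots]}, \qquad \mathrm{NCS2}(x, y) = 1 - \frac{\mathrm{NCS}(x, y)}{l(x) + l(y)},$$

$$\mathrm{OCS1}(x, y) = l(x) + l(y) - 2\,\mathrm{OCS}(x, y), \qquad \mathrm{OCS2}(x, y) = \frac{l(x) + l(y) - 2\,\mathrm{OCS}(x, y)}{[\ldots]},$$

where $[\ldots]$ marks a denominator that is illegible in this copy.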

The graphs prepared for comparison also include the length difference LENGTH(x, y) = |l(x) − l(y)| as a simple distance function, and AVERAGE, the average of all metrics. Like several of the stringdist package metrics, none of these functions except OCS1 is a metric in the strict sense of the word, but with a slight complication (the construction from the section "Basic definitions" in [15]) each can be replaced by a metric in the strict sense that defines the same order relation on pairs.
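The relation between NCS, OCS and the distance-like functions OCS1 and LENGTH can be sketched as follows. This is only an illustration under a stated assumption: the bound on NCS is taken to be n(n + 1)/2, the number of substrings of a string of length n, and OCS is its inverse applied to NCS. Computing NCS itself is not shown (see [10-12] and [16]); all function names are hypothetical.

```python
from math import sqrt


def phi(n: float) -> float:
    """Assumed bound on NCS: number of substrings of a string of length n."""
    return n * (n + 1) / 2


def phi_inverse(v: float) -> float:
    """Non-negative root n of n(n + 1)/2 = v."""
    return (sqrt(8 * v + 1) - 1) / 2


def ocs_from_ncs(ncs: float) -> float:
    """OCS as the LCS-like normalization of a given NCS value."""
    return phi_inverse(ncs)


def ocs1(len_x: int, len_y: int, ncs: float) -> float:
    """OCS1(x, y) = l(x) + l(y) - 2 OCS(x, y), from the lengths and the NCS value."""
    return len_x + len_y - 2 * ocs_from_ncs(ncs)


def length_distance(x: str, y: str) -> int:
    """LENGTH(x, y) = |l(x) - l(y)|."""
    return abs(len(x) - len(y))
```

For a string of length 4 compared with itself, the assumed bound gives NCS = 10, hence OCS = 4 and OCS1 = 0, mirroring lcs(x, x) = 0.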

For the calculations, in addition to the stringdist metrics in question, we used the C code published in [16], called from Perl via XS. A Perl script was used for the main processing. An archive with the scripts and the main processing results is attached to the article file.

3. Setup and results of the first experiment

Since not all metric calculation procedures support UTF-8, diacritics had to be transliterated. For this purpose, the Perl packages Text::Unaccent and Text::Unidecode were used in the procedure sub{unac_string('utf8', unidecode(lc $_[0]))}, after which all non-ASCII characters were removed from the lines.
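A rough Python counterpart of this preprocessing step is sketched below. It is not the Perl procedure from the paper: Unicode NFKD decomposition from the standard library is swapped in for Text::Unaccent and Text::Unidecode, so characters without a decomposition (for example "ß", which Text::Unidecode transliterates to "ss") are simply dropped here. The function name is hypothetical.

```python
import unicodedata


def normalize_paragraph(text: str) -> str:
    """Lowercase, strip diacritics via NFKD decomposition, then drop non-ASCII characters."""
    lowered = text.lower()
    decomposed = unicodedata.normalize("NFKD", lowered)
    # Combining marks and any remaining non-ASCII characters are removed.
    return "".join(ch for ch in decomposed if ord(ch) < 128)
```

For instance, `normalize_paragraph("Árvíztűrő")` yields `"arvizturo"`, which is the kind of ASCII-only input the tested metric implementations expect.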

The script obtains information about the languages on behalf of the user. The calculated values are recorded in a separate file with labels and languages. Immediate Bzip2 compression reduced the amount of recorded metric information about threefold (to at most 14 GB). The books used make up less than 3% of the available texts. Processing more is hindered by the quadratic computational complexity of the problem. In particular, for the book "The Three Musketeers" the distance matrix cannot be computed at all on a 64-bit computer.

In the event of a computer freeze or an unexpected power outage (calculating the metrics on a PC with a four-core processor and 16 GB of RAM took several days), this organization allowed the calculations to resume from the point at which the archive was last written. Reusing the calculated metric values saved time in the experiments on selecting suitable normalizations of the NCS and OCS metrics.

Table 2. Values of errors of metrics in group (1) ({de, en}, {es, fr}, {es, it}, {fr, it})

metric    Fall             Tom              Alice            total
OCS2      1.6% ± 1.7%      4.4% ± 0.9%      4.1% ± 0.7%      3.1% ± 1.8%
NCS2      2.2% ± 2.6%      4.6% ± 0.2%      4.1% ± 0.6%      3.3% ± 2.0%
NCS1      4.5% ± 5.6%      8.1% ± 0.9%      7.0% ± 1.5%      6.0% ± 4.1%
qgram1    4.9% ± 2.9%      9.8% ± 2.5%      8.9% ± 2.1%      7.2% ± 3.3%
jwp       5.6% ± 3.4%      7.4% ± 0.3%      9.3% ± 1.3%      7.4% ± 3.0%
jw        5.6% ± 3.7%      8.4% ± 0.9%      9.1% ± 1.4%      7.5% ± 3.2%
LENGTH    6.8% ± 1.2%      11.9% ± 0.9%     11.2% ± 1.4%     9.3% ± 2.6%
dl        6.4% ± 8.1%      17.1% ± 7.9%     13.3% ± 6.5%     10.7% ± 8.4%
osa       6.5% ± 8.1%      17.2% ± 7.9%     13.3% ± 6.6%     10.7% ± 8.4%
lv        6.5% ± 8.2%      17.3% ± 8.0%     13.5% ± 6.6%     10.8% ± 8.5%
cosine3   10.3% ± 10.2%    17.3% ± 0.7%     17.3% ± 4.4%     14.2% ± 8.2%
AVERAGE   13.8% ± 6.7%     21.8% ± 1.5%     19.8% ± 2.7%     17.4% ± 5.9%
cosine2   16.6% ± 10.4%    21.3% ± 1.1%     24.2% ± 4.7%     20.5% ± 8.4%
cosine1   25.7% ± 7.3%     29.5% ± 2.0%     33.1% ± 5.5%     29.4% ± 7.0%
qgram2    20.0% ± 15.6%    44.1% ± 1.3%     31.1% ± 10.0%    27.6% ± 14.6%
lcs       18.8% ± 16.1%    41.8% ± 2.2%     36.4% ± 5.8%     29.2% ± 14.8%
qgram3    38.7% ± 6.0%     49.4% ± 0.2%     44.6% ± 2.4%     42.5% ± 5.7%
OCS1      47.2% ± 1.8%     51.3% ± 0.1%     48.0% ± 0.8%     48.0% ± 1.8%

The processing of each translation consisted in calculating the error of the metric

$$E(m) = \frac{\sum_{x \in X} \left|\{\, y \in Y : m(x, y) < m(x, y_x) \,\}\right|}{|X| \cdot |Y|} \cdot 100\%, \tag{1}$$

where $X$ and $Y$ are the sets of paragraphs of the parallel text in two different languages, $|X|$ and $|Y|$ are the cardinalities of these sets, $m$ is the metric under test, and $y_x$ is the translation of the paragraph $x$ in the set $Y$.
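The error defined in formula (1) can be computed directly by the quadratic double loop it describes. The sketch below is an illustration, not the author's Perl script; it assumes the denominator |X|·|Y| and assumes that the two paragraph lists are aligned so that `ys[i]` is the translation of `xs[i]`. The function name and the toy data are hypothetical.

```python
from typing import Callable, Sequence


def translation_error(xs: Sequence[str], ys: Sequence[str],
                      metric: Callable[[str, str], float]) -> float:
    """Percentage of candidate paragraphs that the metric places closer than the true translation."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and aligned")
    closer = 0
    for i, x in enumerate(xs):
        d_true = metric(x, ys[i])  # distance to the correct translation y_x
        closer += sum(1 for y in ys if metric(x, y) < d_true)
    return 100.0 * closer / (len(xs) * len(ys))


def length_distance(x: str, y: str) -> int:
    """LENGTH(x, y) = |l(x) - l(y)|, the simplest distance used in the paper."""
    return abs(len(x) - len(y))


# Toy data: three "paragraphs" and their aligned "translations".
xs = ["a", "bbbb", "cc"]
ys = ["xx", "yyyy", "z"]
error = translation_error(xs, ys, length_distance)
```

The double loop makes the quadratic cost in the number of paragraphs explicit, which is why only a small share of the available texts could be processed.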

The pairs of languages shared by the books were divided into four groups according to the proximity of the transliterated paragraphs:

(1) closest: {de, en}, {es, fr}, {es, it}, {fr, it};

(2) relatively close: {en, eo}, {en, es}, {en, fr}, {en, it}, {eo, es}, {eo, it};

(3) relatively far: {de, es}, {de, eo}, {de, fr}, {de, it}, {es, hu}, {hu, it};

(4) farthest: {de, hu}, {en, hu}, {eo, hu}, {fr, hu}.


Table 3. Values of errors of metrics in group (2) ({en, eo}, {en, es}, {en, fr}, {en, it}, {eo, es}, {eo, it})

metric    Fall             Alice            total
OCS2      1.6% ± 0.8%      5.8% ± 1.1%      3.7% ± 2.3%
NCS2      2.4% ± 0.8%      6.7% ± 0.9%      4.6% ± 2.4%
LENGTH    7.3% ± 1.4%      11.1% ± 1.4%     9.2% ± 2.4%
NCS1      5.2% ± 1.7%      12.5% ± 2.3%     8.8% ± 4.2%
qgram1    7.4% ± 1.8%      11.9% ± 2.5%     9.6% ± 3.1%
jw        8.7% ± 2.0%      12.4% ± 1.1%     10.5% ± 2.5%
jwp       9.0% ± 2.0%      12.4% ± 1.2%     10.7% ± 2.4%
dl        11.1% ± 6.1%     19.7% ± 6.3%     15.4% ± 7.6%
osa       11.1% ± 6.1%     19.8% ± 6.3%     15.5% ± 7.6%
lv        11.3% ± 6.1%     20.0% ± 6.3%     15.6% ± 7.6%
cosine3   12.3% ± 2.9%     21.9% ± 2.2%     17.1% ± 5.4%
AVERAGE   19.0% ± 1.7%     24.8% ± 1.5%     21.9% ± 3.3%
cosine2   22.4% ± 2.8%     31.7% ± 1.5%     27.0% ± 5.2%
cosine1   32.8% ± 1.9%     39.8% ± 1.8%     36.3% ± 3.9%
qgram2    36.5% ± 5.8%     42.0% ± 4.5%     39.2% ± 5.9%
lcs       39.4% ± 2.4%     44.6% ± 1.3%     42.0% ± 3.3%
qgram3    44.3% ± 1.6%     47.0% ± 0.8%     45.7% ± 1.8%
OCS1      48.5% ± 0.6%     48.9% ± 0.5%     48.7% ± 0.6%

The results of the experiment, given in Tables 2, 3, 4 and 5, show a highly stable ranking of the metrics by quality, almost independent both of the book and of the particular pair of languages within the group. The results are presented graphically in Figure 1: the error percentage is plotted vertically, and the language pairs are ordered from left to right in descending order of the average error.

The graphs show that the sharply increased spread of the metrics dl, lv and osa is closely related to the significant influence of the order of the languages in a pair and of the difference in paragraph lengths.

Surprisingly, the ranking of the metrics by quality appears almost unrelated to the complexity of the algorithms: the simplest algorithm, which computes the difference in paragraph lengths, turned out to be one of the best. This supports the hypothesis that the correct choice of the normalization of a metric is exceptionally important.

Table 4. Values of errors of metrics in group (3) ({de, es}, {de, eo}, {de, fr}, {de, it}, {es, hu}, {hu, it})

metric    Fall             Alice            total
OCS2      7.1% ± 1.1%      9.4% ± 2.2%      8.2% ± 2.1%
LENGTH    9.2% ± 1.0%      12.1% ± 2.4%     10.6% ± 2.4%
NCS2      12.2% ± 1.5%     12.0% ± 1.8%     12.1% ± 1.7%
qgram1    12.6% ± 3.6%     14.9% ± 3.9%     13.7% ± 3.9%
jw        16.0% ± 2.6%     16.5% ± 2.8%     16.2% ± 2.7%
jwp       16.4% ± 2.5%     16.4% ± 2.8%     16.4% ± 2.7%
NCS1      22.7% ± 3.0%     21.0% ± 2.7%     21.9% ± 3.0%
dl        24.6% ± 7.1%     25.5% ± 6.7%     25.1% ± 6.9%
osa       24.7% ± 7.1%     25.6% ± 6.7%     25.2% ± 6.9%
lv        24.9% ± 7.1%     25.8% ± 6.7%     25.3% ± 6.9%
AVERAGE   29.6% ± 1.4%     29.6% ± 1.8%     29.6% ± 1.6%
cosine3   35.1% ± 1.0%     32.6% ± 1.8%     33.9% ± 1.9%
cosine2   39.6% ± 1.0%     38.4% ± 2.2%     39.0% ± 1.8%
cosine1   41.7% ± 1.4%     43.7% ± 2.4%     42.7% ± 2.2%
qgram2    48.2% ± 0.6%     46.9% ± 0.9%     47.5% ± 1.0%
lcs       48.4% ± 0.7%     47.5% ± 0.6%     47.9% ± 0.8%
qgram3    49.8% ± 0.4%     48.7% ± 0.5%     49.2% ± 0.7%
OCS1      50.6% ± 0.3%     49.3% ± 0.4%     50.0% ± 0.8%

Table 5. Values of errors of metrics in group (4) ({de, hu}, {en, hu}, {eo, hu}, {fr, hu})

metric    Fall             Tom              Alice            total
OCS2      7.2% ± 1.8%      11.5% ± 0.8%     12.8% ± 1.7%     10.3% ± 3.0%
LENGTH    8.7% ± 2.2%      14.7% ± 1.2%     15.2% ± 2.0%     12.5% ± 3.6%
NCS2      13.6% ± 1.7%     17.5% ± 0.7%     15.6% ± 1.2%     15.2% ± 2.0%
qgram1    14.0% ± 5.2%     19.8% ± 2.8%     18.9% ± 5.1%     17.1% ± 5.4%
jw        18.5% ± 3.2%     21.0% ± 0.5%     20.8% ± 1.8%     19.9% ± 2.6%
jwp       19.3% ± 3.1%     21.2% ± 0.3%     20.9% ± 1.7%     20.3% ± 2.4%
NCS1      25.9% ± 3.5%     26.1% ± 0.9%     26.2% ± 2.4%     26.0% ± 2.7%
dl        26.0% ± 9.3%     29.5% ± 3.9%     28.4% ± 8.4%     27.6% ± 8.2%
osa       26.0% ± 9.3%     29.6% ± 3.9%     28.4% ± 8.4%     27.7% ± 8.2%
lv        26.2% ± 9.3%     29.7% ± 3.9%     28.5% ± 8.4%     27.8% ± 8.2%
AVERAGE   30.9% ± 1.6%     32.5% ± 0.7%     32.3% ± 1.7%     31.8% ± 1.7%
cosine3   35.9% ± 1.0%     31.0% ± 0.5%     35.5% ± 1.3%     34.8% ± 2.2%
cosine2   40.0% ± 1.3%     36.5% ± 0.6%     41.3% ± 0.6%     39.8% ± 2.0%
cosine1   42.1% ± 1.6%     40.9% ± 0.7%     45.8% ± 0.9%     43.4% ± 2.4%
qgram2    48.6% ± 0.7%     50.3% ± 0.5%     47.7% ± 0.7%     48.6% ± 1.2%
lcs       49.2% ± 0.5%     50.1% ± 0.5%     48.2% ± 0.6%     49.0% ± 0.9%
qgram3    50.3% ± 0.4%     51.7% ± 0.4%     48.8% ± 0.5%     50.0% ± 1.2%
OCS1      51.0% ± 0.2%     52.5% ± 0.4%     49.2% ± 0.5%     50.6% ± 1.3%

Figure 1. Percentage of errors in the binary choice of the correct paragraph translation in a multilingual book. Panels: (a) Edgar Allan Poe, The Fall of the House of Usher; (b) Mark Twain, Tom Sawyer; (c) Lewis Carroll, Alice in Wonderland. Each panel shows one curve per metric over the language pairs.

Table 6. Errors of metrics with arguments of equal length in group (1) of language pairs ({de, en}, {es, fr}, {es, it}, {fr, it})

metric                    Fall             Tom              Alice            total
lcs                       7.4% ± 10.0%     8.9% ± 0.2%      8.6% ± 2.3%      8.1% ± 6.9%
NCS1, NCS2, OCS1, OCS2    8.4% ± 10.3%     8.6% ± 0.1%      8.0% ± 2.2%      8.2% ± 7.0%
dl, lv, osa               8.7% ± 9.4%      9.3% ± 0.2%      9.7% ± 2.4%      9.2% ± 6.5%
qgram3                    8.5% ± 9.5%      13.1% ± 0.3%     11.3% ± 4.2%     10.3% ± 7.1%
AVERAGE                   11.7% ± 10.0%    13.6% ± 0.3%     13.9% ± 3.0%     12.9% ± 7.0%
qgram2                    11.5% ± 12.0%    14.4% ± 0.4%     13.3% ± 4.1%     12.6% ± 8.5%
cosine3                   10.4% ± 10.5%    17.0% ± 0.4%     14.9% ± 4.7%     13.2% ± 8.1%
jwp                       15.7% ± 8.2%     15.0% ± 0.4%     20.8% ± 3.0%     17.9% ± 6.4%
cosine2                   14.5% ± 11.6%    20.0% ± 0.4%     19.2% ± 4.6%     17.2% ± 8.7%
jw                        16.6% ± 9.0%     17.5% ± 0.4%     20.7% ± 3.4%     18.5% ± 6.7%
qgram1                    17.1% ± 10.4%    19.0% ± 0.8%     21.1% ± 3.8%     19.1% ± 7.6%
cosine1                   24.6% ± 9.1%     29.1% ± 1.1%     30.5% ± 4.6%     27.8% ± 7.4%

4. Experiment Eliminating the Effect of Normalization

To eliminate the effect of normalization, we modify formula (1) as follows:

$$E_{=}(m) = \frac{\sum_{x \in X} \left|\{\, y \in Y : m(x, y) < m(x, y_x) \ \&\ l(y) = l(y_x) \,\}\right|}{|X| \cdot |Y|} \cdot 100\%. \tag{2}$$

Strictly selecting the arguments of the metrics by equality of lengths naturally evens out the scatter of the results and dramatically changes the rating. Under these conditions, simple normalization formulas no longer have any effect and the quality of the complex calculations comes to the fore, see Table 6.
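The length-restricted error can be sketched by adding a candidate filter to the counting loop of formula (1). The sketch below is an assumption-laden illustration rather than the author's code: it keeps the same |X|·|Y| denominator as formula (1), and the `keep` predicate, the function names and the toy metric are hypothetical. The default predicate implements the condition l(y) = l(y_x).

```python
from typing import Callable, Sequence


def restricted_error(xs: Sequence[str], ys: Sequence[str],
                     metric: Callable[[str, str], float],
                     keep: Callable[[str, str], bool] = lambda y, y_true: len(y) == len(y_true),
                     ) -> float:
    """Error counting only candidates y accepted by keep(y, y_true); xs[i] and ys[i] are aligned."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and aligned")
    closer = 0
    for i, x in enumerate(xs):
        y_true = ys[i]
        d_true = metric(x, y_true)
        closer += sum(1 for y in ys if keep(y, y_true) and metric(x, y) < d_true)
    return 100.0 * closer / (len(xs) * len(ys))


def charset_distance(x: str, y: str) -> int:
    """Toy metric for the example: size of the symmetric difference of the character sets."""
    return len(set(x) ^ set(y))


xs = ["abc", "xyz", "ab"]
ys = ["abd", "abc", "ab"]
equal_length = restricted_error(xs, ys, charset_distance)
unrestricted = restricted_error(xs, ys, charset_distance, keep=lambda y, y_true: True)
```

Passing a different predicate, for example `lambda y, y_true: len(y) >= len(y_true)`, gives other length restrictions of the kind used for the remaining comparison situations.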

The graphs in Figure 2 also clearly show a sharply different order of the metrics. In particular, lcs and qgram3, two of the metrics with the largest errors in the first experiment, turn out to be the best after NCS/OCS.

This gives rise to the hypothesis that better normalizations can be selected. It is natural to expect that an optimal choice of normalization would substantially affect the rating in the general situation. For example, the NLCS normalization [17] of the LCS metric induces an order close to that of OCS2 and can join the leaders.

Figure 2. Errors of metrics with arguments of equal length. Panels: (a) Edgar Allan Poe, The Fall of the House of Usher; (b) Mark Twain, Tom Sawyer; (c) Lewis Carroll, Alice in Wonderland. Each panel shows one curve per metric over the language pairs.

Table 7. Errors of metrics with arguments of equal length in group (2) of language pairs ({en, eo}, {en, es}, {en, fr}, {en, it}, {eo, es}, {eo, it})

metric                    Fall             Alice            total
NCS1, NCS2, OCS1, OCS2    7.3% ± 3.0%      14.2% ± 1.4%     10.7% ± 4.2%
lcs                       8.0% ± 2.5%      16.0% ± 2.2%     12.0% ± 4.7%
qgram3                    9.1% ± 2.7%      16.4% ± 2.1%     12.8% ± 4.4%
dl, lv, osa               10.0% ± 3.9%     17.3% ± 2.4%     13.7% ± 4.9%
cosine3                   11.2% ± 3.0%     20.8% ± 2.2%     16.0% ± 5.5%
AVERAGE                   13.9% ± 2.6%     21.0% ± 1.6%     17.4% ± 4.2%
qgram2                    13.5% ± 3.1%     21.4% ± 1.6%     17.5% ± 4.7%
cosine2                   19.4% ± 3.8%     27.8% ± 1.9%     23.6% ± 5.1%
jwp                       21.9% ± 3.5%     28.2% ± 1.5%     25.0% ± 4.2%
jw                        22.3% ± 3.8%     28.8% ± 1.6%     25.6% ± 4.3%
qgram1                    23.8% ± 3.3%     28.9% ± 1.4%     26.3% ± 3.6%
cosine1                   32.1% ± 3.3%     37.0% ± 2.0%     3 6 % ± 3.7%

Table 8. Errors of metrics with arguments of equal length in group (3) of language pairs ({de, es}, {de, eo}, {de, fr}, {de, it}, {es, hu}, {hu, it})

metric                    Fall             Alice            total
NCS1, NCS2, OCS1, OCS2    29.8% ± 3.4%     24.3% ± 2.0%     27.1% ± 3.9%
lcs                       31.9% ± 3.3%     26.2% ± 2.7%     29.0% ± 4.1%
dl, lv, osa               32.9% ± 3.0%     28.2% ± 2.8%     30.6% ± 3.8%
qgram3                    34.9% ± 4.1%     29.0% ± 2.3%     32.0% ± 4.4%
AVERAGE                   34.4% ± 1.2%     31.0% ± 1.9%     32.7% ± 2.3%
qgram2                    36.6% ± 2.0%     32.7% ± 2.5%     34.6% ± 3.0%
cosine3                   36.7% ± 3.8%     32.9% ± 2.2%     34.8% ± 3.7%
jwp                       36.3% ± 2.4%     36.0% ± 2.3%     36.1% ± 2.4%
qgram1                    36.0% ± 3.1%     36.5% ± 3.2%     36.3% ± 3.2%
jw                        36.7% ± 2.6%     36.4% ± 2.3%     36.5% ± 2.5%
cosine2                   39.9% ± 3.1%     37.4% ± 2.1%     38.7% ± 2.9%
cosine1                   42.5% ± 2.4%     43.8% ± 2.3%     43.2% ± 2.5%

5. Other Comparison Situations

For other comparison situations, the results with other restrictions on the argument lengths of the metrics, attached to the article file, may be useful (see also Table 7 and Table 8). The choice of a suitable metric and normalization should obviously take into account the features of the specific task.

For example, Figure 3 presents graphs for the restriction that the lengths differ by at most 10%.

Figure 3. Errors of metrics when the argument lengths differ by at most 10%. Panels: (a) Edgar Allan Poe, The Fall of the House of Usher; (b) Mark Twain, Tom Sawyer; (c) Lewis Carroll, Alice in Wonderland. Each panel shows one curve per metric over the language pairs.

Table 9. Values of errors of metrics in group (4) of language pairs ({de, hu}, {en, hu}, {eo, hu}, {fr, hu})

metric                    Fall             Tom              Alice            total
NCS1, NCS2, OCS1, OCS2    32.8% ± 2.4%     29.5% ± 1.0%     30.4% ± 1.4%     31.2% ± 2.3%
lcs                       33.2% ± 2.4%     30.8% ± 1.3%     32.2% ± 1.8%     32.3% ± 2.2%
qgram3                    33.4% ± 3.3%     28.9% ± 0.6%     33.0% ± 1.5%     32.3% ± 2.9%
dl, lv, osa               34.3% ± 4.3%     32.6% ± 1.0%     33.5% ± 1.1%     33.6% ± 2.9%
cosine3                   35.1% ± 3.1%     31.9% ± 0.6%     35.6% ± 1.4%     34.7% ± 2.6%
AVERAGE                   36.3% ± 2.2%     33.6% ± 1.0%     35.6% ± 0.9%     35.5% ± 1.9%
qgram2                    37.0% ± 3.2%     34.7% ± 0.8%     37.8% ± 1.9%     36.8% ± 2.6%
qgram1                    39.6% ± 3.4%     37.4% ± 1.0%     39.9% ± 1.3%     39.3% ± 2.5%
jwp                       41.0% ± 2.3%     36.3% ± 1.3%     39.6% ± 1.4%     39.5% ± 2.5%
jw                        41.1% ± 2.4%     36.9% ± 1.2%     40.0% ± 1.4%     39.8% ± 2.4%
cosine2                   40.8% ± 3.5%     38.5% ± 0.8%     41.2% ± 1.0%     40.5% ± 2.5%
cosine1                   43.1% ± 3.4%     43.4% ± 1.1%     46.0% ± 0.8%     44.3% ± 2.7%

Figure 4 shows graphs with the restriction $l(y) \ge l(y_x)$ on the lengths, and Figure 5 shows graphs with the opposite restriction $l(y) \le l(y_x)$. A sharp difference is visible between practical problems in which, by their nature, the correct choice usually has close to the shortest length and those in which it has close to the greatest length.

Conclusion

Experiments have shown that the effectiveness of string similarity metrics depends critically on how well the chosen normalization of the algorithm matches the distribution of lengths in the data.

Difficult questions remain open:

• How to calculate the most effective formula for the normalization of a given metric from specific data?

• Will the calculated formulas give a significant gain for the metrics considered?

• How to calculate the appropriate normalization of a given metric from data statistics?

• How to estimate the adequacy of the normalization of a given metric by data statistics?

It seems reasonable to continue research in search of answers to these questions.

Figure 4. Metric errors when the correct answer is shorter (errors larger than 50% are not shown). Panels: (a) Edgar Allan Poe, The Fall of the House of Usher; (b) Mark Twain, Tom Sawyer; (c) Lewis Carroll, Alice in Wonderland. Each panel shows one curve per metric over the language pairs.

Figure 5. Metric errors when the correct answer is longer (errors larger than 50% are not shown). Panels: (a) Edgar Allan Poe, The Fall of the House of Usher; (b) Mark Twain, Tom Sawyer; (c) Lewis Carroll, Alice in Wonderland. Each panel shows one curve per metric over the language pairs.


References

[1] W. W. Cohen, P. Ravikumar, S. Fienberg. "A comparison of string distance metrics for name-matching tasks", IIWEB'03 Proceedings of the 2003 International Conference on Information Integration on the Web (August 09-10, 2003, Acapulco, Mexico), 2003, pp. 73-78.

[2] K. Branting. "A comparative evaluation of name-matching algorithms", ICAIL '03 Proceedings of the 9th international conference on Artificial intelligence and law (June 24-28, 2003, Scotland, United Kingdom), 2003, pp. 224-232.

[3] P. Christen. "A comparison of personal name matching: Techniques and practical issues", Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06) (December 18-22, 2006, Hong Kong, China), IEEE, New York, 2006, pp. 290-294.

[4] G. Recchia, M. Louwerse. "A comparison of string similarity measures for toponym matching", COMP '13 Proceedings of The First ACM SIGSPATIAL International Workshop on Computational Models of Place (November 05-08, 2013, Orlando FL, USA), 2013, pp. 54-61.

[5] N. Gali, R. Mariescu-Istodor, P. Fränti. "Similarity measures for title matching", 2016 23rd International Conference on Pattern Recognition (ICPR) (December 4-8, 2016, Cancún, México).

[6] Yufei Sun, Liangli Ma, Shuang Wang. "A comparative evaluation of string similarity metrics for ontology alignment", Journal of Information & Computational Science, 12:3 (2015), pp. 957-964.

[7] M. del Pilar Angeles, A. Espino Gamez. "Comparison of methods Hamming Distance, Jaro, and Monge-Elkan", DBKDA 2015: The Seventh International Conference on Advances in Databases, Knowledge, and Data Applications (May 24-29, 2015, Rome, Italy).

[8] C. Varol, C. Bayrak. "Hybrid matching algorithm for personal names", ACM Journal of Data and Information Quality, 3:4 (2012), 8.

[9] M.P.J. van der Loo. "The stringdist package for approximate string matching", R Journal, 6:1 (2014), pp. 111-122.

[10] S. V. Znamenskij. "Simple essential improvements to ROUGE-W algorithm", Journal of Siberian Federal University. Mathematics & Physics, 8:4 (2015), pp. 258-270.

[11] S. V. Znamenskij. "A belief framework for similarity evaluation of textual or structured data, similarity search and applications", Similarity Search and Applications, SISAP 2015, Lecture Notes in Computer Science, vol. 9371, eds. G. Amato, R. Connor, F. Falchi, C. Gennaro, 2015, pp. 138-149.

[12] S. V. Znamenskij. "A model and algorithm for sequence alignment", Program systems: theory and applications, 6:1 (2015), pp. 189-197.

[13] S. V. Znamenskij. "Models and axioms for similarity metrics", Program systems: theory and applications, 8:4(35) (2017), pp. 349-360 (in Russian).

[14] S. V. Znamenskij. "From similarity to distance: axiom set, monotonic transformations and metric determinacy", Journal of Siberian Federal University. Mathematics & Physics, 11:3 (2018), pp. 331-341.

[15] M. M. Deza, E. Deza. Encyclopedia of distances, Springer-Verlag, Berlin, 2009, 583 p.

[16] S. V. Znamenskij, V. A. Dyachenko. "An alternative model of the strings similarity", DAMDID/RCDL 2017 (Moscow, Russia, October 9-13, 2017), CEUR Workshop Proceedings, vol. 2022, Selected Papers of the XIX International Conference on Data Analytics and Management in Data Intensive Domains, eds. L. Kalinichenko, Y. Manolopoulos, N. Skvortsov, V. Sukhomlin, pp. 177-183 (in Russian).

[17] A. Islam, D. Inkpen. "Semantic text similarity using corpus-based word similarity and string similarity", ACM Transactions on Knowledge Discovery from Data, 2:2 (2008), 10, 25 p.

Received 17.04.2018

Revised 03.12.2018

Published 28.12.2018

Recommended by dr. Evgeny Kurshev

Sample citation of this publication:

Sergej Znamenskij. "Stable assessment of the quality of similarity algorithms of character strings and their normalizations". Program Systems: Theory and Applications, 2018, 9:4(39), pp. 561-578.

10.25209/2079-3316-2018-9-4-561-578 @ http://psta.psiras.ru/read/psta2018_4_561-578.pdf

About the author:

Sergej Vital'evich Znamenskij

Scientific interests migrated from functional analysis and complex analogues of convexity to the foundations of the development of collaborative software, similarity metrics and interpolation theories

e-mail: [email protected]

The same article in Russian: 10.25209/2079-3316-2018-9-4-593-610
