
Transactions of Karelian Research Centre RAS No. 7. 2018. P. 149-163 DOI: 10.17076/mat829


УДК 81.32

WSD ALGORITHM BASED ON A NEW METHOD OF VECTOR-WORD CONTEXTS PROXIMITY CALCULATION VIA ε-FILTRATION

A. N. Kirillov, N. B. Krizhanovskaya, A. A. Krizhanovsky

Institute of Applied Mathematical Research of the Karelian Research Centre of the Russian Academy of Sciences

The problem of word sense disambiguation (WSD) is considered in the article. Sets of synonyms (synsets) and sentences with these synonyms are taken. It is necessary to automatically select the meaning of the word in the sentence. 1285 sentences were tagged by experts, namely, one of the dictionary meanings was selected by experts for the target words. To solve the WSD problem, an algorithm based on a new method of vector-word contexts proximity calculation is proposed. A preliminary ε-filtering of words is performed, both in the sentence and in the set of synonyms, in order to achieve higher accuracy. An extensive program of experiments was carried out. Four algorithms are implemented, including the new algorithm. Experiments have shown that in some cases the new algorithm produces better results. The developed software and the tagged corpus have an open license and are available online. Wiktionary and Wikisource are used. A brief description of this work can be viewed as slides (https://goo.gl/9ak6Gt). A video lecture in Russian about this research is available online (https://youtu.be/-DLmRkepf58).

Keywords: synonym; synset; corpus linguistics; word2vec; Wikisource; WSD; RusVectores; Wiktionary.


Introduction

The problem of word sense disambiguation (WSD) is a real challenge to computer scientists and linguists. Lexical ambiguity is widespread and is one of the obstructions in natural language processing.

In our previous work "Calculated attributes of synonym sets" [6], we proposed a geometric approach to mathematical modelling of a synonym set (synset) using the word vector representation. Several geometric characteristics of the synset words were suggested (synset interior, synset word rank and centrality). They are used to select the most significant synset words, i.e. the words whose senses are the nearest to the sense of the synset.

This article continues the topic of polysemy, synonyms, filtering and WSD, and formulates the mathematical foundations for solving these problems of computational linguistics.

Using the approach proposed in the paper [2], we present a WSD algorithm based on a new context distance (proximity) calculated via ε-filtration. The experiments show the advantages of the proposed distance over the traditional measure based on the similarity of average vectors of contexts.

New ε-proximity between finite sets

It is quite evident that the context distance choice is one of the crucial factors influencing WSD algorithms. Here, in order to classify discrete structures, namely contexts, we propose a new approach to context proximity based on the Hausdorff metric and the symmetric difference of sets: A △ B = (A ∪ B) \ (A ∩ B).

Fig. 1. The set A △ B is the shaded part of the circles

Recall the notion of the Hausdorff metric. Consider a metric space (X, ρ), where X is a set and ρ is a metric on X. Define the ε-dilatation A + ε of a set A ⊂ X:

A + ε = ∪{B_ε(x) : x ∈ A},

where B_ε(x) is the closed ball centered at x with radius ε.

The Hausdorff distance ρ_H(A, B) between compact nonempty sets A and B is

ρ_H(A, B) = min{ε > 0 : (A ⊂ B + ε) ∧ (B ⊂ A + ε)},

where A + ε, B + ε are the ε-dilatations of A and B. Consider the following sets (Fig. 2):

A(ε) = A ∩ (B + ε),  B(ε) = B ∩ (A + ε).

Fig. 2. Two sets A + ε and B + ε are the ε-dilatations of the segments A and B; the two newly proposed set-valued maps A(ε) and B(ε) were inspired by the Hausdorff distance

Then

ρ_H(A, B) = min{ε > 0 : A(ε) ∪ B(ε) = A ∪ B}.

Consider two contexts W_1 = {w_11, …, w_1m}, W_2 = {w_21, …, w_2n}, where w_1i, w_2j are words of the contexts, i = 1, …, m, j = 1, …, n. Denote by V_1 = {v_11, …, v_1m}, V_2 = {v_21, …, v_2n} the sets of vectors v_1i, v_2j corresponding to the words w_1i, w_2j. Recall that in WSD procedures the distance between words is generally measured by the similarity function, which is the cosine of the angle between the vectors representing the words:

sim(v_1, v_2) = (v_1, v_2) / (‖v_1‖ ‖v_2‖),

where (v_1, v_2) is the scalar (inner) product of the vectors v_1, v_2, and ‖v_i‖ is the norm of a vector, i = 1, 2. In what follows, sim(v_1, v_2) ∈ [−1, 1]. Thus, the smaller the distance, the greater the similarity. Keeping the latter remark in mind, we introduce the following ε-proximity of vector contexts V_1, V_2. Given ε ≥ 0, construct the sets

C(V_1, V_2, ε) = {u, v : u ∈ V_1, v ∈ V_2, sim(u, v) ≥ ε},

D(V_1, V_2, ε) = (V_1 ∪ V_2) \ C(V_1, V_2, ε).

Supposing that sim plays the role of a metric, C(V_1, V_2, ε) is analogous to the expression A(ε) ∪ B(ε) in the definition of the Hausdorff distance.

Denote by |Y| the power of a set Y ⊂ X, and let R_+ = {x : x ≥ 0, x ∈ R}.

Definition 1. The K̄-proximity of contexts V_1, V_2 is the function

K̄(V_1, V_2, ε) = |C(V_1, V_2, ε)| / |V_1 ∪ V_2|.

It is clear that K̄(V_1, V_2, ε) ∈ [0, 1]. We also define the following function.

Definition 2. The K-proximity of contexts V_1, V_2 is the function

K(V_1, V_2, ε) = |C(V_1, V_2, ε)| / (1 + |D(V_1, V_2, ε)|),

describing the ratio of "near" and "distant" elements of the sets.

The definition implies that min K(V_1, V_2, ε) = 0 and max K(V_1, V_2, ε) = |V_1 ∪ V_2|. The presence of 1 in the denominator permits avoiding a zero denominator when |D(V_1, V_2, ε)| = 0.
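To make the two proximities concrete, a minimal Python sketch is given below. It assumes that contexts are passed as lists of NumPy vectors and that the words of the two contexts are indexed separately; the function names are illustrative and are not taken from wcorpus.py.

```python
# A sketch of Definitions 1 and 2; contexts are lists of NumPy vectors.
import numpy as np

def sim(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def near_and_distant(V1, V2, eps):
    """C(V1, V2, eps): words of either context with a "near" counterpart
    (cosine similarity >= eps) in the other context; D(V1, V2, eps): the rest."""
    near = set()
    for i, u in enumerate(V1):
        for j, v in enumerate(V2):
            if sim(u, v) >= eps:
                near.add(("V1", i))
                near.add(("V2", j))
    everything = {("V1", i) for i in range(len(V1))} | {("V2", j) for j in range(len(V2))}
    return near, everything - near

def k_bar_proximity(V1, V2, eps):
    """Definition 1: |C| / |V1 U V2|."""
    near, distant = near_and_distant(V1, V2, eps)
    return len(near) / (len(near) + len(distant))

def k_proximity(V1, V2, eps):
    """Definition 2: |C| / (1 + |D|)."""
    near, distant = near_and_distant(V1, V2, eps)
    return len(near) / (1 + len(distant))
```

Lowering ε moves elements from D to C, so both proximities grow; K̄ stays in [0, 1], while K can reach |V_1 ∪ V_2| when D is empty.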

The ubiquitous distance ρ between contexts V_1, V_2 is based on the similarity of average vectors: ρ(V_1, V_2) = sim(V̄_1, V̄_2), where V̄_1, V̄_2 are the average vectors of V_1, V_2. But the following example (Fig. 3) shows that ρ(V_1, V_2) may equal 1, that is, the similarity ρ may take its maximum value, even for two geometrically distant and not too similar structures.

Example

Consider the sets A = {a_1, a_2, a_3} and B = {b_1} pictured in Fig. 3, where a_1 + a_3 = 0 and a_2 = b_1. Then sim(Ā, B̄) = sim((a_1 + a_2 + a_3)/3, b_1) = sim(a_2, b_1) = 1, while K̄(A, B, ε) = 1/2 and K(A, B, ε) = 2/3.

The maximum similarity of the average vectors does not mean the coincidence of A and B, which are rather different (Fig. 3).

Fig. 3. An example of similar average vectors and totally different sets of vectors: {a_1, a_2, a_3} and {b_1}

Average algorithm with synonyms ε-filtration

Consider a sentence S_w = (w_1 … w* … w_n) containing a target word (denoted w*) and its vector representation S = (v_1 … v* … v_n), where w_j is a word, v_j is the vector representation of w_j, and v* is the vector of w*. Suppose the target word w* has l senses. Denote by syn_k^w the synset corresponding to the k-th sense, k = 1, …, l, syn_k^w = {w_k1, …, w_ki_k}, where w_kp are synonyms. Let syn_k = {v_k1, …, v_ki_k} be the set of vector representations of the synonyms w_kp, p = 1, …, i_k.

In what follows, we introduce a procedure of ε-filtration, the idea of which is borrowed from the paper [2].

The synset filtration is the formation of a so-called candidate set, which consists of those synonyms whose similarity with the words of the sentence is higher than a similarity threshold ε.

The first algorithm (Algorithm 1), described below, uses the average vector of the words of the sentence and the average vector of the candidate set of synonyms in each synset.

This algorithm contains the following lines.

Line 1. Calculate the average vector of the words of the sentence S:

S̄ = (1/n) Σ_{j=1}^{n} v_j.

Lines 3-6. Given ε > 0, construct the filtered set of synonyms for each synset:

cand_k(ε) = {u ∈ syn_k : u ≠ v*, sim(u, v*) > ε}.

Denote by s_k(ε) = |cand_k(ε)| the power of the set cand_k(ε).

Line 7. Calculate, for s_k(ε) > 0, the average vector of the synset candidates:

s̄yn_k(ε) = (1/s_k(ε)) Σ_{u ∈ cand_k(ε)} u.

If s_k(ε) = 0, then let s̄yn_k(ε) be equal to the zero vector.

Line 8. Calculate the similarity of the average vectors of the sentence and the k-th filtered synset:

sim_k(ε) = sim(s̄yn_k(ε), S̄).

Lines 10-11. Suppose max_{k=1,…,l}{sim_k(ε)} = sim_{k*}(ε), i.e. k* ∈ {1, …, l} is the index of the largest sim_k(ε). If k* is not unique, then take another ε > 0 and repeat the procedure from line 3.

Algorithm 1: Average algorithm with synonyms ε-filtration

Data: v* - vector of the target word w* with l senses (synsets);
      v_i ∈ S, S - sentence with the target word w*, v* ∈ S;
      {syn_k} - synsets of the target word, that is syn_k ∋ v*, k = 1, …, l.
Result: k* ∈ {1, …, l} - the number of the sense of the word w* in the sentence S.

1   S̄ = (1/n) Σ_{j=1}^{n} v_j, the average vector of the words of the sentence S
2   do
3       take ε > 0
4       foreach synset syn_k ∋ v* of the target word do
5           construct the filtered set cand_k(ε) of the synset syn_k:
                cand_k(ε) = {u ∈ syn_k : u ≠ v*, sim(u, v*) > ε}
6           s_k(ε) = |cand_k(ε)|, the number of synonym candidates
7           the average vector of the synset candidates:
                s̄yn_k(ε) = (1/s_k(ε)) Σ_{u ∈ cand_k(ε)} u if s_k(ε) > 0, and the zero vector if s_k(ε) = 0
8           the similarity of the average vectors of the sentence and the k-th filtered synset:
                sim_k(ε) = sim(s̄yn_k(ε), S̄)
9       end
10      sim_{k*}(ε) = max_{k=1,…,l}{sim_k(ε)} ⇒ k* ∈ {1, …, l}, k* is the index of the largest sim_k(ε)
11  while k* is not unique



Result: the target word w* has the sense corresponding to the k*-th synset syn_{k*}.

Remark: in the case ε = 0, we denote this algorithm as the A0-algorithm. In this case, the traditional averaging of similarities is used.

Note. The A0-algorithm was used in our experiments; it was implemented in Python.1
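For illustration, Algorithm 1 can be sketched in Python as follows. This is a simplified sketch, not the code of wcorpus.py: vectors are assumed to be NumPy arrays, the "k* is not unique" loop is omitted, and an empty candidate set is handled with a sentinel value instead of the zero vector.

```python
# A simplified sketch of Algorithm 1; not the code of wcorpus.py.
import numpy as np

def _cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_synset_by_average_similarity(v_target, sentence_vectors, synsets, eps=0.0):
    """Choose the synset whose eps-filtered average vector is most similar to
    the average vector of the sentence; eps = 0 corresponds to the A0-algorithm."""
    s_avg = np.mean(sentence_vectors, axis=0)           # line 1: average vector of the sentence
    similarities = []
    for syn in synsets:                                  # lines 3-8: one pass per synset
        # lines 3-6: synonyms whose similarity with the target word exceeds eps
        cand = [u for u in syn
                if not np.array_equal(u, v_target) and _cos(u, v_target) > eps]
        if cand:                                         # line 7: average vector of the candidates
            similarities.append(_cos(np.mean(cand, axis=0), s_avg))
        else:                                            # empty set: sentinel instead of the zero vector
            similarities.append(float("-inf"))
    return int(np.argmax(similarities))                  # lines 10-11: index of the best synset
```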

Fig. 4. Digest of the Wiktionary entry "служить" (sluzhit'): the first synset syn_1 contains работать (to work) and помогать (to help), the second synset syn_2 contains использоваться (to be used for) and предназначаться (to serve for). The mean vectors s̄yn_1 and s̄yn_2 of the synonym sets syn_1, syn_2 and the sentence S with this word w* are shown

A0-algorithm example

A simple example and Figures 4-6 will help to understand how the A0-algorithm works.

Take some dictionary word w_2 with several senses and several synonym sets (for example, syn_1 and syn_2) and a sentence S with this word (Fig. 4). The task is to select, via the A0-algorithm, the meaning (synset) of w_2 (that is, the target word is w* = w_2) used in the sentence S.

Let us match the input data and the symbols used in the A0-algorithm. The word "служить" (sluzhit') corresponds to the vector v_2.

1 See the function selectSynsetForSentenceByAverageSimilarity in the file https://github.com/componavt/wcorpus.py/blob/master/src/test_synset_for_sentence/lib_sfors/synset_selector.py


Fig. 5. Sample source data: (1) the vertices v_1 … v_5 corresponding to the words of the sentence S (the vertex v_2 is excluded, since it corresponds to the target word w_2); (2) the target word with two synsets syn_1 and syn_2 (Fig. 4); (3) the vertices (vectors corresponding to words) of the first synset {v^1_{syn_1}, v^2_{syn_1}} and of the second synset {v^1_{syn_2}, v^2_{syn_2}}

Fig. 6. The similarity between the mean vector of the sentence and the first synonym set is lower than the similarity with the second synset, that is, sim(s̄yn_1, S̄) < sim(s̄yn_2, S̄). Thus, the second sense of the target word (the second synset syn_2) will be selected in the sentence S by the A0-algorithm

There is a dictionary article about this word in the Wiktionary, see Fig. 4 (a parsed database of Wiktionary is used in our projects).2

Two synonym sets of this Wiktionary entry are denoted by syn_1 and syn_2.

The mean values of the vectors corresponding to the synonyms in these synsets are denoted by s̄yn_1 and s̄yn_2, and S̄ is the mean vector of all vectors corresponding to the words in the sentence S containing the word "служить" (sluzhit').

Average algorithm with sentence and synonyms ε-filtration (Aε)

Algorithm 2 is a modification of Algorithm 1: filtration of the sentence is added to the synset filtration. Namely, we select the words of the sentence whose similarity with at least one synonym from the synset is higher than the similarity threshold ε. Then we average the set of selected words, which forms the set of candidates from the sentence. Let us explain Algorithm 2 line by line.

Lines 2-5. Given ε > 0, construct the set of words of the sentence S filtered by the synonyms of the k-th synset syn_k:

cand_kS(ε) = {v ∈ S : ∃u ∈ syn_k, sim(v, u) > ε, v ≠ v*, u ≠ v*}.

Denote by S_k(ε) = |cand_kS(ε)| the power of the set cand_kS(ε).

Line 6. Calculate the average vector of the words of the filtered sentence:

c̄and_kS(ε) = (1/S_k(ε)) Σ_{v ∈ cand_kS(ε)} v.

If S_k(ε) = 0, then let c̄and_kS(ε) be equal to the zero vector.

Lines 7-8. Construct the filtered sets of synonyms:

cand syn_k(ε) = {u ∈ syn_k : ∃v ∈ S, sim(u, v) > ε, u ≠ v*, v ≠ v*}.

Denote by s_k(ε) = |cand syn_k(ε)| the power of the k-th filtered synonym set.

Line 9. Calculate, for s_k(ε) > 0, the average vector of the k-th synset of candidates:

c̄and syn_k(ε) = (1/s_k(ε)) Σ_{u ∈ cand syn_k(ε)} u.

If s_k(ε) = 0, then c̄and syn_k(ε) equals the zero vector.

Line 10. Calculate the similarity of the average vectors of the filtered sentence and the k-th filtered synset:

sim_k(ε) = sim(c̄and_kS(ε), c̄and syn_k(ε)).

Lines 12-13. Suppose max_{k=1,…,l}{sim_k(ε)} = sim_{k*}(ε), i.e. k* ∈ {1, …, l} is the index of the largest sim_k(ε). If k* is not unique, then take another ε > 0 and repeat the procedure from line 2.

Result: the target word w* in the sentence S has the sense corresponding to the k*-th synset syn_{k*}.

This algorithm was implemented in Python.3

2 See section "Web of tools and resources" on page 156.

3 See the function selectSynsetForSentenceByAverageSimilarityModified in the file https://github.com/componavt/wcorpus.py/blob/master/src/test_synset_for_sentence/lib_sfors/synset_selector.py


Algorithm 2: Average algorithm with sentence and synonyms ε-filtration (Aε)

Data: v* - vector of the target word w* with l senses (synsets);
      v_i ∈ S, S - sentence with the target word w*, v* ∈ S;
      {syn_k} - synsets of the target word, that is syn_k ∋ v*, k = 1, …, l.
Result: k* ∈ {1, …, l} - the number of the sense of the word w* in the sentence S.

1   do
2       take ε > 0
3       foreach synset syn_k ∋ v* of the target word do
4           construct the set of words of the sentence S filtered by the synonyms of the k-th synset syn_k:
                cand_kS(ε) = {v ∈ S : ∃u ∈ syn_k, sim(v, u) > ε, v ≠ v*, u ≠ v*}
5           S_k(ε) = |cand_kS(ε)|, the number of sentence candidates
6           the average vector of the sentence candidates:
                c̄and_kS(ε) = (1/S_k(ε)) Σ_{v ∈ cand_kS(ε)} v if S_k(ε) > 0, and the zero vector if S_k(ε) = 0
7           ε-filtration of the synset syn_k by the sentence S:
                cand syn_k(ε) = {u ∈ syn_k : ∃v ∈ S, sim(u, v) > ε, u ≠ v*, v ≠ v*}
8           s_k(ε) = |cand syn_k(ε)|, the number of synonym candidates
9           the average vector of the synset candidates:
                c̄and syn_k(ε) = (1/s_k(ε)) Σ_{u ∈ cand syn_k(ε)} u if s_k(ε) > 0, and the zero vector if s_k(ε) = 0
10          the similarity of the average vectors of the filtered sentence and the k-th filtered synset:
                sim_k(ε) = sim(c̄and_kS(ε), c̄and syn_k(ε))
11      end
12      sim_{k*}(ε) = max_{k=1,…,l}{sim_k(ε)} ⇒ k* ∈ {1, …, l}, k* is the index of the largest sim_k(ε)
13  while k* is not unique
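A corresponding Python sketch of Algorithm 2 is shown below, under the same assumptions as for the Algorithm 1 sketch (NumPy vectors, a single fixed ε, a sentinel instead of the zero vector); the names are illustrative.

```python
# A simplified sketch of Algorithm 2 (Aε); not the code of wcorpus.py.
import numpy as np

def _cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_synset_by_filtered_averages(v_target, sentence_vectors, synsets, eps):
    """Filter both the sentence and the synset by eps, then compare their averages."""
    sent = [v for v in sentence_vectors if not np.array_equal(v, v_target)]
    similarities = []
    for syn in synsets:
        synonyms = [u for u in syn if not np.array_equal(u, v_target)]
        # lines 4-5: words of the sentence that are near at least one synonym
        cand_sent = [v for v in sent if any(_cos(v, u) > eps for u in synonyms)]
        # lines 7-8: synonyms that are near at least one word of the sentence
        cand_syn = [u for u in synonyms if any(_cos(u, v) > eps for v in sent)]
        if cand_sent and cand_syn:                       # lines 6, 9, 10: averages and their similarity
            similarities.append(_cos(np.mean(cand_sent, axis=0), np.mean(cand_syn, axis=0)))
        else:                                            # empty candidate set: sentinel value
            similarities.append(float("-inf"))
    return int(np.argmax(similarities))                  # lines 12-13: index of the best synset
```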


K-algorithm based on ε-dilatation

Algorithm 3 (the K-algorithm) is based on the function K(A, B, ε) (see the previous section "New ε-proximity between finite sets" on page 150), where A = syn_k, that is, the k-th synset, and B = S, where S is a sentence. The algorithm includes the following steps.

Lines 2-4. Given ε > 0, construct the set C_k(ε) of "near" words of the k-th synset and the sentence S.

Line 5. Denote by D_k(ε) the set of "distant" words:

D_k(ε) = (S ∪ syn_k) \ C_k(ε).

Line 6. Calculate K_k(ε) as the ratio of "near" and "distant" elements of the sets:

K_k(ε) = |C_k(ε)| / (1 + |D_k(ε)|).

Lines 8-9. Suppose max_{k=1,…,l} K_k(ε) = K_{k*}(ε). If k* is not unique, then take another ε > 0 and repeat the procedure from line 2.

Algorithm 3: K-algorithm based on ε-dilatation

Data: v* - vector of the target word w* with l senses (synsets);
      v_i ∈ S, v* ∈ S;
      {syn_k} - synsets of v*, k = 1, …, l.
Result: k* ∈ {1, …, l} - the number of the sense of the word w* in the sentence S.

1   do
2       take ε > 0
3       foreach synset syn_k ∋ v* of the target word do
4           set of "near" words:
                C_k(ε) = {u, v : u ∈ syn_k, v ∈ S, sim(u, v) > ε}
5           set of "distant" words:
                D_k(ε) = (S ∪ syn_k) \ C_k(ε)
6           ratio of "near" and "distant" elements of the sets:
                K_k(ε) = |C_k(ε)| / (1 + |D_k(ε)|)
7       end
8       get the index of the largest ratio k*:
                K_{k*}(ε) = max_{k=1,…,l} K_k(ε)
9   while k* is not unique

Result: the target word w* has the sense corresponding to the k*-th synset syn_{k*}.

An example of constructing the C and D sets is presented in Fig. 7 and in the Table. It uses the same source data as for the A0-algorithm, see Fig. 5.

Remark. This algorithm is applicable to the K̄-function described in the previous section as well. This algorithm was implemented in Python.4
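A minimal Python sketch of the K-algorithm is given below, under the same assumptions as the previous sketches (NumPy vectors, a single fixed ε, illustrative names; the target word vector is excluded from the sentence, as in the example above).

```python
# A simplified sketch of the K-algorithm (Algorithm 3); not the code of wcorpus.py.
import numpy as np

def _cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_synset_by_alien_degree(v_target, sentence_vectors, synsets, eps):
    """Choose the synset maximising K_k(eps) = |C_k(eps)| / (1 + |D_k(eps)|)."""
    sent = [v for v in sentence_vectors if not np.array_equal(v, v_target)]
    scores = []
    for syn in synsets:
        near = set()
        for i, u in enumerate(syn):                      # lines 2-4: "near" words C_k(eps)
            for j, v in enumerate(sent):
                if _cos(u, v) > eps:
                    near.add(("syn", i))
                    near.add(("sent", j))
        distant = len(syn) + len(sent) - len(near)       # line 5: |D_k(eps)| = |S ∪ syn_k| - |C_k(eps)|
        scores.append(len(near) / (1 + distant))         # line 6: K_k(eps)
    return int(np.argmax(scores))                        # line 8: index of the largest ratio
```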

More details for this example (Fig. 7) are presented in the Table, which shows the C and D sets for different ε and the corresponding values of the K-function.

Bold type of word-vertices in the Table indicates new vertices. With each subsequent dilatation extension, i.e. with each subsequent ε, these new vertices are captured by the set of "near" vertices C and excluded from the set of "distant" vertices D. For example, in the transition from ε_1 to ε_2 the set D_2(ε_1) loses the vertex v_3, while the set C_2(ε_2) gains the same vertex v_3 in comparison with the set C_2(ε_1).

In Fig. 8, the function K_1(ε) shows the proximity of the sentence S and the synset syn_1, and the function K_2(ε) shows the proximity of S and the synset syn_2. It can be seen in Figure 8 that with decreasing ε the value of K_2(ε) grows faster than K_1(ε). Therefore, the sentence S is closer to the second synset syn_2. The same result can be seen in the previous Fig. 7.

Fig. 7. An example of the series of C_k(ε) (sets of words of the k-th synset which are near to the sentence S) in the K-algorithm based on ε-dilatation. The growing dilatation of the vertices of the second synset {v^1_{syn_2}, v^2_{syn_2}} captures the vertices of the sentence S = {v_1, v_3, v_4, v_5} faster than the dilatation of the vertices of the first synset. In other symbols: (syn_2 + ε) ∩ S ⊇ (syn_1 + ε) ∩ S. That is, according to the K-algorithm, the second sense of the word-vector v_2, represented by the synset syn_2, will be selected for the sentence S

Fig. 8. The left-continuous step functions K_1(ε), K_2(ε) show that the sentence S is closer to the second synset syn_2

4 See the function selectSynsetForSentenceByAlienDegree in the file https://github.com/componavt/wcorpus.py/blob/master/src/test_synset_for_sentence/lib_sfors/synset_selector.py

Table. An example of the K-algorithm treating the word w_2, which has two synsets syn_1, syn_2, and the sentence S, where w_2 ∈ S, see Fig. 4. The number of the algorithm iteration corresponds to the index of ε. Let the series of ε be ordered so that 1 = ε_0 > ε_1 > ε_2 > … > ε_7 = −1. It is known that |C_k ∪ D_k| = |(S ∪ syn_k) \ {v_2}| = 6, that is, the total number of words in the synset and in the sentence is constant.

K_k(ε) = |C_k(ε)| / (1 + |D_k(ε)|)

ε    | C_2(ε)                                       | D_2(ε)                                       | |C_2| | |D_2| | K_2(ε)
ε_0  | ∅                                            | v_1, v_3, v_4, v_5, v^1_{syn_2}, v^2_{syn_2} | 0     | 6     | 0
ε_1  | v_1, v^2_{syn_2}                             | v_3, v_4, v_5, v^1_{syn_2}                   | 2     | 4     | 2/5
ε_2  | v_1, v^2_{syn_2}, v_3                        | v_4, v_5, v^1_{syn_2}                        | 3     | 3     | 3/4
ε_3  | v_1, v^2_{syn_2}, v_3, v_4                   | v_5, v^1_{syn_2}                             | 4     | 2     | 4/3

ε    | C_1(ε)                                       | D_1(ε)                                       | |C_1| | |D_1| | K_1(ε)
ε_4  | v^2_{syn_1}, v_4                             | v^1_{syn_1}, v_1, v_3, v_5                   | 2     | 4     | 2/5

ε    | C_2(ε)                                       | D_2(ε)                                       | |C_2| | |D_2| | K_2(ε)
ε_5  | v_1, v_3, v_4, v_5, v^1_{syn_2}, v^2_{syn_2} | ∅                                            | 6     | 0     | 6

ε    | C_1(ε)                                       | D_1(ε)                                       | |C_1| | |D_1| | K_1(ε)
ε_6  | v^2_{syn_1}, v_4, v^1_{syn_1}                | v_1, v_3, v_5                                | 3     | 3     | 3/4

Experiments

Web of tools and resources

This section describes the resources used in our research, namely: Wikisource, Wiktionary, WCorpus and RusVectores.

The developed WCorpus5 system includes texts extracted from Wikisource and provides the user with a text corpus analysis tool. This system is based on the Laravel framework (PHP programming language). A MySQL database is used.6

Wikisource. The texts of Wikipedia have been used as a basis for several contemporary corpora [5], but we have found no mention of Wikisource texts being used in text processing. Wikisource is an open online digital library with texts in many languages. Wikisource sites contain 10 million texts7 in more than 38 languages.8 Russian Wikisource (the database dump as of February 2017) was used in our research.

Text parsing. The texts of Wikisource were parsed, analysed and stored in the WCorpus database. Let us describe this process in detail. The database dump containing all texts of Russian Wikisource was taken from the "Wikimedia Downloads" site.9 These Wikisource database files were imported into the local MySQL database titled "Wikisource Database" in Fig. 9, where "WCorpus Parser" is the set of WCorpus PHP scripts which analyse and parse the texts in the following three steps.

1. First, the title and the text of an article are extracted from the Wikisource database (560 thousand texts). One text corresponds to one page on the Wikisource site. It may be small (for example, several lines of a poem), medium (a chapter or a short story), or huge (e.g. the page with the novella "The Eternal Husband" written by Fyodor Dostoyevsky is 500 KB). Text preprocessing includes the following steps:

• Texts written in English and texts in pre-1918 Russian orthography were excluded (about 12 thousand texts).

• Service information (wiki markup, references, categories and so on) was removed from the text.

• Very short texts were excluded. As a result, 377 thousand texts were extracted.

• Splitting the texts into sentences produced 6 million sentences.

• The sentences were split into words (1.5 million unique words).

5https://github.com/componavt/wcorpus

6See WCorpus database scheme: https://github.com/componavt/wcorpus/blob/master/doc/workbench/db, scheme.png

7https://stats.wikimedia.org/wikisource/EN/TablesWikipediaZZ.htm 8https://stats.wikimedia.org/wikisource/EN/Sitemap.htm 9https://dumps.wikimedia.org/backup-index.html

Fig. 9. The architecture of WCorpus system and the use of other resources

2. Secondly, word forms were lemmatized using the phpMorphy10 program (0.9 million lemmas).

3. Lastly, lemmas, wordforms, sentences and relations between words and sentences were stored in the WCorpus database (Fig. 9).

In our previous work "Calculated attributes of synonym sets" [6] we also used neural network models of the RusVectores11 project, which is a kind of word2vec tool based on Russian texts [9].

Context similarity algorithms evaluation

In order to evaluate the proposed WSD algorithms, several words were selected from a dictionary, then sentences with these words were extracted from the corpus and tagged by experts.

Nine words

Only polysemous words which have at least two meanings with different sets of synonyms are suitable for our evaluation of WSD algorithms.

The following criteria for the selection of synonyms and sets of synonyms from Russian Wiktionary were used:

1. Only single-word synonyms are extracted from Wiktionary. This is due to the fact that the RusVectores neural network model "ruscorpora_2017_1_600_2" used in our research does not support multiword expressions.

2. If a word has meanings with equal sets of synonyms, then these sets were skipped because it is not possible to discern different meanings of the word using only these synonyms without additional information.
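These two criteria can be sketched in Python roughly as follows; the data layout (a dictionary from a sense identifier to its list of synonyms) and the function name are assumptions for illustration only.

```python
# A sketch of the two selection criteria; the dict layout {sense_id: [synonym, ...]}
# and the function name are illustrative.
def filter_synsets(senses):
    """Criterion 1: keep only single-word synonyms.
    Criterion 2: drop senses whose (filtered) synonym sets coincide with
    the synonym set of another sense of the same word."""
    single = {k: frozenset(s for s in syns if " " not in s) for k, syns in senses.items()}
    kept = {}
    for k, syns in single.items():
        duplicated = any(k2 != k and syns2 == syns for k2, syns2 in single.items())
        if syns and not duplicated:
            kept[k] = set(syns)
    return kept
```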

10 https://packagist.org/packages/componavt/phpmorphy

11http://rusvectores.org/en/

12http://whinger.krc.karelia.ru/soft/wikokit/index.html

13https://github.com/componavt/piwidict

14See information about the subcorpus in the section "Sentences of three Russian writers" on page 158.

A list of polysemous words was extracted from the parsed Russian Wiktionary12 using PHP API piwidict13 (Fig. 9).

Thus, 9 polysemous Russian words (present in the subcorpus14) were selected by experts from this Wiktionary list, namely: "бездна" (bezdna), "бросать" (brosat'), "видный" (vidnyy), "донести" (donesti), "доносить" (donosit'), "занятие" (zanyatiye), "лихой" (likhoy), "отсюда" (otsyuda), "удачно" (udachno). The tenth word, "служить" (sluzhit'), was left out of consideration, because 1259 of the 1308 sentences with this frequent word remain to be tagged by experts in the future (Fig. 10).

Fig. 10. Russian verb "служить" (sluzhit') has seven meanings and seven synsets in the developed system WCorpus. 49 sentences are already linked to relevant senses of this verb. 1259 sentences remain to be tagged by experts

Sentences of three Russian writers

The sentences containing the 9 previously defined words were to be selected from the corpus and tagged by experts. But the whole Wikisource corpus was too huge for this purpose, so a subcorpus of Wikisource texts was used in our research: the texts written by Fyodor Dostoevsky, Leo Tolstoy and Anton Chekhov.

Analysis of the created WCorpus database with the texts of the three writers shows that the subcorpus contains:15

• 2635 texts;

• 333 thousand sentences;

• 215 thousand wordforms;

• 76 thousand lemmas;

• 4.3 million wordform-sentence links.

Texts of this subcorpus contain 1285 sentences with these 9 words, and the 9 words have in total 42 synsets (senses). A graphical user interface (web form) of the WCorpus system was developed (Fig. 10), where experts selected one of the senses of the target word for each of the 1285 sentences.

This subcorpus database with tagged sentences and linked synsets is available online [7].

Text processing and calculations

These 1285 sentences were extracted from the corpus. The sentences were split into tokens, and wordforms were extracted. All the wordforms were lowercased and lemmatized. Therefore, a sentence is treated as a bag of words. Sentences with only one word were skipped.

The phpMorphy lemmatizer takes a wordform and yields possible lemmas with the corresponding part of speech (POS). Information on the POS of a word is needed to work with the RusVectores prediction neural network model "ruscorpora_2017_1_600_2", because to get a vector it is necessary to ask for a word together with its POS, for example "serve_VERB". Only nouns, verbs, adjectives and adverbs remain in the bag of words of a sentence; other words were skipped.
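A minimal sketch of this lookup with a RusVectores model loaded through gensim is given below; the model file name is an assumption, and only the key format "lemma_POS" (e.g. "служить_VERB") is taken from the description above.

```python
# A sketch of the vector lookup with a RusVectores model loaded via gensim;
# the file name is an assumption, the key format "lemma_POS" follows the text above.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("ruscorpora_2017_1_600_2.bin", binary=True)

def vectors_of_sentence(tagged_lemmas, model):
    """tagged_lemmas: e.g. [("служить", "VERB"), ("народ", "NOUN")].
    Returns the vectors of the lemmas known to the model; unknown lemmas are skipped."""
    vectors = []
    for lemma, pos in tagged_lemmas:
        key = f"{lemma}_{pos}"                 # RusVectores keys combine lemma and POS tag
        if key in model:
            vectors.append(model[key])
    return vectors
```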

The computer program (Python scripts) which works with the WCorpus database and RusVectores was written and presented in the form of the project wcorpus.py at GitHub.16 The source code in the file synset_selector.py17 implements the three algorithms described in the article, namely:

15 See SQL queries applied to the subcorpus: https://github.com/componavt/wcorpus/wiki/SQL

16 https://github.com/componavt/wcorpus.py

• A0-algorithm - the function selectSynsetForSentenceByAverageSimilarity();

• K-algorithm - the function selectSynsetForSentenceByAlienDegree();

• Aε-algorithm - the function selectSynsetForSentenceByAverageSimilarityModified().

These three algorithms calculated and selected one of the possible synsets for each of the 1285 sentences.

Two of the algorithms (K and Aε) have an input parameter ε; therefore, a loop over ε from 0 to 1 with a step of 0.01 was added, which resulted in 100 iterations for each sentence.

Then, answers generated by the algorithms were compared with the synsets selected by experts.
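This comparison can be sketched as the following evaluation loop; the data structures are illustrative, and `algorithm` stands for any of the ε-dependent functions sketched above.

```python
# A sketch of the evaluation loop: 100 values of eps per sentence; `algorithm`
# is any eps-dependent selection function, e.g. one of the sketches above.
import numpy as np

def evaluate(algorithm, tagged_sentences, step=0.01):
    """tagged_sentences: list of (v_target, sentence_vectors, synsets, expert_k).
    Returns the number of correctly disambiguated sentences for every eps."""
    correct_by_eps = {}
    for eps in np.arange(0.0, 1.0, step):
        correct = sum(1 for v_target, sent, synsets, expert_k in tagged_sentences
                      if algorithm(v_target, sent, synsets, eps) == expert_k)
        correct_by_eps[round(float(eps), 2)] = correct
    return correct_by_eps
```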

The number of sentences with the sense correctly tagged by the K-algorithm for the nine Russian words is presented in Fig. 11.

The legend of this figure lists target words with numbers in brackets (X, Y), where X is the number of sentences with these words, Y is the number of senses.

The curves for the words "ЗАНЯТИЕ" ("ZANYATIYE", solid line with star points) and "ОТСЮДА" ("OTSYUDA", solid line with triangle points) are quite high for some ε, because (1) there are many sentences with these words (352 and 308) in our subcorpus, and (2) these words have few meanings (3 and 2).


Fig. 11. The number of sentences with the correctly tagged sense for nine Russian words, by the K-algorithm

17 https://github.com/componavt/wcorpus.py/blob/master/src/test_synset_for_sentence/lib_sfors/synset_selector.py

Fig. 12. Normalised data with the fraction of sentences with correctly tagged sense for nine Russian words

More meanings, poorer results.

The more meanings a word has, the poorer the results the algorithm yields. This is visible in the normalised data (Fig. 12), where examples with good results are "ОТСЮДА" (OTSYUDA) and "ЛИХОЙ" (LIKHOY, dash-dot line with diamond points) with 2 meanings, while the example "БРОСАТЬ" (BROSAT', bold dotted line) with 9 meanings has the worst result (the lowest dotted curve).

Comparison of three algorithms

Let us compare the three algorithms by summing the results over all nine words. Fig. 13 contains the following curves: A0-algorithm - long dash line; K-algorithm - solid line; Aε-algorithm - dash line.

The A0-algorithm does not depend on ε. It showed mediocre results.

The K-algorithm yields better results than the Aε-algorithm when ε > 0.15.

The K-algorithm showed the best results on the interval [0.15; 0.35]. Namely, more than 700 sentences (out of 1285 human-tagged sentences) were properly tagged with the K-algorithm on this interval (Fig. 13).


Fig. 13. Comparison of the A0-algorithm, K-algorithm and Aε-algorithm


Comparison of four algorithms as applied to nine words

Let us compare the results of running the four algorithms for each word separately (Fig. 14): A0-algorithm - long dash line with triangle points; K-algorithm - solid line with square points; Aε-algorithm - dash line with circle points; "most frequent meaning" - dashed line with X marks.

The simple "most frequent meaning" algorithm was added to compare the results. This algorithm does not depend on the variable e, it selects the meaning (synset) that is the most frequent in our corpus of texts. In Fig. 14 this algorithm corresponds to a dashed line with X marks.

The results of the "most frequent meaning" algorithm and A0-algorithm are similar (Fig. 14).

The K-algorithm is the absolute champion in this competition, that is, for each word there exists an ε such that the K-algorithm outperforms the other algorithms (Fig. 14).

Let us explain the calculation of the curves in Fig. 14.

For the A0-algorithm and the "most frequent meaning" algorithm, the meaning (synset) is calculated for each of the nine words on the set of 1285 sentences. Thus, 1285 · 2 calculations were performed.

Again, the Aε-algorithm and the K-algorithm depend on the variable ε. But how can the results be shown without the ε axis? If at least one value of ε gives a positive result, then we suppose that the WSD problem was correctly solved for this sentence by the algorithm.

The value on the Y axis for the selected word (for the Aε-algorithm and the K-algorithm) in Fig. 14 is equal to the number of sentences correctly determined in this way (over different values of ε).
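In other words, a sentence is counted once if it is solved for at least one ε of the grid, which can be sketched as follows (same illustrative data structures as above):

```python
# A sketch of this aggregation: a sentence counts as solved if at least one eps
# of the grid yields the expert answer; data structures as in the sketches above.
import numpy as np

def solved_for_at_least_one_eps(algorithm, tagged_sentences, step=0.01):
    solved = 0
    for v_target, sent_vectors, synsets, expert_k in tagged_sentences:
        if any(algorithm(v_target, sent_vectors, synsets, eps) == expert_k
               for eps in np.arange(0.0, 1.0, step)):
            solved += 1
    return solved
```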

Fig. 14. Comparison of the A0-algorithm, K-algorithm, Aε-algorithm and the "most frequent meaning" baseline; the horizontal axis lists lemmas (with the number of sentences and the number of meanings) and their transliterations

Perhaps it would be more correct to fix the ε corresponding to the maximum number of correctly determined sentences. Then the result would not be so optimistic.

To show the complexity of comparing and evaluating ε-algorithms (that is, algorithms that depend on ε), let us try to analyze the results of the K-algorithm, shown in Fig. 15.

The proportion of the 1285 sentences for the 9 words correctly determined by the K-algorithm, where the variable ε changes from 0 to 1 in increments of 0.01, is presented in Fig. 15. Thus, 1285 · 100 calculations were performed.

These proportions are distributed over a set of possible calculated results from 0% (no sentence is guessed) to 100% (all sentences are guessed) for each of nine words.

Figure 15 does not show which ε values produce better or poorer results, although this can be seen in Figures 11-13. But the Figure does show the set and the quality of the results obtained with the help of the K-algorithm. For example, the word "лихой" (likhoy), with 22 sentences and 100 different values of ε, has only 8 different outcomes of the K-algorithm, seven of which lie in the region above 50%, that is, more than eleven sentences are guessed at almost any ε.

For example, the word "бросать" (brosat') has the largest number of meanings in our data set: it has 9 synonym sets in our dictionary and 11 meanings in the Russian Wiktionary.18 All possible results of the K-algorithm for this word are distributed in the range of 10-30%. The maximum share of guessed sentences is 30.61%. Note that this value is achieved when ε = 0.39, which is clearly shown in Figure 12 (see the thick dotted line).

All calculations, charts drawn from experimental data and results of the experiments are available online in Google Sheets [8].

Fig. 15. Proportions of correctly guessed sentences distributed over a set of possible calculated results

Conclusions

The study was supported by the Russian Foundation for Basic Research, grant No. 18-012-00117.

The development of the corpus analysis system WCorpus19 was started. 377 thousand texts were extracted from Russian Wikisource, processed and uploaded to this corpus.

Context-predictive models of the RusVectores project are used to calculate the distance between lemmas. Scripts in Python were developed to process RusVectores data; see the wcorpus.py project on GitHub.

A WSD algorithm based on a new method of vector-word context proximity calculation was proposed and implemented. Experiments have shown that in a number of cases the new algorithm produces better results.

Future work includes matching Russian lexical resources (Wiktionary, WCorpus) to Wikidata objects [11].

18 https://ru.wiktionary.org/wiki/бросать

19https://github.com/componavt/wcorpus

References

1. Arora S., Liang Y., Ma T. A simple but tough-to-beat baseline for sentence embeddings. In Proceedings of the ICLR, 2017. P. 1-16. URL: https://pdfs.semanticscholar.org/3fc9/7768dc0b36449ec377d6a4cad8827908d5b4.pdf (access date: 3.04.2018).

2. Chen X., Liu Z., Sun M. A unified model for word sense representation and disambiguation. In Proceedings of the EMNLP, 2014. P. 1025-1035. doi: 10.3115/v1/d14-1110. URL: http://www.aclweb.org/anthology/D14-1110 (access date: 3.04.2018).

3. Choi S. S., Cha S. H., Tappert C. C. A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics. 2010. Vol. 8, no. 1. P. 43-48. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.352.6123&rep=rep1&type=pdf (access date: 3.04.2018).

4. Haussler D. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. 1999. URL: https://www.soe.ucsc.edu/sites/default/files/technical-reports/UCSC-CRL-99-10.pdf (access date: 3.04.2018).

5. Jurczyk T., Deshmane A., Choi J. Analysis of Wikipedia-based corpora for question answering. arXiv preprint arXiv:1801.02073. 2018. URL: http://arxiv.org/abs/1801.02073 (access date: 3.04.2018).

6. Krizhanovsky A., Kirillov A. Calculated attributes of synonym sets. arXiv preprint arXiv:1803.01580. 2018. URL: http://arxiv.org/abs/1803.01580 (access date: 3.04.2018).

7. Krizhanovsky A., Kirillov A., Krizhanovskaya N. WCorpus mysql database with texts of 3 writers. figshare. 2018. URL: https://doi.org/10.6084/m9.figshare.5938150.v1 (access date: 3.04.2018).

8. Krizhanovsky A., Kirillov A., Krizhanovskaya N. Assign senses to sentences of 3 writers. Google Sheets. 2018. URL: http://bit.ly/2I14QIT (access date: 27.04.2018).


9. Kutuzov A., Kuzmenko E. Texts in, meaning out: neural language models in semantic similarity task for Russian. arXiv preprint arXiv:1504.08183. 2015. URL: https://arxiv.org/abs/1504.08183 (access date: 3.04.2018).

10. Lesot M.-J., Rifqi M., Benhadda H. Similarity measures for binary and numerical data: a survey. International Journal of Knowledge Engineering and Soft Data Paradigms. 2009. Vol. 1, no. 1. P. 63-84. doi: 10.1504/ijkesdp.2009.021985. URL: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf (access date: 3.04.2018).

11. Nielsen F. Linking ImageNet WordNet Synsets with Wikidata. In WWW '18 Companion: The 2018 Web Conference Companion. 2018. URL: https://arxiv.org/pdf/1803.04349.pdf (access date: 18.04.2018).

Received March 31, 2018

CONTRIBUTORS:

Kirillov, Alexander

Institute of Applied Mathematical Research,

Karelian Research Centre,

Russian Academy of Sciences

11 Pushkinskaya St., 185910 Petrozavodsk,

Karelia, Russia

e-mail: [email protected]

tel.: (8142) 766312

Krizhanovskaya, Natalia

Institute of Applied Mathematical Research,

Karelian Research Centre,

Russian Academy of Sciences

11 Pushkinskaya St., 185910 Petrozavodsk,

Karelia, Russia

e-mail: [email protected]

tel.: (8142) 766312

Krizhanovsky, Andrew

Institute of Applied Mathematical Research, Karelian Research Centre, Russian Academy of Sciences 11 Pushkinskaya St., 185910 Petrozavodsk, Karelia, Russia

e-mail: [email protected] tel.: (8142) 766312
