
The features of applying NLP, Information Retrieval, SEO and Web-mining technologies to determining stable word combinations when identifying keywords in the processing of Web resources are considered. Linguostatistical analysis of natural-language text uses the advantages of content monitoring based on NLP methods to identify stable word combinations. Quantitative analysis of stable word combinations is used to determine the degree of membership in the set of keywords. A method for determining stable word combinations when identifying keywords of Ukrainian-language content is proposed.

Keywords: stable word combination, NLP, Information Retrieval, SEO, Web-mining, statistical linguistic analysis, quantitative linguistics, rubrication


UDC 004.89

DOI: 10.15587/1729-4061.2018.126009

ANALYSIS OF STATISTICAL METHODS FOR STABLE COMBINATIONS DETERMINATION OF KEYWORDS IDENTIFICATION

V. Lytvyn, Doctor of Technical Sciences, Professor*, e-mail: vasyl.v.lytvyn@lpnu.ua
V. Vysotska, PhD, Associate Professor*, e-mail: victoria.a.vysotska@lpnu.ua
D. Uhryn, PhD, Associate Professor, Department of Information Systems, Chernivtsi Faculty of National Technical University «Kharkiv Polytechnic Institute», Holovna str., 203A, Chernivtsi, Ukraine, 58000, e-mail: ugrund38@gmail.com
M. Hrendus, Assistant*, e-mail: mhirnyak@ukr.net
O. Naum, Assistant, Department of Information Systems and Technologies, Drohobych Ivan Franko State Pedagogical University, I. Franko str., 24, Drohobych, Ukraine, 82100, e-mail: oleh.naum@gmail.com

*Department of Information Systems and Networks, Lviv Polytechnic National University, S. Bandery str., 12, Lviv, Ukraine, 79013

1. Introduction

In modern intellectual systems of a linguistic nature, it is important to determine stable word combinations effectively for identifying a set of keywords while processing Web resources [1]. An optimally appropriate set of stable word combinations is used in information retrieval (IR), SEO and Web-mining technologies, natural language processing (NLP) and automatic machine translation. It is also essential for identifying content from specific natural-language texts and rubrics, as well as for reflexive and automatic analysis of comments on published products. A new direction is the automatic processing of texts while integrating data from various sources of different fields, including Internet tourism [2]. Stable word combinations are used in algorithms for correct tokenization, compiling dictionaries (lexicography), automatic translation, learning foreign languages (міцний чай - strong tea vs. міцний сон - sound sleep), and distinguishing terminology [3].

Analysis of stable word combinations is used for identifying relevant content, indexing in IR, tokenization, content categorization, creating a search image of some content, and constructing thematic ontologies [4]. Usually, this work is the prerogative of the person who moderates the Web resources [5]. Automating the extraction of data or knowledge from natural-language content using NLP methods greatly reduces the time and the amount of Web resources needed to obtain the desired result [6]. The use of methods of artificial intelligence (AI) in linguistic processing of natural-language texts is usually effective only after qualitative

© V. Lytvyn, V. Vysotska, D. Uhryn, M. Hrendus, O. Naum, 2018

morphological and syntactic parsing of these texts [7]. For English texts, these tasks are easily solved with a simple parser and the Porter algorithm; for Slavic languages, including Ukrainian, this is not so easy [8]. Therefore, there arises the problem of choosing an optimal statistical method for determining stable word combinations for identifying keywords in the development of Ukrainian-language Web resources [9].

The use of knowledge engineering for effective NLP improves the quality of the results of research on texts [10]. This entails developing new NLP approaches and techniques, including the automatic determination of stable word combinations for identifying keywords when processing Web resources [11]. With the proliferation of Internet services and their introduction into the everyday life of every ordinary person, there appears redundancy of information as an IR result. This so-called informational noise negatively affects both Internet business and regular users of these services. The daily IR results are the following: Google > 8 billion pages, Yandex > 600 million pages, and 2.5 million sites [12]. Therefore, a qualitative and optimal definition of stable word combinations as a set of keywords in Ukrainian and English texts will significantly reduce the time for receiving relevant content search results in response to user queries.

2. Literature review and problem statement

Modern NLP methods are increasingly used not only in AI and computer linguistics but also in Internet environments, especially in the IR direction (Fig. 1). Today, IR implies not only the representation, storage, organization of and access to information elements. It also focuses on the needs of regular and potential users of on-line information, with an emphasis on finding important relevant content (and not just data, Fig. 2) [13].


Fig. 1. The topicality and perspective of automatic processing of texts


Fig. 2. The process of information search according to Baeza-Yates & Ribeiro-Neto (1999)

The main models and methods of IR are indexing, the Boolean model, the vector model and the evaluation of search quality [14]. The average complexities of a direct search (brute force, O(n+m)) and a more complex search (Boyer-Moore, O(n/m)) [14] were experimentally tested long ago. The effectiveness of the indexing method is directly proportional to the effectiveness of the process of creating a document or content search image (logical representation). This, in turn, affects the efficiency of presenting relevant information on the Internet [14].

The effectiveness of any NLP method depends directly on the quality of the prior processing of the text content. This, in turn, depends on extracting and/or receiving text (HTML, PDF...) [15]; coding and language [16]; breakdown into words and sentences (tokenization); elimination of stop words [17]; and stemming as determining the word form [18]. Tokenization is a process of demarcating and classifying sections of a series of input characters for the desired content:

- dates, numbers (13/03/2014, 1415);

- adverbs (Ukr. нарешті, зазвичай, відтоді, потім, наприклад);

- introductory words (іншими словами, в підсумку скажемо, між іншим);

- prepositions (напередодні, незважаючи на);

- particles (все ж таки, немов би, немов як, до того ж, ніби то як);

- verbose tokens (collocations, Улан-Уде, Нью-Йорк, Іван Іванович);

- boundaries of sentences ("І. І. Іванов приїхав в м. Львів минулої зими.").

The resulting tokens are then subjected to a different form of processing. The process is considered as a subtask for analysing input data [19].
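The token classes listed above can be sketched as a small regular-expression tokenizer. The patterns below are illustrative assumptions for this article, not the tool actually used by the authors: dates and numbers are tried before generic word forms, and hyphenated multiword tokens (Нью-Йорк, Улан-Уде) are kept whole.

```python
import re

# A minimal tokenizer sketch: date and number patterns are tried before
# generic word forms; hyphenated proper names (Нью-Йорк) stay whole.
TOKEN_RE = re.compile(
    r"\d{1,2}/\d{1,2}/\d{4}"          # dates such as 13/03/2014
    r"|\d+"                            # plain numbers such as 1415
    r"|[^\W\d_]+(?:-[^\W\d_]+)*"       # words, incl. hyphenated tokens
    r"|[.,!?;:]",                      # sentence punctuation
    re.UNICODE,
)

def tokenize(text):
    """Return the list of tokens found in `text`."""
    return TOKEN_RE.findall(text)

print(tokenize("І. І. Іванов приїхав в м. Львів 13/03/2014."))
```

A real system would add rules for sentence boundaries and multiword expressions; the sketch only shows how pattern ordering resolves class conflicts.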

Stop-words (or noise words) are words that do not carry a meaningful load, so their usefulness and role in searching are not significant. A text, in its turn, is an unstructured set of meaningful words ("bag of words"), where stop-words belong to the functional parts of speech, that is, prepositions, conjunctions, and particles (а, га, ай, ау, ах, ба, без, поблизу, брр, зась, ніби, б, бути, в, ви, ваш, поблизу, вглиб, до того ж, уздовж, адже, замість, поза, усередині, як, біля, навколо, геть, ...) [20].

An effective IR model depends heavily on the quality of: the presentation of text files and content [21]; the formulation of users' information needs (queries); and the estimation of the proximity between a query and a document [22].

The Boolean model of IR considers content as a set of words (terms) and a query as a Boolean expression: "(кішка OR пес) AND корм"; "птаха AND NOT військовий" [14]. Processing a query in this model is an operation on the sets that correspond to words (terms) (Table 1).

Table 1

An example of the Boolean model (keywords in articles found as to [23])

BM/Article      [24]  [25]  [26]  [27]  [28]  [29]
Content           1     1     0     0     0     1
NLP               1     1     0     1     0     0
Web-mining        1     1     0     1     1     1
IR                0     1     0     0     0     0
SEO               1     0     0     0     0     0
Web resources     1     0     1     1     1     1
Keywords          1     0     1     1     1     0

The advantages of the Boolean model are simplicity and convenience for those who are familiar with logical operators. The disadvantage is that this model is too "contrast-based" (in terms of both content submission and its relevance).
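The set-based evaluation of Boolean queries can be illustrated directly with the incidence data of Table 1; the `index` dictionary below simply restates the table rows, and a query is evaluated with ordinary set operations.

```python
# A toy Boolean retrieval sketch over the incidence data of Table 1:
# each keyword maps to the set of articles that contain it.
index = {
    "Content":       {24, 25, 29},
    "NLP":           {24, 25, 27},
    "Web-mining":    {24, 25, 27, 28, 29},
    "IR":            {25},
    "SEO":           {24},
    "Web resources": {24, 26, 27, 28, 29},
    "Keywords":      {24, 26, 27},
}

# "(Content OR NLP) AND Web-mining" as set algebra:
hits = (index["Content"] | index["NLP"]) & index["Web-mining"]
print(sorted(hits))

# "Web resources AND NOT SEO" is a set difference:
print(sorted(index["Web resources"] - index["SEO"]))
```

The "contrast" criticized in the text is visible here: an article either satisfies the expression or it does not; there is no ranking by degree of relevance.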

The Vector model presents IR content and a query as vectors in the space of words (terms), where a vector component is the weight of a word for the document (query). The model uses a measure of proximity (ranking): the cosine of the angle between the vectors (Fig. 3):

sim(d, q) = Σi di·qi / (|d|·|q|),

where di is the weight of term i in the content (frequency of use in the content/collection) and qi is the weight of term i in the query. A well-known example of the vector model is the TFxIDF approach (TF is term frequency, IDF is inverse document frequency). The basic version of TFxIDF [14] is

tfi = freqi / maxk freqk,   idfi = log(N/ni),   wi = tfi · idfi.

A modified version of TFxIDF [14] is

tfidfD(l) = b + (1 - b)·tfD(l)·idfD(l),

tfD(l) = freqD(l) / (freqD(l) + 0.5 + 1.5·(dl/avg_dl)),

idfD(l) = log((|c| + 0.5)/df(l)) / log(|c| + 1),

where avg_dl is the average length of a document, dl is the length of the given document, and |c| is the size of the collection.

Fig. 3. The vector pattern of IR

The advantage of the vector model is the effectiveness of processing primary static collections; it also allows partial matching. The disadvantage of the model is that it is easily attacked (spammed) and does not work well on short texts. Moreover, the Web is an uncontrolled collection (Fig. 4): large volumes of content, various formats, variety (languages, themes, etc.), high competition (spam), and the presence of clicks and links (PageRank). Therefore, the two previous IR models do not resolve the problem of the quality of searching for relevant content. The basis of IR quality assessment is the notion of relevance (compliance with the information need) of the content sought. It necessarily involves the following measures: precision (p = a/b), recall (r = a/c), and the F-measure (F = 2pr/(p + r)), where a is the number of relevant signs in the answer, b is the number of all signs in the answer, and c is the number of all relevant signs (Fig. 5).

Fig. 4. The method of general information search pool

Fig. 5. The 11-point graph of P/R information retrieval results

The main initiators of IR evaluation methods are (Fig. 4, 5) TREC (Text Retrieval Evaluation Conference, trec.nist.gov) and CLEF (Cross-Language Evaluation Forum, www.clef-campaign.org).
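The basic TFxIDF weighting and cosine ranking can be sketched as follows. The three "documents" below are invented toy examples for illustration, not data from the paper.

```python
import math
from collections import Counter

# A sketch of basic TFxIDF weighting and cosine ranking; the documents
# are invented toy examples.
docs = [
    "контент аналіз контент моніторинг",
    "статистичний аналіз тексту",
    "пошук ключових слів у тексті",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc):
    counts = Counter(doc)
    return counts[term] / max(counts.values())   # tf_i = freq_i / max_k freq_k

def idf(term):
    n = sum(term in doc for doc in tokenized)    # documents containing term
    return math.log(N / n) if n else 0.0         # idf_i = log(N / n_i)

def weights(terms, vocab):
    return [tf(t, terms) * idf(t) if t in terms else 0.0 for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vocab = sorted({t for doc in tokenized for t in doc})
query = "контент аналіз".split()
q_vec = weights(query, vocab)
scores = [cosine(weights(d, vocab), q_vec) for d in tokenized]
print(scores)  # document 0 should rank highest
```

Unlike the Boolean model, the second document gets a non-zero score through the shared term аналіз (partial matching), while a document sharing no terms with the query scores exactly zero.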

3. The aim and objectives of the study

The aim of the work is to analyse statistical methods for developing an optimal approach to determining stable word combinations in identifying keywords when developing and processing Ukrainian-language Web-resources based on the technology of computational linguistics.

To achieve this aim, the following tasks are set and done:

- to develop a method for determining stable word combinations while identifying keywords in Ukrainian-language texts based on the analysis of lexical speech coefficients in standard content fragmentation;

- to devise a formal approach to designing content monitoring software to determine stable word combinations when identifying keywords in Ukrainian texts based on Web Mining and NLP;

- to obtain and analyse the results of experimental testing of the proposed content-monitoring method for determining stable word combinations in identifying keywords in Ukrainian-language scientific texts on technical matter.

4. The method for determining stable word combinations when identifying keywords for text content

The method for determining stable word combinations consists of the following phases: morphological analysis (MA), syntactic analysis (SA), keyword selection, and stability analysis of word combinations from a multitude of keywords.


Fig. 6 lists the main steps in determining stable word combinations when identifying keywords for text content.

Fig. 6. A flowchart of linguistic analysis of Ukrainian-language texts to identify stable word combinations as keywords

Stage 1. The aim of the MA is to define the keyword equivalence classes in IR [30]. MA methods for identifying keywords are procedural, tabular, and statistical stemming, or their various combinations. One of the well-known MA algorithms is the Porter stemmer [31], the stemming algorithm published by Martin Porter in 1980. The original version of the stemmer was for English and written in BCPL. Stemming is the process of reducing a word to its stem by rejecting its auxiliary parts, such as an inflexion or suffix. For Ukrainian texts in MA, it is best to use combinations of approaches such as procedural, tabular and statistical stemming [32]. In the procedural approach of MA, the emphasis is placed on analysing words against dictionaries of stems and full-form dictionaries (FFDs). The MA algorithm then consists of three main stages: the search in the FFD, the selection of the stem, and the search for the stem in the dictionary. Examples of the tabular approach are вовка — вовк (masc., animate, sg., [gen.|acc.]); не — не (particle); годуй — годувати (imperf., imperat., sg.); в — в (preposition). An example of the model for a Ukrainian word change is лев masc. 1*b (animal); лев masc. 1*a (currency); стриже imperf. 8*b (-г-); гостьова femin. 4а (п). Most machine MA of the Ukrainian language is based on a tree or a finite state automaton (Fig. 7) [33].

The statistical stemmer is based on the probability of determining the stem of the word, for example: словниками — словник — словник-ами — ник-ами; сокирами — сокира — сокир-ами — ир-ами; літаючого — літати — літ-аючого — іт-аючого; літаючого — літаючий — літаюч-ого — юч-ого. The main rule is one vowel in the stem of the word. The types of words are determined by the forms of their inflexions (Fig. 8).

Features of the algorithm. The algorithm works with separate words, so the context in which the word is used is unknown. Other unavailable categories of linguistics are the word structure (root, suffix, etc.) and the part of speech (noun, adjective, etc.). The following techniques are currently used for analysing words:

- the ending is removed from the word: for example, removing the ending увати transfers the word критикувати into критик;

- the word has a stable ending: words with this ending are left unchanged, for example, ск and the invariable words блиск, тиск, обеліск, etc.;

- the word changes the ending: this rule applies to words in which certain letters drop out (ядро and ядер, where the ending ер changes to р) or change (чоловік and чоловіче, where к changes into ч);

- the word corresponds to a stable expression: this is an attempt to combine several rules into one complex; for example, the code contains expressions similar to (ов)*ува(в|вши|вшись|ла|ло|ли|ння|нні|нням|нню|ти|вся|всь|лись|лися|тись|тися);

- the word does not change during stemming, but it is an exception to the rules: it is necessary to maintain a dictionary of exception words, for example, віче, наче;

- the word changes during stemming, but it is also an exception: it is necessary to keep two forms of the word (original and stemmed) in the dictionary at once; for example, the word відер should change to відр, although other words ending in ер are not stemmed so (авіадиспетчер, вітер, гравер, etc.);

- short words remain unchanged: the functional parts of speech (prepositions, conjunctions, particles) are usually very short words (up to 2 letters inclusive) and are ignored by the algorithm.
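The rule types listed above can be sketched as a small rule-based stemmer. The word lists and endings below are tiny illustrative samples chosen from the examples in the text, not the full rule set of the project.

```python
# A sketch of the rule types above; the lists are illustrative samples.
STABLE_ENDINGS = ("ск",)                       # words kept unchanged: блиск, тиск
EXCEPTIONS = {"віче": "віче", "наче": "наче",  # words that must not be stemmed
              "відер": "відр"}                 # exception stored with both forms
ENDINGS = ("увати", "ами", "ого", "ою", "ий")  # longest endings first

def stem(word):
    if len(word) <= 2:                         # short function words: unchanged
        return word
    if word in EXCEPTIONS:                     # exception dictionary first
        return EXCEPTIONS[word]
    if word.endswith(STABLE_ENDINGS):          # stable endings: unchanged
        return word
    for ending in ENDINGS:                     # ending-removal rules
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: -len(ending)]
    return word

for w in ("критикувати", "блиск", "відер", "словниками", "в"):
    print(w, "->", stem(w))
```

The ordering matters: exceptions are checked before stable endings, and stable endings before removal rules, exactly mirroring the priority described in the list above.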


Fig. 7. Methods for storing MA results: a is a tree; b is a finite state automaton (FSA)


var $ADJECTIVE = '/(ими|ій|ий|а|е|ова|ове|ів|є|їй|єє|еє|я|ім|ем|им|їм|их|іх|ою|йми|іми|у|ю|ого|ому|ої)$/';
// http://uk.wikipedia.org/wiki/Прикметник + http://wapedia.mobi/uk/Прикметник
var $PARTICIPLE = '/(ий|ого|ому|им|ім|а|ій|у|ою|ій|і|их|йми|их)$/';
// http://uk.wikipedia.org/wiki/Дієприкметник
var $VERB = '/(сь|ся|ив|ать|ять|у|ю|ав|али|учи|ячи|вши|ши|е|ме|ати|яти|є)$/';
// http://uk.wikipedia.org/wiki/Дієслово
var $NOUN = '/(а|ев|ов|е|ями|ами|еи|и|ей|ой|ий|й|иям|ям|ием|ем|ам|ом|о|у|ах|иях|ях|ы|ь|ию|ью|ю|ия|ья|я|і|ові|ї|ею|єю|ою|є|еві|ем|єм|ів|їв|\'ю)$/';
// http://uk.wikipedia.org/wiki/Іменник

A static tree of endings whose total proportion is less than 1 %:

р (2,709)   ч (959)   г (636)   п (341)   щ (110)
н (2,531)   с (914)   з (581)   б (281)   ц (34)
д (1,038)   л (754)   ж (353)   ф (214)   г (4)

Fig. 8. Definition of the word type by the inflexion form

All these techniques are used for groups that generate and illustrate the rules of stemming. However, this greatly complicates the search algorithm for keywords. First, it is necessary to take into account the widespread endings (not the traditional inflexions as part of the word), that is, the sequences of letters in which words end. Tables 2, 3 contain endings of words from 1 to 4 letters in length. Endings of five or more letters are not given, as there are few such words (for 5, the maximum is йтесь (6,837); for 6, 4,656; etc.). This is a peculiar map for the stemming project. For the effectiveness of the search algorithm, it is necessary to construct a static tree of endings and to cover all branches of the tree [34]. The level of tree detail varies within 500-600 words with a common ending.
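The static tree of endings described above can be stored as a trie over reversed suffixes, so that matching walks the tree from the last letter of a word. The sketch below is a minimal illustration with three endings and frequencies taken from Table 2; the data structure is an assumption, not the project's actual implementation.

```python
from collections import defaultdict

# A sketch of the "static tree of endings": endings are inserted reversed,
# so matching walks the tree from the last letter of a word.
def make_node():
    return {"children": defaultdict(make_node), "count": None}

root = make_node()

def insert(ending, count):
    node = root
    for ch in reversed(ending):
        node = node["children"][ch]
    node["count"] = count

def longest_ending(word):
    """Return the longest stored ending that the word ends with."""
    node, best, suffix = root, "", ""
    for ch in reversed(word):
        if ch not in node["children"]:
            break
        node = node["children"][ch]
        suffix = ch + suffix
        if node["count"] is not None:
            best = suffix
    return best

for ending, count in [("я", 164062), ("ся", 148160), ("тися", 10379)]:
    insert(ending, count)

print(longest_ending("рубрикуватися"))
```

Because the walk follows one path of at most ending-length steps, the lookup cost is independent of how many endings are stored, which is the point of building the static tree.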

Stage 2. Syntax represents the rules of combining words into correct expressions such as word combinations and sentences [35]. The task of a SA (syntactic analyser, parser) is to construct the syntactic structure of an input sentence [36]. The aspects of implementing the SA are dictionaries (data on individual units of the language), formal rules, and interaction with adjacent levels of processing (MA, semantic analysis). Often, the SA uses context-free grammar (CFG) rules: <N, T, X, R>, where N is the set of non-terminal characters, T is the set of terminal characters (N ∩ T = ∅), X is the axiom (X ∈ N), and R is the set of transformation (substitution) rules of the type Y → α, where Y ∈ N and α is a list of terminal and non-terminal characters. An example of the CFG:

N = {S, NP, VP, PP, V, N, A, P}, X = S,
T = {система, рубрикувати, україномовний, контент, за, ключовий, слово},

Table 2

A static table of common endings in the Ukrainian language

я (164,062) тися (10,379) мось (20,536) али (10,666) ному (19,112) ові (17,191) а (68,134) их (31,127)
ся (148,160) лися (10,338) лось (10,231) ними (19,089) о (90,454) сті (8,731) на (21,328) ах (20,023)
ня (9,765) теся (19,103) тись (10,366) м (119,779) мо (33,568) ості (7,636) ла (17,945) ях (9,855)
ося (30,769) лася (10,230) лись (10,337) т (2,980) го (31,445) ю (80,877) ка (11,029) них (19,092)
ься (25,211) ь (151,355) тесь (19,105) ім (31,343) ло (17,238) ою (39,616) ютю (7,598) і (34,702)
ися (21,940) сь (111,459) лась (10,229) им (31,166) ймо (11,229) ню (10,075) й (77,109) ої (31,421)
еся (19,105) ть (33,055) ють (7,606) ам (20,154) емо (11,136) ною (20,280) ні (33,241) ної (19,098)
шся (11,775) ось (30,788) и (123,402) ом (17,018) ого (31,389) кою (7,497) ий (31,136) в (32,681)
ася (10,235) ись (22,656) ми (62,080) ям (15,717) ало (10,465) нню (9,054) ала (10,610) ів (15,898)
вся (10,076) есь (19,114) ти (20,025) нім (19,333) ного (19,090) стю (7,648) е (66,988) ав (10,547)
юся (8,044) ась (10,239) ли (17,711) ним (19,093) і (90,275) у (94,504) те (32,651) ні (19,163)
ння (9,001) всь (10,016) ими (31,121) ням (9,434) ні (31,679) му (35,023) не (20,257) еш (11,138)
мося (20,532) сть (7,688) ами (20,106) нням (8,975) ві (22,543) ну (23,125) йте (11,230) е (11,466)
лося (10,233) юсь (8,047) ями (9,844) ку (11,624) ті (12,596) ній (19,549) ете (11,137) к (7,299)
ться (25,036) ють (11,222) ати (10,819) ому (31,585) ній (9,909) ний (19,042) х (61,506)

R = {S → NP VP, S → NP VP PP, NP → A N, PP → P NP, VP → V NP, NP → система, V → рубрикувати, A → україномовний, A → ключовий, N → контент, N → слово, P → за}.

The disadvantage of using the CFG is the periodic appearance of ambiguity in the SA, for example, "Система рубрикує україномовний контент за ключовими словами / The system categorizes Ukrainian-language content by keywords" (Fig. 9).
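The ambiguity of Fig. 9 can be demonstrated by counting CYK derivations for the example grammar. The sketch below converts the rules to binary form (S → NP VP PP is folded into VP → VP PP, and NP → NP PP is added to allow the noun attachment of the PP); this conversion is an assumption made for illustration.

```python
from collections import defaultdict

# Counting CYK derivations for the CFG above, converted to binary rules.
lexical = {
    "система": "NP", "рубрикує": "V", "україномовний": "A",
    "контент": "N", "за": "P", "ключовими": "A", "словами": "N",
}
binary = [("S", "NP", "VP"), ("VP", "V", "NP"), ("VP", "VP", "PP"),
          ("NP", "A", "N"), ("NP", "NP", "PP"), ("PP", "P", "NP")]

def count_parses(words):
    n = len(words)
    # chart[i][j][X] = number of ways non-terminal X derives words[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1][lexical[w]] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, b, c in binary:
                    chart[i][j][lhs] += chart[i][k][b] * chart[k][j][c]
    return chart[0][n]["S"]

sent = "система рубрикує україномовний контент за ключовими словами".split()
print(count_parses(sent))  # two parses: PP attaches to the VP or to the NP
```

The two derivations correspond exactly to the two trees of Fig. 9: за ключовими словами modifies either the verb phrase or the object noun phrase.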

If a word has several meanings, the analysis also includes several columns of the main parts of speech.

The Ontology Matcher Demo uses metadata to identify ontology objects in the text (Fig. 12). The program matches concepts of the Finnish general ontology, with approximately 28,000 notions in each language. The notions of the ontology found in the text are shown as links. Hovering the cursor over a word displays the notion to which the word refers.


Fig. 9. The CFG ambiguity: a — example 1; b — example 2

The examples of known SA systems for English-language texts are the Machinese Phrase Tagger [37] and VISL [38]. There is no comparable online information resource for the SA of Ukrainian-language texts. We will analyse the results of the SA of an English text in these resources using the following sentences: "The train went up the track out of sight, around one of the hills of burnt timber. Nick sat down on the bundle of canvas and bedding the baggage man had pitched out of the door of the baggage car." The Machinese Phrase Tagger is a text analyser that processes base forms and component structures. It also recognizes the "part of speech" classes (noun, adjective, verb, pronoun, etc.), generates a micro-indicative syntax of a word combination, and marks fragments or brackets noun word combinations (Fig. 10).

The Connexor Machinese Tokenizer is a set of program components that performs the basic tasks of text analysis at a very high speed and provides relevant word information for bulk programs. The Machinese Tokenizer splits the text into separate words and provides possible forms and classes for the words (Fig. 11). The first column displays the position of a token in the text (counted in characters); the next column gives the length of the token; the third column is the text form, and the remaining columns contain the main form(s) and the tag denoting the part of speech (PRON = pronoun, V = verb, DET = determiner, or N = noun).

Text      Baseform   Phrase syntax and part-of-speech
The       the        premodifier, determiner
train     train      nominal head, noun, single-word noun phrase
went      go         main verb, indicative past
on        on         adverbial head, adverb
up        up         preposed marker, preposition
the       the        premodifier, determiner
track     track      nominal head, noun, single-word noun phrase
out       out        adverbial head, adverb
of        of         preposed marker, preposition
sight     sight      nominal head, noun, single-word noun phrase
around    around     preposed marker, preposition
one       one        nominal head, pro-nominal
of        of         postmodifier, preposition

Fig. 10. English Machinese Phrase Tagger 4.9.1 analysis

0 4 This this PRON

5 2 is be V

8 1 a a DET

10 4 test test N test V

Fig. 11. The Machinese Tokenizer

The train went on up the track out of sight, around one of the hills of burnt timber. Nick sat down on the bundle of canvas and bedding the baggage man had pitched out of the door of the baggage car.

Fig. 12. The Ontology Matcher

Fig. 13 and 14 show the results of SA through the VISL information resource.

For the SA of Ukrainian-language texts, such information resources do not exist [39-42]. Moreover, the SA process itself is rather cumbersome for Ukrainian-language content [43-46]. Let us consider the example of the input sentence: "Він зробив це так незручно, що зачепив образок мого ангела, який висів на дубовій спинці ліжка, і що вбита муха впала мені прямо на голову" ("He made it so awkwardly that he touched the image of my angel that hung on the oak backboard of the bed, and that the killed fly fell right on my head").

Its SA example with using pre-syntax is shown in Fig. 15.

Parsing by chunks is a breakdown of sentences into non-intersecting word combinations [47-51], i. e. a flat structure rather than complete parsing, for example, (the boy (with the hat)) <-> ((the boy) with (the hat)).


Fig. 13. The structure of a tree on the VISL information resource


Fig. 14. The result of SA through the VISL information resource


Fig. 15. The result of the SA of the Ukrainian-language sentence

5. Results of studying the definition of stable word combinations when identifying keywords for text content

To isolate stable word combinations in the analysed texts and to conduct a comparative analysis, we will use 4 different methods: FREQ (frequency + morphological patterns, that is, direct counting of the number of words) [52]; the t-test [53]; the χ² statistic [54]; and LR as a likelihood ratio [55]. A collocation is a word combination that has the features of a syntactically and semantically integral unit [56]. In it, the choice of one component is based on the context, and the choice of the other depends on the choice of the first element [57]. For example, ставити умови (to set conditions): the choice of the verb ставити (to set) is determined by tradition and depends on the noun умови; with the noun пропозицію (suggestion, proposal), there will be another verb - вносити (to make). This concerns a limited (selective) combining of words: phraseologisms, idioms, proper names, and brand names. A collocation also usually includes components of toponyms, anthroponyms, and other frequently used naming conventions (for example, супермаркет «Метро» (Metro supermarket), завод «Електрон» (Electron factory)) [58]. Other names for the same phenomenon are stable (set) word combinations or phraseological units and N-grams. Examples

of collocations are the following:

- грати роль, мати значення, впливати, справляти враження;

- засоби масової..., зброя масової..., вищий навчальний...;

- глибокий старець vs. поверхневий/мілкий невеликий юнак;

- міцний чай vs. сильний чай;

- кока-кола, Microsoft Windows;

- Гола Пристань, Володимир Волинський, Нью-Йорк, Стів Джобс.

1. The FREQ method is a direct calculation of the frequency of using pairs (triples) of words. For example, FREQ for the sentence "В лгтератург описано декшька пгдходгв до автоматичного видглення стгйких словосполучень." ^ в лгтератург; лгтератург описано; описано декшька; декшька пгдходгв; пгдходгв до; до автоматичного; автоматичного видглення; видглення стгйких; стгйких словосполучень. Unfortunately, as a result of applying this method to large volumes of text, we receive information noise due to the high frequency of function words. The method also requires taking into account the frequency of use and the patterns of word combinations. An example of morphology rules in FREQ is as follows:

A N: турецький гамбіт (Turkish gambit), перша похідна (first derivative), інформаційний ресурс (information resource);

N Ng: контент аналіз (content analysis), баланс інтересів (balance of interests), контент-комерція (content commerce), контент моніторинг (content monitoring);

N Pr N: трава у дворі (grass in the yard), дрова на траві (firewood on the grass).
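As a rough illustration of the FREQ method, the following sketch (a toy of my own, not the authors' implementation) counts adjacent word pairs, letting punctuation marks break candidate bigrams:

```python
from collections import Counter
import re

def freq_bigrams(text):
    """Count adjacent word pairs (FREQ method); punctuation breaks a pair."""
    counts = Counter()
    # Split into clauses so that no bigram crosses a punctuation mark
    for clause in re.split(r"[.,;:!?()«»]", text.lower()):
        words = clause.split()
        counts.update(zip(words, words[1:]))
    return counts

text = ("в літературі описано декілька підходів до "
        "автоматичного виділення стійких словосполучень")
print(freq_bigrams(text).most_common(2))
```

A real system would additionally filter the resulting pairs by the morphological patterns above (A N, N Ng, N Pr N) to suppress function-word noise.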

2. The t-test method consists in checking statistical hypotheses using a statistical model:

- H0: the words co-occur accidentally;

- P(w1 w2) = P(w1)P(w2);

- not only the pairs but also the frequencies of the separate words that make up a pair are taken into account.

t = (x̄ − μ) / √(s² / N),

where x̄ is the empirical mean, μ is the theoretical mean, s² is the empirical variance, and N is the size of the empirical sample.

The method is not quite correct for natural language, but it gives useful results in practice. For example, for the frequency of occurrence of the stable word combination контент аналіз (content analysis) in [14], with Р(контент) = 28/1368 and Р(аналіз) = 38/1368,

Н0: p = Р(контент аналіз) = Р(контент)·Р(аналіз) = 0.020468 · 0.027778 = 5.69 · 10^-4.

In the Bernoulli scheme,

t = (x̄ − μ) / √(s² / N) = (0.013158 − 5.69·10^-4) / √(5.69·10^-4 / 1368) ≈ 19.52816,

where x̄ = 18/1368 = 0.013158 and s² = p(1 − p) ≈ p.
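The computation above can be reproduced in a few lines; the approximation s² ≈ μ follows the worked example in the text (my assumption about which variance estimate the authors used):

```python
import math

def t_score(c1, c2, c12, N):
    """t-test for a bigram under H0: P(w1 w2) = P(w1) * P(w2).

    The empirical variance s^2 is approximated by the hypothesised
    mean mu, matching the worked example in the text.
    """
    mu = (c1 / N) * (c2 / N)   # expected bigram probability under H0
    x_bar = c12 / N            # observed bigram probability
    return (x_bar - mu) / math.sqrt(mu / N)

# контент аналіз: c1 = 28, c2 = 38, c12 = 18, N = 1368
print(round(t_score(28, 38, 18, 1368), 2))  # ≈ 19.53
```

A value this far above the critical threshold (2.576 at the 0.005 level) lets the bigram be ranked as a strong collocation candidate.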

3. The Pearson χ² method is applied to 2×2 contingency tables (Table 4); the calculations do not assume normality.

Table 4

An example of using the Pearson χ² method

wi | w1 = контент | w1 ≠ контент
w2 = аналіз | 18 (content analysis) | 20 (e. g., statistical analysis)
w2 ≠ аналіз | 10 (including content monitoring) | 1320 (including statistical monitoring)

χ² = N·(O11·O22 − O12·O21)² / ((O11 + O12)·(O11 + O21)·(O12 + O22)·(O21 + O22)) =

= 1368·(18·1320 − 20·10)² / ((18 + 20)·(18 + 10)·(20 + 1320)·(10 + 1320)) ≈ 400.44106.

4. The LR (likelihood ratio) method compares the hypotheses (for p1 >> p2)

H1: P(w2 | w1) = p = P(w2 | ¬w1),
H2: P(w2 | w1) = p1 ≠ p2 = P(w2 | ¬w1),

where

p = c2 / N, p1 = c12 / c1, p2 = (c2 − c12) / (N − c1).

Then, using the binomial distribution b(m, n, p) = C(n, m)·p^m·(1 − p)^(n − m), we obtain the likelihood ratio LR:

L(H1) = b(c12, c1, p)·b(c2 − c12, N − c1, p),

L(H2) = b(c12, c1, p1)·b(c2 − c12, N − c1, p2),

log λ = log(L(H1) / L(H2)),

where −2·log λ is asymptotically distributed as χ². Written out,

L(H1) = [c1! / (c12!·(c1 − c12)!)]·p^c12·(1 − p)^(c1 − c12) · [(N − c1)! / ((c2 − c12)!·(N − c1 − (c2 − c12))!)]·p^(c2 − c12)·(1 − p)^(N − c1 − c2 + c12),

L(H2) = [c1! / (c12!·(c1 − c12)!)]·p1^c12·(1 − p1)^(c1 − c12) · [(N − c1)! / ((c2 − c12)!·(N − c1 − (c2 − c12))!)]·p2^(c2 − c12)·(1 − p2)^(N − c1 − c2 + c12).

The binomial coefficients cancel in the ratio, so

λ = [p^c12·(1 − p)^(c1 − c12)·p^(c2 − c12)·(1 − p)^(N − c1 − c2 + c12)] / [p1^c12·(1 − p1)^(c1 − c12)·p2^(c2 − c12)·(1 − p2)^(N − c1 − c2 + c12)].

For example, with c1 = 28, c12 = 18, c2 = 38 and N = 1368,

λ = [(38/1368)^18·(1 − 38/1368)^10·(38/1368)^20·(1 − 38/1368)^1320] / [(18/28)^18·(1 − 18/28)^10·((38 − 18)/(1368 − 28))^20·(1 − (38 − 18)/(1368 − 28))^1320] ≈ 4.53355·10^-23.
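Both numeric examples can be checked with a short script; `log_lambda` works in log space to avoid floating-point underflow (the binomial coefficients cancel in the ratio, so they are omitted):

```python
import math

def chi_square(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

def log_lambda(c1, c2, c12, N):
    """log(L(H1)/L(H2)); -2 * log_lambda is asymptotically chi-square."""
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)

    def log_b(k, n, q):  # log binomial likelihood without the C(n, k) term
        return k * math.log(q) + (n - k) * math.log(1 - q)

    return (log_b(c12, c1, p) + log_b(c2 - c12, N - c1, p)
            - log_b(c12, c1, p1) - log_b(c2 - c12, N - c1, p2))

print(round(chi_square(18, 20, 10, 1320), 5))        # ≈ 400.44106
print(math.exp(log_lambda(28, 38, 18, 1368)))        # λ ≈ 4.53e-23
```

Both printed values match the hand computations above for контент аналіз.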

To choose the optimal statistical method for determining stable word combinations, a Ukrainian-language text should be analysed on the basis of word stems, without taking their inflexions into account; this greatly improves the accuracy of the result.

6. Discussion of the research results on identifying stable word combinations for keyword identification

An experiment on term extraction was carried out on three technical articles [1-3] written in two languages, Ukrainian and English. The template for the experiment contained the following patterns: [Adj+N], [Part.+N], [N+N, Gen.], [N+N, Abl.], [N+'-'+N]. The experiment compared 6 methods for determining the keywords: manually by the authors (A); via the system Victana.lviv.ua [23], according to Zipf's law (B); by FREQ (C); by the t-test (D); by LR (F); by χ² (G). The analysis of the 3 articles [1-3] was conducted in Ukrainian, and the results were translated into English (Tables 5, 6). The keywords in bold are those that occurred in the results of all the methods, the italicized keywords are those obtained only through methods B-G, and the underlined keywords are those found by methods A and C-G. In the linguistic analysis for compiling alphanumeric dictionaries of two-word combinations, the following features and algorithms were used:

- bigrams were formed only within the bounds set by punctuation marks (if there was at least one punctuation mark between two words, these words were not considered a bigram);

- the alphanumeric dictionary of two-word combinations was formed on the basis of stems, so the bigrams контентний аналіз and контентного аналізу were counted as one and the same bigram;

- when analysing the inflexions of words, verbs were not taken into account in forming the alphanumeric dictionary of bigrams (verbs were treated as punctuation marks);

- before the linguistic analysis of the texts, all stop words (particles, adverbs, conjunctions) and pronouns were excluded (they were also treated as punctuation marks).
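The rules above can be sketched in Python; the stop-word list, the verb heuristic, and the toy stemmer below are illustrative stand-ins of my own (a production system would use a proper Ukrainian stemmer and full dictionaries):

```python
import re
from collections import Counter

# Illustrative stand-ins, not the authors' resources
STOP_WORDS = {"і", "та", "на", "в", "до", "не", "це"}
VERB_SUFFIXES = ("ти", "ться", "ють", "ємо")   # crude verb heuristic

def stem(word):
    """Toy stemmer: strips a few common Ukrainian inflexions."""
    for suffix in ("ого", "ому", "ою", "ий", "ій", "их", "ом", "у", "і", "а", "я"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def bigram_dictionary(text):
    counts = Counter()
    for clause in re.split(r"[.,;:!?()«»]", text.lower()):
        # Stop words and verbs act as extra "punctuation": they break pairs
        run = []
        for w in clause.split():
            if w in STOP_WORDS or w.endswith(VERB_SUFFIXES):
                counts.update(zip(map(stem, run), map(stem, run[1:])))
                run = []
            else:
                run.append(w)
        counts.update(zip(map(stem, run), map(stem, run[1:])))
    return counts

print(bigram_dictionary("Контентний аналіз, контентного аналізу.").most_common(1))
```

Note how both inflected forms collapse to the same stem bigram, so the two occurrences are counted together.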

The statistical methods make it possible to take into account the frequencies of the separate words as well as of the pairs. Each method suits different data volumes and probability ranges: χ² performs better than the t-test for larger p, where normality is violated, while for small volumes the likelihood ratio is better approximated by χ² than the 2×2 tables are. In practice, the methods are often used not to accept or reject hypotheses but to rank candidate word combinations.


Table 5

The list of frequency indices for stable word combinations in articles [1-3]

Q | A (author's) | B (as in [23]) | C, D (FREQ, t-test) | F (LR) | G (χ²)

In work [1] in Ukrainian

1 Стиль автора Стоп-слово Вщносна частота Коефщент кореляцп Коефщент кореляцп

2 Статистичний анашз Метод визначення Коефщент кореляцп Вщносна частота Вщносна частота

3 Лшгвютичний анашз Визначення стилю Стиль автора Частота появи Частота появи

4 Kвантитативна лшгвютика Стиль автора Визначення стилю Стопове слово Авторська атрибушя

5 Авторська атрибyцiя Анашз уривку Стопове слово Украшомовний текст Стиль автора

6 Визначення стилю Частота появи Украшомовний текст Стиль автора Украшомовний текст

7 Украшомовш тексти Автор тексту Частота появи Поява слова Стопове слово

8 Технолопя лшгвометрп Уривок тексту Авторська атрибушя Авторська атрибушя Визначення стилю

9 Технолопя стилеметрп' Коефщент кореляцп Поява слова Визначення стилю Поява слова

10 Технолопя глоттохронологл Дослщження тексту Автор тексту Слова уривку Слова уривку

In work [2] in Ukrainian

1 Web Mining Ключове слово Ключове слово Текстовий контент Текстовий контент

2 Kонтент-монiторiнг Контент-аналiз Текстовый контент Ключове слово Тематичний словник

3 Ключовi слова Визначена системою Web Mining Тематичний словник Ключове слово

4 Контент-аналiз Формування системою Тематичний словник Слова контенту Слова контенту

5 Стеммер Портера Web Mining Визначення ошв Ключове словосполучення Множина слгв

6 Лшгвютичний анашз Слова контенту Ключове словосполучення Визначення сшв Формування системою

7 Метод визначення Текстовый контент Слова контенту Формування системою Web Mining

8 Визначення ошв Анашз статистики Множина слв Web Mining Визначення ашв

9 Слов'янськомовш тексти Ключове словосполучення Формування системою Слова контенту Слова контенту

10 Технолопя NLP Множина слгв Контент-аналiз Контент-мошторшг Контент-мошторшг

In work [3] in Ukrainian

1 1нформацшний ресyрс Контент-аналiз Психолопчний стан Психолопчна особистють Психолопчна особистють

2 Контент-аналiз Стоп- слово Психолопчна особистють Психолопчний стан Психолопчний стан

3 Лшгвютичний анашз Тематичний словник Контент-аналiз Формyвання зрiзy Формyвання зрiзy

4 Морфолопчний анашз Пости користyвача Марковане слово Стан особuстостi Зрiз станy

5 Сощальна мережа Повщомлення користyвача Психолопчний зрiз Марковане слово Марковане слово

6 Формyвання зрiзy Kористyвач мережi Стан особuстостi Психолопчний зрiз Контент-аналiз

7 Зрiз розумШНЯ Стан особuстостi Формyвання зрiзy Контент-аналiз Психолопчний зрiз


8 Розyмiння особистост Анашзована особистють Зрiз станy Зрiз станy Стан особuстостi

9 Украïномовнi тексти Сощальна мережа Зрiз особистост Анашзована особистють Сощальна мережа

10 Big-Five Диспозицп особистосл Сощальна мережа Сощальна мережа Анашзована особистють

In work [1] in English

1 Style of the author Reference fragment Reference fragment Words fragment Words fragment

2 Statistical analysis Author's style Words fragment Reference fragment Reference fragment

3 Linguistic analysis Author's text Syntactic words Stop words Recognition author

4 Quantitative linguistics Syntactic words Frequency fragment Swadesh list Stop words

5 Author's attribution Stop words Swadesh list Recognition author Swadesh list

6 Recognition of style Formatted fragments Stop words Syntactic words Syntactic words

7 Ukrainian texts Anchor words Author style Frequency fragment Frequency fragment

8 Linguometry technology Author's language Recognition author Author's text Author's text


9 Stylemetry technology Method of anchor Author's text Anchor words Author style

10 Glottochronology technology Frequency dictionary Anchor words Author style Anchor words

In work [2] in English

1 Web Mining Text content Text content Web mining Web mining

2 Content monitoring Content analysis Web mining Text content Text content

3 Content analysis Analysis of statistics Keywords text Keywords content Keywords content

4 Porter stemmer Defined systematically Keywords defined Keywords text Analysis text

5 Linguistic analysis Stop word Analysis text Keywords defined Keywords text

6 Determining the keywords Potential keywords Keywords content Stop word Keywords defined

7 Slavic language Content monitoring Content monitoring Analysis text Stop word

8 Slavic texts Author's keywords Content analysis Author's keywords Content monitoring

9 Method for determining Keywords content Stop word Content monitoring Content analysis

10 Web technology Direct word Author's keywords Content analysis Author's keywords

In work [3] in English

1 Information resource Content analysis Content analysis Psychological personality Content analysis

2 Content analysis Psychological state Psychological personality Psychological state Psychological personality

3 Linguistic analysis Personality analysis Psychological state Content analysis Psychological state

4 Morphological analysis Personality disposition Social networks Based analysis Based analysis

5 Social network Psychological analysis Marked words State personality Psychological base

6 Status of personality Personality model State personality Psychological base State personality

7 Personality understanding Stop words Based analysis Social networks Social networks

8 Formation of the status Psychological disposition Psychological base Marked words Psychological base

9 Stop words Content monitoring State based State based Marked words

10 Method of formation Social network Based content Psychological base State based

Table 6

Differences of the methods according to the rating list of 100 stable word combinations

Q | A B C D F G (article [1]) | A B C D F G (article [2]) | A B C D F G (article [3])

For the Ukrainian articles [1-3]

A 1 0.23 0.47 0.35 0.27 0.21 1 0.27 0.51 0.39 0.31 0.25 1 0.25 0.49 0.36 0.29 0.23

B 0.23 1 0.63 0.61 0.52 0.43 0.27 1 0.65 0.63 0.57 0.47 0.25 1 0.64 0.62 0.55 0.45

C 0.47 0.63 1 0.93 0.17 0.71 0.51 0.65 1 0.94 0.25 0.73 0.49 0.64 1 0.93 0.21 0.72

D 0.35 0.61 0.93 1 0.19 0.75 0.39 0.63 0.94 1 0.26 0.77 0.36 0.62 0.93 1 0.22 0.76

F 0.27 0.52 0.17 0.19 1 0.26 0.31 0.57 0.25 0.26 1 0.39 0.29 0.55 0.21 0.22 1 0.33

G 0.21 0.43 0.71 0.75 0.26 1 0.25 0.47 0.73 0.77 0.39 1 0.23 0.45 0.72 0.76 0.33 1

For the English articles [1-3]

A 1 0.27 0.51 0.47 0.31 0.27 1 0.31 0.55 0.51 0.35 0.31 1 0.29 0.53 0.49 0.33 0.29

B 0.27 1 0.66 0.64 0.55 0.47 0.31 1 0.69 0.67 0.59 0.49 0.29 1 0.68 0.65 0.57 0.48

C 0.51 0.66 1 0.95 0.23 0.76 0.55 0.69 1 0.96 0.27 0.77 0.53 0.68 1 0.95 0.24 0.75

D 0.47 0.64 0.95 1 0.21 0.79 0.51 0.67 0.96 1 0.29 0.81 0.49 0.65 0.95 1 0.25 0.78

F 0.31 0.55 0.23 0.21 1 0.31 0.35 0.59 0.27 0.29 1 0.41 0.33 0.57 0.24 0.25 1 0.37

G 0.27 0.47 0.76 0.79 0.31 1 0.31 0.49 0.77 0.81 0.41 1 0.29 0.48 0.75 0.78 0.37 1

To compare the results, we used the Google word2vec library, which has proven itself as an alternative to TF×IDF (A1 in Table 7, given according to the template ['bigram', number of uses]). We also used the methods for searching word combinations available in Python libraries. However, for these datasets this did not work effectively, because high-quality results require a huge corpus [58]. Most interestingly, after each word of the corpus is transferred into a space whose dimension is specified by the user, the system supports analogy operations, for example, ['king' + 'woman' − 'man' = 'queen'].
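The analogy operation can be illustrated without an external library; the 3-dimensional vectors below are invented for the demonstration (real word2vec embeddings come from training on a large corpus and typically have hundreds of dimensions):

```python
import math

# Toy embeddings (assumed values for illustration only)
vectors = {
    "king":  [0.8, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.9],
    "man":   [0.2, 0.9, 0.1],
    "woman": [0.2, 0.1, 0.9],
    "grass": [0.9, 0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    return max((w for w in vectors if w not in (a, b, c)),
               key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # -> queen
```

With real embeddings, this is roughly what gensim's `most_similar(positive=['king', 'woman'], negative=['man'])` computes.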

Table 7

Differences of the other methods in ranking the frequency of occurrence of stable word combinations in articles [1-3]

Method | Language | Article [1] | Article [2] | Article [3]

1 2 3 4 5

('психолопчного стану', 16)

UA ('контент мошторшгу', 13) ('тематичного словника', 11) ('формування зрiзу', 12) ('sfx_a', 12)

('слов янськомовних', 10) ('структурну схему', 7) ('вщкритють досв^', 6) ('зрiзу психолопчного', 2)

Ai ENG ('swadesh list', 18) ('based on', 15) ('based on', 20) ('slavic language', 15) ('author s', 13) ('based on', 35) ('psychological state', 26) ('social networks', 22) ('his_her', 11) ('following structural', 8) ('big_five', 7) ('let_us', 7) ('structural scheme', 4)

(('службових', 'ошв'), 32) (('ключових', 'сшв'), 72) (('на', 'основГ), 21)

(('стопових', 'ошв'), 24) (('текстового', 'контенту'), 21) (('психолопчного', 'стану'), 18)

(('визначення', 'визначення'), 23) (('на', 'етат'), 17) (('контент', 'анашзу'), 16)

(('стилю', 'стилю'), 22) (('визначення', 'ключових'), 16) (('маркованих', 'сшв'), 15)

UA (('ошв', 'слiв'), 22) (('списку', 'сводеша'), 20) (('в', 'уривку'), 19) (('опорних', 'слiв'), 18) (('крок', '1'), 16) (('крок', '2'), 16) (('web', 'mining'), 15) (('сив', 'в'), 14) (('зрiзу', 'психолопчного'), 14) (('стану', 'особистосп'), 14) (('формування', 'зрiзу'), 12) (('особистосп', 'на'), 12)

(('стилю', 'автора'), 17) (('тематичного', 'словника'), 11) (('sfx', 'a'), 12)

A2 (('автора', 'автора'), 17) (('для', 'визначення'), 10) (('основГ, 'контент'), 11)

ENG (('of, 'the'), 107) (('author', 's'), 52) (('of, 'a'), 51) (('in', 'the'), 46) (('the', 'author'), 45) (('reference', 'fragment'), 31) (('analysis', 'of), 24) (('words', 'in'), 22) (('to', 'the'), 21) (('the', 'method'), 21) (('of, 'the'), 134) (('in', 'the'), 61) (('by', 'the'), 45) (('analysis', 'of), 39) (('of, 'a'), 31) (('the', 'text'), 30) (('the', 'system'), 30) (('to', 'the'), 29) (('of, 'keywords'), 28) (('text', 'content'), 27) (('of, 'the'), 134) (('is', 'the'), 117) (('the', 'content'), 45) (('of, 'a'), 43) (('analysis', 'of), 37) (('based', 'on'), 35) (('on', 'the'), 34) (('in', 'the'), 33) (('content', 'analysis'), 30) (('the', 'process'), 27)

UA (('сшв', 'сшв'), 88) (('стилю', 'автора'), 68) (('службових', 'сшв'), 63) (('визначення', 'стилю'), 61) (('списку', 'сводеша'), 56) (('стопових', 'сшв'), 48) (('визначення', 'автора'), 45) (('авторського', 'мовлення'), 33) (('опорних', 'сшв'), 31) (('стилю', 'стилю'), 30) (('ключових', 'сшв'), 74) (('сшв', 'в'), 24) (('web', 'mining'), 22) (('текстового', 'контенту'), 21) (('на', '2'), 20) (('визначення', 'ключових'), 19) (('ключових', 'в'), 19) (('визначення', 'сшв'), 18) (('сшв', 'для'), 18) (('на', 'крок'), 18) (('на', 'основГ), 21) (('психолопчного', 'стану'), 18) (('психолопчного', 'особистосп'), 17) (('контент', 'анашзу'), 16) (('стану', 'особистосп'), 15)

Аз (('маркованих', 'сшв'), 15) (('зрiзу', 'психолопчного'), 14) (('зрiзу', 'стану'), 14) (('зрiзу', 'особистосп'), 14) (('особистосп', 'на'), 14)

ENG (('of, 'the'), 186) (('the', 'of), 169) (('of, 'of), 152) (('of, 'a'), 81) (('the', 'the'), 75) (('the', 'author'), 66) (('and', 'of), 63) (('in', 'the'), 57) (('of', 'author'), 57) (('of, 'words'), 55) (('of, 'the'), 258) (('the', 'of), 235) (('of, 'of), 137) (('the', 'the'), 122) (('of, 'keywords'), 72) (('in', 'the'), 71) (('a', 'of'), 70) (('and', 'of'), 69) (('by', 'the'), 64) (('of, 'content'), 63) (('the', 'of), 304) (('of, 'the'), 243) (('the', 'the'), 168) (('of, 'of), 162) (('is', 'the'), 154) (('of, 'a'), 91) (('the', 'is'), 76) (('the', 'content'), 71) (('is', 'of), 61) (('and', 'the'), 57)

(('сшв', 'сшв'), 88) (('text', 'content'), 30) (('на', 'основГ), 21)

A4 (('стилю', 'автора'), 68) (('службових', 'сшв'), 63) (('визначення', 'стилю'), 61) (('списку', 'сводеша'), 56) (('стопових', 'сшв'), 48) (('web', 'mining'), 24) (('keywords', 'text'), 23) (('keywords', 'defined'), 22) (('stage', '1'), 20) (('analysis', 'text'), 18) (('психолопчного', 'стану'), 18) (('психолопчного', 'особистосп'), 17) (('контент', 'аналiзу'), 16) (('стану', 'особистосп'), 15) (('маркованих', 'сшв'), 15)

(('визначення', 'автора'), 45) (('авторського', 'мовлення'), 33) (('опорних', 'сшв'), 31) (('стилю', 'стилю'), 30) (('step', '2'), 18) (('keywords', 'content'), 17) (('content', 'monitoring'), 17) (('step', '1'), 17) (('зрiзу', 'психолопчного'), 14) (('зрiзу', 'стану'), 14) (('зрiзу', 'особистосп'), 14) (('особистосп', 'на'), 14)


(('fragment', 'fragment'), 37) (('reference', 'fragment'), 35) (('words', 'fragment'), 25) (('syntactic', 'words'), 21) (('frequency', 'fragment'), 19) (('swadesh', 'list'), 19) (('stop', 'words'), 18) (('author', 'style'), 17) (('fragment', '3'), 17) (('recognition', 'author'), 16) (('ключових', 'ошв'), 74) (('сшв', 'в'), 24) (('web', 'mining'), 22) (('текстового', 'контенту'), 21) (('на', '2'), 20) (('визначення', 'ключових'), 19) (('ключових', 'в'), 19) (('визначення', 'сшв'), 18) (('ошв', 'для'), 18) (('на', 'крок'), 18) (('content', 'analysis'), 40) (('psychological', 'personality'), 27) (('psychological', 'state'), 26) (('social', 'networks'), 22) (('marked', 'words'), 21) (('state', 'personality'), 20) (('based', 'analysis'), 19) (('psychological', 'based'), 18) (('state', 'based'), 18) (('based', 'content'), 18)

After the transfer into a space of a given dimension, each word becomes a vector, so words support basic relational operations such as addition, subtraction, and multiplication. Besides, let us consider the analysis through bigrams (A2 in Table 7) and skip-grams (A3 in Table 7). The results are better than those obtained through word2vec; the best option is to analyse skip-grams with a window of 3 and also to eliminate the English stop words (A4 in Table 7). However, these results are still rather far from those listed in Table 5. The outcome is worse because punctuation marks are not identified and stop words are treated in the linguistic analysis as content units of speech.
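The skip-gram variant of the dictionary (the A3/A4 idea) can be sketched as follows; the English stop-word list here is a minimal illustrative subset:

```python
from collections import Counter

STOP_WORDS = {"of", "the", "a", "in", "is", "on", "by", "to", "and"}

def skipgrams(words, k=3):
    """Ordered word pairs within a window of k (skip-grams), stop words removed."""
    words = [w for w in words if w.lower() not in STOP_WORDS]
    pairs = Counter()
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + 1 + k, len(words))):
            pairs[(w, words[j])] += 1
    return pairs

text = "the analysis of the text content is based on content analysis".split()
print(skipgrams(text).most_common(3))
```

Removing the stop words before pairing is what separates the A4 variant from A3 and keeps pairs such as ('of', 'the') out of the head of the ranking.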

7. Conclusions

1. The study has developed a method for determining stable word combinations while identifying keywords of text content in standard passages of an author's text. For this purpose, the well-known statistical methods for determining stable word combinations in keyword identification were analysed, and the factors influencing the quality of identifying stable word combinations during the preliminary linguistic processing of texts were determined. A comparative analysis of the corresponding methods was carried out on the basis of the obtained results. The developed method consists in applying Zipf's law to the formation of stable word combinations as keywords, taking into account the following rules of preliminary linguistic processing of the text:

- all stop words are removed; bigrams are formed only within the bounds set by punctuation marks; verbs and pronouns are treated as punctuation marks;

- verbs are recognized by their inflexions; bigrams are formed on the basis of stems, without taking inflexions into account;

- adjectives are identified by their inflexions, and it is assumed that an adjective can occupy only the first place in a bigram of a Ukrainian text.
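One simple way to apply Zipf's law at this stage (a sketch under my own assumptions; the article does not fix the band boundaries) is to rank the stem bigrams by frequency and keep a band of the ranked list, since the head of a Zipf-distributed ranking is dominated by residual function-word noise and the tail by accidental pairs:

```python
from collections import Counter

def zipf_keywords(bigram_counts, lower=0.2, upper=0.8):
    """Pick keyword candidates from a band of the Zipf-ranked list.

    The band boundaries (20%-80% of the ranked list) are illustrative
    assumptions, not values fixed by the article.
    """
    ranked = [bg for bg, _ in bigram_counts.most_common()]
    lo = int(len(ranked) * lower)
    hi = int(len(ranked) * upper)
    return ranked[lo:hi]

counts = Counter({("ключових", "слів"): 74, ("текстового", "контенту"): 21,
                  ("web", "mining"): 22, ("визначення", "слів"): 18,
                  ("тематичного", "словника"): 11})
print(zipf_keywords(counts))  # the over-frequent head bigram is skipped
```

On real data the counts would come from the bigram dictionary built with the preprocessing rules listed above.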

This allowed taking into account the peculiarities of constructing keywords in the Ukrainian language, regardless of the inflexions within the word combinations, and brought the results closer to the set of keywords identified by the authors. This increases the relevancy of the analysed content by a factor of 1.4.

2. A program set has been developed to identify stable word combinations as keywords. An approach has been suggested for devising linguistic content analysis software to determine stable word combinations while identifying keywords of Ukrainian and English text-based contents. The peculiarity of the approach is that the linguistic statistical analysis of lexical units is adapted to the peculiarities of Ukrainian-language and English-language words/texts.

The developed information system, which is based on the identified stable word combinations, helps convey the analysed content more accurately, in accordance with the author's idea of it. This can produce a more accurate search result for the user and better render the author's opinion about the content under analysis.

3. The results of the experimental testing of the proposed method of content analysis of English and Ukrainian texts for determining stable word combinations when identifying the keywords of technical texts have been verified.

The developed method conveys the content of the analysed text through the identified keywords, in the form of stable word combinations, more accurately than other known resources do. Further experimental research requires approbation of the proposed method on other categories of texts: scientific, humanities, belletristic, journalistic, etc.

References

1. Development of a method for the recognition of author's style in the Ukrainian language texts based on linguometry, stylemetry and glottochronology / Lytvyn V., Vysotska V., Pukach P., Bobyk I., Uhryn D. // Eastern-European Journal of Enterprise Technologies. 2017. Vol. 4, Issue 2 (88). P. 10-19. doi: 10.15587/1729-4061.2017.107512

2. Development of a method for determining the keywords in the slavic language texts based on the technology of web mining / Lytvyn V., Vysotska V., Pukach P., Brodyak O., Ugryn D. // Eastern-European Journal of Enterprise Technologies. 2017. Vol. 2, Issue 2 (86). P. 14-23. doi: 10.15587/1729-4061.2017.98750

3. The method of formation of the status of personality understanding based on the content analysis / Lytvyn V., Pukach P., Bobyk I., Vysotska V. // Eastern-European Journal of Enterprise Technologies. 2016. Vol. 5, Issue 2 (83). P. 4-12. doi: 10.15587/1729-4061.2016.77174

4. Mobasher B. Data mining for web personalization // The adaptive web. 2007. P. 90-135. doi: 10.1007/978-3-540-72079-9_3

5. Dinuca C. E., Ciobanu D. Web Content Mining // Annals of the University of Petroșani. Economics. 2012. Vol. 12, Issue 1. P. 85-92.


6. Xu G., Zhang Y., Li L. Web content mining // Web Mining and Social Networking. 2011. P. 71-87. doi: 10.1007/978-1-4419-7735-9_4

7. Khomytska I., Teslyuk V. The Method of Statistical Analysis of the Scientific, Colloquial, Belles-Lettres and Newspaper Styles on the Phonological Level // Advances in Intelligent Systems and Computing. 2017. Vol. 512. P. 149-163. doi: 10.1007/978-3-319-45991-2_10


8. Khomytska I., Teslyuk V. Specifics of phonostatistical structure of the scientific style in English style system // 2016 XIth International Scientific and Technical Conference Computer Sciences and Information Technologies (CSIT). 2016. doi: 10.1109/stc-csit.2016.7589887

9. Avtomaticheskaya obrabotka tekstov na estestvennom yazyke i komp'yuternaya lingvistika / Bol'shakova E., Klyshinskiy E., Lande D., Noskov A., Peskova O., Yagunova E. Moscow: MIEM, 2011. 272 p.

10. Anisimov A., Marchenko A. Sistema obrabotki tekstov na estestvennom yazyke // Iskusstvennyy intellekt. 2002. Issue 4. P. 157-163.

11. Perebyinis V. Matematychna linhvistyka. Ukrainska mova. Kyiv, 2000. P. 287-302.

12. Buk S. Osnovy statystychnoi lingvistyky. Lviv, 2008. 124 p.

13. Perebyinis V. Statystychni metody dlia linhvistiv. Vinnytsia, 2013. 176 p.

14. Braslavskiy P. I. Intellektual'nye informacionnye sistemy. URL: http://www.kansas.ru/ai2006/

15. Lande D., Zhyhalo V. Pidkhid do rishennia problem poshuku dvomovnoho plahiatu // Problemy informatyzatsii ta upravlinnia. 2008. Issue 2 (24). P. 125-129.

16. Varfolomeev A. Psihosemantika slova i lingvostatistika teksta. Kaliningrad, 2000. 37 p.

17. Sushko S., Fomychova L., Barsukov Ye. Chastoty povtoriuvanosti bukv i bihram u vidkrytykh tekstakh ukrainskoiu movoiu // Ukrainian Information Security Research Journal. 2010. Vol. 12, Issue 3 (48). doi: 10.18372/2410-7840.12.1968

18. Kognitivnaya stilometriya: k postanovke problemy. URL: http://www.manekin.narod.ru/hist/styl.htm

19. Kocherhan M. Vstup do movoznavstva. Kyiv, 2005.

20. Rodionova E. Metody atribucii hudozhestvennyh tekstov // Strukturnaya i prikladnaya lingvistika. 2008. Issue 7. P. 118-127.

21. Meshcheryakov R. V., Vasyukov N. S. Modeli opredeleniya avtorstva teksta. URL: http://db.biysk.secna.ru/conference/conference. conference.doc_download?id_thesis_dl=427

22. Morozov N. A. Lingvisticheskie spektry. URL: http://www.textology.ru/library/book.aspx?bookId=1&textId=3

23. Victana. URL: http://victana.lviv.ua/index.php/kliuchovi-slova

24. Method of Integration and Content Management of the Information Resources Network / Kanishcheva O., Vysotska V., Chyrun L., Gozhyj A. // Advances in Intelligent Systems and Computing. 2017. Vol. 689. P. 204-216. doi: 10.1007/978-3-319-70581-1_14

25. Information resources processing using linguistic analysis of textual content / Su J., Vysotska V., Sachenko A., Lytvyn V., Burov Y. // 2017 9th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS). 2017. doi: 10.1109/idaacs.2017.8095038

26. The risk management modelling in multi project environment / Lytvyn V., Vysotska V., Veres O., Rishnyak I., Rishnyak H. // 2017 12th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT). 2017. doi: 10.1109/stc-csit.2017.8098730

27. Peculiarities of content forming and analysis in internet newspaper covering music news / Korobchinsky M., Chyrun L., Chyrun L., Vysotska V. // 2017 12th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT). 2017. doi: 10.1109/stc-csit.2017.8098735

28. Intellectual system design for content formation / Naum O., Chyrun L., Vysotska V., Kanishcheva O. // 2017 12th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT). 2017. doi: 10.1109/stc-csit.2017.8098753

29. The Contextual Search Method Based on Domain Thesaurus / Lytvyn V., Vysotska V., Burov Y., Veres O., Rishnyak I. // Advances in Intelligent Systems and Computing. 2017. Vol. 689. P. 310-319. doi: 10.1007/978-3-319-70581-1_22

30. Marchenko O. Modeliuvannia semantychnoho kontekstu pry analizi tekstiv na pryrodniy movi // Visnyk Kyivskoho universytetu. 2006. Issue 3. P. 230-235.

31. Jivani A. G. A Comparative Study of Stemming Algorithms // Int. J. Comp. Tech. Appl. 2011. Vol. 2, Issue 6. P. 1930-1938.

32. Using Structural Topic Modeling to Detect Events and Cluster Twitter Users in the Ukrainian Crisis / Mishler A., Crabb E. S., Paletz S., Hefright B., Golonka E. // Communications in Computer and Information Science. 2015. Vol. 528. P. 639-644. doi: 10.1007/978-3-319-21380-4_108

33. Rodionova E. Metody atribucii hudozhestvennyh tekstov // Strukturnaya i prikladnaya lingvistika. 2008. Issue 7. P. 118-127.

34. Bubleinyk L. Osoblyvosti khudozhnoho movlennia. Lutsk, 2000. 179 p.

35. Kowalska K., Cai D., Wade S. Sentiment Analysis of Polish Texts // International Journal of Computer and Communication Engineering. 2012. Vol. 1, Issue 1. P. 39-42. doi: 10.7763/ijcce.2012.v1.12

36. Kotsyba N. The current state of work on the Polish-Ukrainian Parallel Corpus (PolUKR) // Organization and Development of Digital Lexical Resources. 2009. P. 55-60.

37. Machinese Phrase Tagger. URL: http://www.connexor.com

38. VISL. URL: http://visl.sdu.dk

39. Classification Methods of Text Documents Using Ontology Based Approach / Lytvyn V., Vysotska V., Veres O., Rishnyak I., Rishnyak H. // Advances in Intelligent Systems and Computing. 2017. Vol. 512. P. 229-240. doi: 10.1007/978-3-319-45991-2_15

40. Vysotska V. Linguistic analysis of textual commercial content for information resources processing // 2016 13th International Conference on Modern Problems of Radio Engineering, Telecommunications and Computer Science (TCSET). 2016. doi: 10.1109/tcset.2016.7452160

41. Vysotska V., Chyrun L., Chyrun L. Information technology of processing information resources in electronic content commerce systems // 2016 XIth International Scientific and Technical Conference Computer Sciences and Information Technologies (CSIT). 2016. doi: 10.1109/stc-csit.2016.7589909

42. Vysotska V., Chyrun L., Chyrun L. The commercial content digest formation and distributional process // 2016 XIth International Scientific and Technical Conference Computer Sciences and Information Technologies (CSIT). 2016. doi: 10.1109/stc-csit.2016.7589902

43. Content linguistic analysis methods for textual documents classification / Lytvyn V., Vysotska V., Veres O., Rishnyak I., Rishnyak H. // 2016 XIth International Scientific and Technical Conference Computer Sciences and Information Technologies (CSIT). 2016. doi: 10.1109/stc-csit.2016.7589903

44. Lytvyn V., Vysotska V. Designing architecture of electronic content commerce system // 2015 Xth International Scientific and Technical Conference "Computer Sciences and Information Technologies" (CSIT). 2015. doi: 10.1109/stc-csit.2015.7325446

45. Vysotska V., Chyrun L. Analysis features of information resources processing // 2015 Xth International Scientific and Technical Conference "Computer Sciences and Information Technologies" (CSIT). 2015. doi: 10.1109/stc-csit.2015.7325448

46. Application of sentence parsing for determining keywords in Ukrainian texts / Vasyl L., Victoria V., Dmytro D., Roman H., Zoriana R. // 2017 12th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT). 2017. doi: 10.1109/stc-csit.2017.8098797

47. Maksymiv O., Rak T., Peleshko D. Video-based Flame Detection using LBP-based Descriptor: Influences of Classifiers Variety on Detection Efficiency // International Journal of Intelligent Systems and Applications. 2017. Vol. 9, Issue 2. P. 42-48. doi: 10.5815/ ijisa.2017.02.06

48. Peleshko D., Rak T., Izonin I. Image Superresolution via Divergence Matrix and Automatic Detection of Crossover // International Journal of Intelligent Systems and Applications. 2016. Vol. 8, Issue 12. P. 1-8. doi: 10.5815/ijisa.2016.12.01

49. The results of software complex OPTAN use for modeling and optimization of standard engineering processes of printed circuit boards manufacturing / Bazylyk O., Taradaha P., Nadobko O., Chyrun L., Shestakevych T. // 2012 11th International Conference on "Modern Problems of Radio Engineering, Telecommunications and Computer Science" (TCSET). 2012. P. 107-108.

50. The software complex development for modeling and optimizing of processes of radio-engineering equipment quality providing at the stage of manufacture / Bondariev A., Kiselychnyk M., Nadobko O., Nedostup L., Chyrun L., Shestakevych T. // TCSET'2012. 2012. P. 159.

51. Riznyk V. Multi-modular Optimum Coding Systems Based on Remarkable Geometric Properties of Space // Advances in Intelligent Systems and Computing. 2017. Vol. 512. P. 129-148. doi: 10.1007/978-3-319-45991-2_9

52. Development and Implementation of the Technical Accident Prevention Subsystem for the Smart Home System / Teslyuk V., Beregovskyi V., Denysyuk P., Teslyuk T., Lozynskyi A. // International Journal of Intelligent Systems and Applications. 2018. Vol. 10, Issue 1. P. 1-8. doi: 10.5815/ijisa.2018.01.01

53. Basyuk T. The main reasons of attendance falling of internet resource // 2015 Xth International Scientific and Technical Conference "Computer Sciences and Information Technologies" (CSIT). 2015. doi: 10.1109/stc-csit.2015.7325440

54. Pasichnyk V., Shestakevych T. The model of data analysis of the psychophysiological survey results // Advances in Intelligent Systems and Computing. 2017. Vol. 512. P. 271-281. doi: 10.1007/978-3-319-45991-2_18

55. Zhezhnych P., Markiv O. Linguistic Comparison Quality Evaluation of Web-Site Content with Tourism Documentation Objects // Advances in Intelligent Systems and Computing. 2018. Vol. 689. P. 656-667. doi: 10.1007/978-3-319-70581-1_45

56. Burov E. Complex ontology management using task models // International Journal of Knowledge-based and Intelligent Engineering Systems. 2014. Vol. 18, Issue 2. P. 111-120. doi: 10.3233/kes-140291

57. Smart Data Integration by Goal Driven Ontology Learning / Chen J., Dosyn D., Lytvyn V., Sachenko A. // Advances in Big Data. 2016. P. 283-292. doi: 10.1007/978-3-319-47898-2_29

58. Google - word2vec. URL: https://github.com/danielfrg/word2vec/blob/master/examples/word2vec.ipynb
