
CC BY



UDK 004.912

Davydov M. V.¹, Lozynska O. V.², Pasichnyk V. V.³

¹ PhD, Associate Professor of Information Systems and Networks Department, Lviv Polytechnic National University, Lviv, Ukraine

² PhD, Assistant of Information Systems and Networks Department, Lviv Polytechnic National University, Lviv, Ukraine

³ Dr.Sc., Professor, Professor of Information Systems and Networks Department, Lviv Polytechnic National University, Lviv, Ukraine

EFFECTIVE ALGORITHM FOR PARSING SENTENCES USING SEMANTICALLY ATTRIBUTED WEIGHTED AFFIX CONTEXT-FREE GRAMMARS

Context. The problem of increasing efficiency of affix grammars over a finite lattice (AGFL) is considered. AGFL is a context-free grammar with flexible and compact form of productions for parsing texts in natural languages.

Objective. The goal of the work is to increase the efficiency of parsing sentences by means of AGFL with a modification that adds semantical attributes to the productions and introduces a new form of production called the "template production". This modification decreases the number of productions required to describe a language and makes it possible to reduce the computational complexity of the parsing algorithm.

Method. A mathematical model of the template production is developed, and a theorem is proved stating that a normal form of the template production exists and that the normalization procedure produces an equivalent grammar. The normal form is used to increase the efficiency of parsing Ukrainian sentences. The template production helps to represent ontology-based rules in a short and computationally inexpensive way. The normal form of the template production is studied, and an effective algorithm for parsing sentences is proposed. The worst-case complexity of the proposed algorithm is $O(n^3 \cdot m_p^3 \cdot m_r)$, where $n$ is the length of the input string of terminals, $m_p$ is the maximum number of combinations of symbol and attributes that can produce the same string of terminals, and $m_r$ is the maximum number of productions that have the same starting non-terminal symbol in the right part. The parsing time turned out to grow almost linearly with the number of words in a sentence when parsing sentences from the test database of Ukrainian fiction.

Results. The developed method has been implemented in the UkrParser software, which is available as open source on GitHub.

Conclusions. The developed algorithm was tested on the database of Ukrainian sentences and demonstrated parsing about ten times faster than the Stanford parser. Future research can focus on the development of grammatically attributed ontologies for a wider set of topics, which should improve the results of semantical sentence parsing.

Keywords: weighted affix context-free grammar, semantic parsing, ontology-driven sentence parsing, template productions.

NOMENCLATURE

$A$ - a set of all affixes;

$A(D_i)$ - a set of affixes that constitute domain $D_i$;

$2^A$ - a power set of $A$;

$D$ - a set of disjoint affix domains $D_i$, $D = \{D_i\}$;

$D_{GENDER}$ - a domain for gender;

$D_{NUMBER}$ - a domain for number;

$D_{CASE}$ - a domain for case;

$D_{SEM}$ - a domain for semantic affixes;

$G$ - a weighted affix grammar over a finite lattice, $G = (T, V, S, D, P)$;

$n$ - a length of the input sequence of terminals. It is equal to the number of words in the input sentence;

$P$ - a set of template and regular productions;

$S$ - a starting symbol, $S \in V \setminus T$;

$T$ - a set of all terminal symbols $t_i$;

$V$ - a set of all symbols;

$w \in R^+$ - a multiplicative weight of a production. The weight symbol is omitted where it is equal to 1.

INTRODUCTION

The problem of natural text parsing arises in such areas of computer applications as text summarization, machine translation, information retrieval, document classification, human-computer interaction, question answering systems, social networks monitoring, expert systems, etc.

© Davydov M. V., Lozynska O. V., Pasichnyk V. V., 2017

DOI 10.15588/1607-3274-2017-4-14

The task of semantic parsing is a complex problem of artificial intelligence because its comprehensive solution requires the construction of a complete human knowledge model. Although such models are currently under development [1], no viable solution is available yet.

We propose an approach for partial syntactic and semantic parsing by means of weighted affix grammar over a finite lattice (WAGFL). WAGFL combines the benefits of probabilistic context-free grammars (PCFG) [2] and affix grammars over a finite lattice (AGFL) developed by C.H.A. Koster [3]. Weighted and stochastic grammars are known to be equally expressive [4], but the approach based on weights is less restrictive and thus more flexible.

This article supports an approach where semantic analysis is integrated into the syntax parsing algorithm. This approach helps to decrease the number of intermediate constructions that have to be considered. It is especially important for languages with flexible word order, such as Ukrainian and other Slavonic languages.

The main contribution of this work is an approach to the effective representation of weighted affix context-free grammar using a special form of "template productions". A brief review of the existing methods is given in Section 2, "template productions" and the algorithm for parsing sentences are introduced in Section 3, experiments are described in Section 4, parsing results are given in Section 5, and the results are discussed in Section 6.

1 PROBLEM STATEMENT

The problem is to develop effective methods for integrating semantic attributes into productions of weighted affix context-free grammar and to develop a computationally effective algorithm for parsing sentences. The sentence is considered as a list of words $w_1 w_2 \ldots w_n$ that is converted to a sequence of terminal symbols $t_1 t_2 \ldots t_n$ of the WAGFL grammar. Sentence parsing is formulated as the problem of finding a sequence of productions that has the maximum weight and can be applied sequentially to the starting symbol $S$ to produce the given sequence of terminals $t_1 t_2 \ldots t_n$. The weight of the sequence is calculated as the product of the weights of all productions it contains.
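The objective stated above can be written compactly as follows. The symbols $\pi$ (a sequence of productions), $\Pi(t_1 t_2 \ldots t_n)$ (the set of all sequences that derive the given terminals from $S$), and $w(p)$ (the weight of production $p$) are notation assumed here for illustration; they do not appear in the nomenclature of the paper.

```latex
\hat{\pi} \;=\; \arg\max_{\pi \,\in\, \Pi(t_1 t_2 \ldots t_n)} \;\prod_{p \,\in\, \pi} w(p)
```

Because the weights are multiplied, a production with weight 1 leaves the score of a derivation unchanged, which matches the convention that the weight symbol is omitted when it equals 1.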

2 REVIEW OF THE LITERATURE

The problem of syntactic sentence parsing has been studied for a long time. Among the many methods of parsing sentences, the approach based on generative grammars introduced by Noam Chomsky [5] is one of the most studied. Extended affix grammars (EAG) [6] and probabilistic context-free grammars (PCFG) [2] are fundamental extensions of generative grammars that are widely used in linguistic applications today.

Affix grammars, which belong to the family of two-level grammars, are a subset of augmented grammars. Productions of affix grammars are the productions that are extended with attributes. The domain of attributes is defined by a meta-grammar.

The efficient formalism of affix grammars over a finite lattice (AGFL) and its parsing algorithm were developed by C. H. A. Koster [3]. The formalism imposes restrictions on the set of productions and attributes to make parsing computationally inexpensive, yet it remains expressive enough to parse most natural sentence structures. AGFL extensions that are based on probabilities were also studied by T. C. Smith and J. G. Cleary [7].

Our approach is based on weighted affix grammar over a finite lattice. It is close to the method introduced by C.H.A. Koster. However, we formulate the lattice grammar and productions in a slightly different way, which leads to a shorter form of productions and a more compact sentence parsing algorithm.

3 MATERIALS AND METHODS

For the purpose of partial semantic-syntactic parsing of sentences, a new parser was developed. It is based on the weighted affix grammar over a finite lattice. This grammar extends symbols of generative grammar with affixes, which can be used to decrease the number of productions required to describe a language. Our definition of the affix grammar over a finite lattice is slightly different from the original one given by C.H.A. Koster, but it follows the same idea. This new definition was used to prove that some transformation rules can be applied to the grammar to speed up the process of parsing.

The weighted affix grammar over a finite lattice $G$ is defined as a 5-tuple $(T, V, S, D, P)$.

Regular productions have the form

$$(V \times 2^A)^+ \xrightarrow{w} (V \times 2^A)^+,$$

where $A = A(D) = \bigcup_{D_j \in D} A(D_j)$, and $(V \times 2^A)^+$ represents all non-empty strings of attributed symbols $s_1 s_2 \ldots s_k$ with $k > 0$ and $s_j \in V \times 2^A$.

Terminal symbols $t_i \in T$ do not have attributes. They usually represent words of parsed sentences. For example, the word "student" can be a male or female singular noun until it is known from the context. In terms of generative grammar, it can be written in the following way:

$$(noun, \{a_{FEMALE}, a_{SINGULAR}, a_{STUDENT}\}) \rightarrow (student, \emptyset),$$

$$(noun, \{a_{MALE}, a_{SINGULAR}, a_{STUDENT}\}) \rightarrow (student, \emptyset).$$

An alternative form is

$$(noun, \{a_{FEMALE}, a_{MALE}, a_{SINGULAR}, a_{STUDENT}\}) \rightarrow (student, \emptyset).$$

It represents both cases given above. Productions that generate terminal symbols are added by a morphological parser. If some word is a homograph, the morphological parser generates one production for every meaning of the word. The weight of each production represents the admissibility of this meaning in the parsed context.

In the example above, $a_{FEMALE}$, $a_{MALE}$, and $a_{SINGULAR}$ are grammatical attributes, and $a_{STUDENT}$ is a semantical attribute. Semantical attributes are elements of domain $D_{SEM}$.
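The compact form of the "student" production can be illustrated with a minimal Python sketch. It assumes that affix sets are modelled as `frozenset` objects of strings; the function name `merge_readings` and the affix strings are hypothetical and are not taken from UkrParser. The sketch ignores production weights, so it only mirrors the case shown above, where both readings carry the default weight 1.

```python
# Hypothetical affix names; an affix set is modelled as a frozenset of strings.
FEMALE, MALE, SINGULAR, STUDENT = "a_FEMALE", "a_MALE", "a_SINGULAR", "a_STUDENT"


def merge_readings(readings):
    """Collapse several (symbol, affixes) readings of one word into a single
    affix set per symbol by taking the union of the affix sets."""
    merged = {}
    for symbol, affixes in readings:
        merged[symbol] = merged.get(symbol, frozenset()) | frozenset(affixes)
    return merged


# The two readings of the word "student" from the text.
readings_of_student = [
    ("noun", {FEMALE, SINGULAR, STUDENT}),
    ("noun", {MALE, SINGULAR, STUDENT}),
]
compact = merge_readings(readings_of_student)
```

Here `compact` maps `"noun"` to the four-element set that corresponds to the alternative form of the production.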

Providing regular productions for all possible combinations of affixes can be inefficient. Thus, a template form of production is introduced. This form is tailored for the needs of computationally efficient language processing.

The template production has the form

$$(v_1, D_{inh\,1}, A_{set\,1}) \ldots (v_k, D_{inh\,k}, A_{set\,k}) \xrightarrow{w} (v'_1, D_{uni\,1}, A_{req\,1}) \ldots (v'_m, D_{uni\,m}, A_{req\,m}),$$

where $D_{inh\,1}, D_{inh\,2}, \ldots, D_{inh\,k} \subseteq D$ are domains whose affixes are inherited by symbols $v_1, v_2, \ldots, v_k$; $D_{uni\,1}, D_{uni\,2}, \ldots, D_{uni\,m} \subseteq D$ are domains whose affixes should be common for symbols $v'_1, v'_2, \ldots, v'_m$; $A_{set\,1}, A_{set\,2}, \ldots, A_{set\,k} \subseteq A$ are additional affixes for symbols in the left part of the production; $A_{req\,1}, A_{req\,2}, \ldots, A_{req\,m} \subseteq A$ are required affixes for symbols in the right part of the production; and $w$ is a multiplicative weight of the production.

The template form is equivalent to a set of regular productions by definition. Consider the following template and regular productions (1) and (2):

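$$q = (v_1, D_{inh\,1}, A_{set\,1}) \ldots (v_k, D_{inh\,k}, A_{set\,k}) \xrightarrow{w} (v'_1, D_{uni\,1}, A_{req\,1}) \ldots (v'_m, D_{uni\,m}, A_{req\,m}), \quad (1)$$

$$p = (v_1, A_1) \ldots (v_k, A_k) \xrightarrow{w} (v'_1, A'_1) \ldots (v'_m, A'_m). \quad (2)$$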

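Let $A_{uni}(p,q)$ denote the intersection of all attributes that should be uniform in the right part of regular production $p$ in order to conform to template production $q$:

$$A_{uni}(p,q) = \left( \bigcap_{i=1}^{m} \left( A'_i \cup \bar{A}(D_{uni\,i}) \right) \right) \cap A(D_{uni\,1..m}),$$

where $(v'_i, A'_i) \in V \times 2^A$, $\bar{A}(D_{uni\,i}) = A \setminus A(D_{uni\,i})$, and $D_{uni\,1..m} = D_{uni\,1} \cup D_{uni\,2} \cup \ldots \cup D_{uni\,m}$.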

We say that regular production p conforms to template production q if requirements R1-R3 are met:

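R1. $(\forall D_j \in D_{uni\,1..m})\; A_{uni}(p,q) \cap A(D_j) \neq \emptyset$;

R2. $(\forall i \in 1 \ldots m)\; A_{req\,i} \subseteq A'_i$;

R3. $(\forall i \in 1 \ldots k)\; A_i = A_{set\,i} \cup \left( A_{uni}(p,q) \cap A(D_{inh\,i}) \right)$.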

Requirement R1 assures that for each unified domain there is at least one common affix. Requirement R2 describes how required attributes are treated, and requirement R3 states how attributes of symbols in the left part of the production are obtained.
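Requirements R1-R3 can be checked mechanically. The following Python sketch is an illustration under stated assumptions, not the UkrParser implementation: the toy domains, the tuple layout of productions, and the names `affixes_of`, `a_uni`, and `conforms` are all hypothetical, and production weights are ignored. A regular production is a pair of lists of `(symbol, affixes)`, and a template production is a pair of lists of `(symbol, domains, affixes)`.

```python
# Hypothetical toy domains standing in for D_GENDER and D_NUMBER.
DOMAINS = {
    "GENDER": frozenset({"a_MALE", "a_FEMALE"}),
    "NUMBER": frozenset({"a_SINGULAR", "a_PLURAL"}),
}
ALL_AFFIXES = frozenset().union(*DOMAINS.values())


def affixes_of(domain_names):
    """A(D'): all affixes that belong to the named domains."""
    return frozenset().union(*(DOMAINS[d] for d in domain_names))


def a_uni(right_p, right_q):
    """A_uni(p, q) together with the set of all unified domain names."""
    common, unified = ALL_AFFIXES, set()
    for (_, a_i), (_, d_uni, _) in zip(right_p, right_q):
        # A'_i united with the complement of A(D_uni i)
        common &= a_i | (ALL_AFFIXES - affixes_of(d_uni))
        unified |= d_uni
    return common & affixes_of(unified), unified


def conforms(p, q):
    """True if regular production p meets R1-R3 for template production q."""
    (left_p, right_p), (left_q, right_q) = p, q
    if ([v for v, _ in left_p] != [v for v, _, _ in left_q]
            or [v for v, _ in right_p] != [v for v, _, _ in right_q]):
        return False  # the symbols themselves must match
    uni, unified = a_uni(right_p, right_q)
    r1 = all(uni & DOMAINS[d] for d in unified)
    r2 = all(a_req <= a_i for (_, a_i), (_, _, a_req) in zip(right_p, right_q))
    r3 = all(a_i == a_set | (uni & affixes_of(d_inh))
             for (_, a_i), (_, d_inh, a_set) in zip(left_p, left_q))
    return r1 and r2 and r3


GN, E = frozenset({"GENDER", "NUMBER"}), frozenset()
q = ([("NP", GN, E)], [("ADJ", GN, E), ("N", GN, E)])
p_ok = ([("NP", frozenset({"a_FEMALE", "a_SINGULAR"}))],
        [("ADJ", frozenset({"a_FEMALE", "a_SINGULAR"})),
         ("N", frozenset({"a_FEMALE", "a_MALE", "a_SINGULAR"}))])
p_bad = ([("NP", frozenset({"a_SINGULAR"}))],
         [("ADJ", frozenset({"a_MALE", "a_SINGULAR"})),
          ("N", frozenset({"a_FEMALE", "a_SINGULAR"}))])
```

In this toy setting `p_ok` conforms to `q`, while `p_bad` violates R1 because the adjective and the noun share no gender affix.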

For instance, the Ukrainian equivalent of the English noun phrase "BEAUTIFUL STREET OF THE CITY" is "ГАРНА ВУЛИЦЯ МІСТА". In this noun phrase, the case, gender, and number of the adjective (ГАРНА) agree with the case, gender, and number of the first noun (ВУЛИЦЯ), and the case of the second noun (МІСТА) should be GENITIVE. The semantical attribute for the whole phrase is taken from the word "STREET". The template production for this phrase in Ukrainian is

$$(NP, \{D_{GENDER}, D_{NUMBER}, D_{CASE}, D_{SEM}\}, \emptyset) \rightarrow (ADJ, \{D_{GENDER}, D_{NUMBER}, D_{CASE}\}, \emptyset)\,(NP, \{D_{GENDER}, D_{NUMBER}, D_{CASE}, D_{SEM}\}, \emptyset)\,(NP, \emptyset, \{a_{GENITIVE}\}),$$

and the English equivalent is

$$(NP, \{D_{NUMBER}, D_{SEM}\}, \emptyset) \rightarrow (ADJ, \emptyset, \emptyset)\,(NP, \{D_{NUMBER}, D_{SEM}\}, \emptyset)\,(PREP, \emptyset, \{a_{OF}\})\,(NP, \emptyset, \emptyset),$$

where NP stands for noun phrase, ADJ stands for adjective, PREP stands for preposition, and $D_{GENDER}$, $D_{NUMBER}$, $D_{CASE}$, $D_{SEM}$ are domains for gender, number, case, and semantic affixes, respectively.

The Normal Form of Template Productions. The length of the right-hand side of a production is called its rank. Effective parsing of sentences using generative grammars can be achieved when the grammar is in Chomsky normal form (CNF) - the form that ensures that all productions of the grammar have the rank not more than 2. Template productions can also be converted to a form that has at most two symbols in the right part. This conversion is performed by applying simplification steps to all productions that have the rank greater than 2. Every step takes one template production with the rank m> 2 and produces two template productions - one with the rank 2 and one with the rank m -1. The process stops when there are no more productions with the rank 3 and above.

The simplification step takes one template production $q$ of the form (1) and produces two template productions:

$$q_1 = (v_1, D_{inh\,1}, A_{set\,1}) \ldots (v_k, D_{inh\,k}, A_{set\,k}) \xrightarrow{w} (v'_1, D_{uni\,1}, A_{req\,1})\,(v'_{2..m}, D_{uni\,2..m}, \emptyset),$$

$$q_2 = (v'_{2..m}, D_{uni\,2..m}, \emptyset) \xrightarrow{1.0} (v'_2, D_{uni\,2}, A_{req\,2}) \ldots (v'_m, D_{uni\,m}, A_{req\,m}),$$

where $D_{uni\,2..m} = D_{uni\,2} \cup D_{uni\,3} \cup \ldots \cup D_{uni\,m}$, and $v'_{2..m}$ is a new non-terminal symbol.
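The simplification step and its repeated application can be sketched in Python as follows. This is a minimal illustration, assuming that a template production is stored as `(left, weight, right)` with `left` and `right` being lists of `(symbol, domains, affixes)` triples; the names `split_step` and `normalize` and the `_X1`, `_X2`, ... naming of new non-terminals are hypothetical.

```python
from itertools import count

_fresh = count(1)  # generator of indices for new non-terminal symbols


def split_step(q):
    """One simplification step: a production of rank m > 2 becomes
    q1 of rank 2 and q2 of rank m - 1."""
    left, weight, right = q
    assert len(right) > 2
    first, rest = right[0], right[1:]
    # D_uni 2..m is the union of the unified domains of the remaining symbols.
    d_rest = frozenset().union(*(d_uni for _, d_uni, _ in rest))
    new_symbol = f"_X{next(_fresh)}"  # hypothetical name for v'_{2..m}
    q1 = (left, weight, [first, (new_symbol, d_rest, frozenset())])
    q2 = ([(new_symbol, d_rest, frozenset())], 1.0, list(rest))
    return q1, q2


def normalize(productions):
    """Apply split_step until every production has rank at most 2."""
    work, done = list(productions), []
    while work:
        q = work.pop()
        if len(q[2]) > 2:
            q1, q2 = split_step(q)
            done.append(q1)
            work.append(q2)  # q2 may still have rank greater than 2
        else:
            done.append(q)
    return done


E = frozenset()
GNC = frozenset({"GENDER", "NUMBER", "CASE"})
GNCS = GNC | {"SEM"}
# The Ukrainian noun-phrase production from the text, with an illustrative weight.
ukrainian_np = ([("NP", GNCS, E)], 1.2,
                [("ADJ", GNC, E), ("NP", GNCS, E),
                 ("NP", E, frozenset({"a_GENITIVE"}))])
result = normalize([ukrainian_np])
```

The original weight stays on `q1` and `q2` receives weight 1.0, so the product of the weights along any derivation is unchanged by the split.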

Theorem 1: The grammar obtained from original grammar $G$ by the replacement of template production $q$ with template productions $q_1$ and $q_2$ produces the same language.

In order to prove this, it is sufficient to show that:

1) all regular productions of form (2), which conform to template production (1), can be split into two productions $p_1$ and $p_2$ that conform to template productions $q_1$ and $q_2$, respectively;

2) all productions that conform to template productions $q_1$ and $q_2$ define the same grammar as productions that conform to template production $q$.

1) The First Part of the Proof: Given that production $p$ conforms to template production $q$, it should be proven that there exists a split of $p$ into two productions $p_1$ and $p_2$ such that $p_1$ conforms to $q_1$ and $p_2$ conforms to $q_2$. This split can be found by the assignment

$$p_1 = (v_1, A_1) \ldots (v_k, A_k) \xrightarrow{w} (v'_1, A'_1)\,(v'_{2..m}, A'_{2..m})$$

and

$$p_2 = (v'_{2..m}, A_{2..m}) \xrightarrow{1.0} (v'_2, A'_2) \ldots (v'_m, A'_m),$$

where $A_{2..m} = A'_{2..m} = A_{uni}(p_2, q_2)$. In this case, production $p_2$ conforms to production $q_2$ because

$$A_{uni}(p,q) = \left( \left( A'_1 \cup \bar{A}(D_{uni\,1}) \right) \cap \left( A_{uni}(p_2,q_2) \cup \bar{A}(D_{uni\,2..m}) \right) \right) \cap A(D_{uni\,1..m}).$$

Therefore,

$$A_{uni}(p,q) \cap A(D_{uni\,2..m}) \subseteq A_{uni}(p_2,q_2),$$

which means that if $A_{uni}(p,q)$ satisfies requirement R1, then $A_{uni}(p_2,q_2)$ also satisfies it. Requirement R2 is satisfied because $A_{req\,2}, \ldots, A_{req\,m}$ and $A'_2, \ldots, A'_m$ are taken from productions $q$ and $p$, respectively, and $p$ conforms to $q$ by the assumption of the theorem. Requirement R3 is satisfied due to the fact that

$$A'_{2..m} = \emptyset \cup \left( A_{uni}(p_2,q_2) \cap A(D_{uni\,2..m}) \right) = A_{uni}(p_2,q_2).$$

2) The Second Part of the Proof: Given that $p_1$ conforms to $q_1$ and $p_2$ conforms to $q_2$, it can be proved that they can be composed into a single production that conforms to $q$.

First of all, it should be noted that symbol $v'_{2..m}$ is a new non-terminal symbol, and thus it cannot be found in any other production. Productions $p_1$ and $p_2$ can be applied sequentially only when $A_{2..m} = A'_{2..m}$. Due to the fact that $p_2$ conforms to $q_2$, requirement R3 ensures that

$$A_{2..m} = \emptyset \cup \left( A_{uni}(p_2,q_2) \cap A(D_{uni\,2..m}) \right) = A_{uni}(p_2,q_2).$$

So, $A_{uni}(p_1,q_1)$ can be calculated using the formula

$$A_{uni}(p_1,q_1) = \left( \left( A'_1 \cup \bar{A}(D_{uni\,1}) \right) \cap \left( A_{uni}(p_2,q_2) \cup \bar{A}(D_{uni\,2..m}) \right) \right) \cap A(D_{uni\,1..m}) = \left( A'_1 \cup \bar{A}(D_{uni\,1}) \right) \cap \left( \bigcap_{i=2}^{m} \left( A'_i \cup \bar{A}(D_{uni\,i}) \right) \right) \cap A(D_{uni\,1..m}) = A_{uni}(p,q).$$

Therefore, requirements R1 and R3 are satisfied for productions $p$ and $q$ because they are satisfied for $p_1$ and $q_1$; requirement R2 follows from the fact that $p_1$ and $p_2$ conform to $q_1$ and $q_2$, respectively. Thus, production $p$, which is obtained from $p_1$ and $p_2$, conforms to template production $q$. This concludes the proof of Theorem 1.

Algorithm for Parsing Sentences. The problem of sentence parsing is formulated as the problem of finding a sequence of productions that has the maximum weight and can be applied sequentially to some starting attributed symbol $(S, A_S)$ to produce a given sequence of terminals $t_1 t_2 \ldots t_n$. The weight of the sequence is calculated as the product of the weights of all productions it contains. If the right part of a production contains only one symbol, the weight of the production should not exceed 1 in order to avoid cyclic productions that can increase the weight of non-terminal symbols during the bottom-up parsing procedure.

The developed algorithm for parsing sentences is based mostly on the probabilistic CYK algorithm. The main difference is that symbols are compared not only by weight but also by the set of affixes. The algorithm uses the notion of a weighted attributed symbol, a 3-tuple $(w, v, A_v)$ that contains weight $w$, symbol $v$, and set of affixes $A_v \subseteq A(D)$.

We say that weighted attributed symbol $(w_1, v_1, A_1)$ dominates weighted attributed symbol $(w_2, v_2, A_2)$ if $w_1 \ge w_2$, $v_1 = v_2$, and $A_2 \subseteq A_1$.
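The dominance relation translates directly into code. The sketch below assumes that a weighted attributed symbol is a Python tuple `(weight, symbol, affixes)` with `affixes` stored as a `frozenset`; the helper `add_undominated` is a hypothetical name for the "add unless dominated, then prune" operation used by the parsing algorithm.

```python
def dominates(a, b):
    """True if a = (w1, v1, A1) dominates b = (w2, v2, A2):
    w1 >= w2, v1 == v2, and A2 is a subset of A1."""
    (w1, v1, a1), (w2, v2, a2) = a, b
    return w1 >= w2 and v1 == v2 and a2 <= a1


def add_undominated(cell, t):
    """Add t to the list `cell` unless some element dominates it;
    elements dominated by t are removed. Returns True if t was added."""
    if any(dominates(e, t) for e in cell):
        return False
    cell[:] = [e for e in cell if not dominates(t, e)]
    cell.append(t)
    return True


cell = []
add_undominated(cell, (0.5, "NP", frozenset({"a_SINGULAR"})))
# Heavier and with a larger affix set: replaces the first item.
add_undominated(cell, (0.8, "NP", frozenset({"a_SINGULAR", "a_FEMALE"})))
# Lighter and with a smaller affix set: dominated, so it is ignored.
added = add_undominated(cell, (0.3, "NP", frozenset({"a_FEMALE"})))
```

Keeping only undominated items bounds the size of each table cell, which is what the quantity $m_p$ in the complexity estimate measures.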

In the worst-case scenario, the computational complexity of the algorithm is $O(n^3 \cdot m_p^3 \cdot m_r)$, where $n$ is the length of the input string of terminals, $m_p$ is the maximum number of combinations of symbols and attributes that can produce the same string of terminals (this value can be treated as the ambiguity of the language being parsed), and $m_r$ is the maximum number of productions that have the same starting non-terminal symbol in the right part.

The parsing algorithm can be described by the following pseudocode:

Algorithm ParseSentence(s).

Input. String of terminals $s = t_1 t_2 \ldots t_n$.

Output. Sequence of productions that produce string $s$.

Let $P[1..n, 1..n] = \emptyset$ be an array, each element of which is a set of weighted attributed symbols.

Initialize $P[i, 1] = \{(1, t_i, \emptyset)\}$, $i = 1, 2, \ldots, n$.

for $j = 1, 2, \ldots, n$ do // $j$ is the length of the substring of terminals

  for $k = 1, 2, \ldots, n - j + 1$ do // $k$ is the start of the substring

    for $s = 1, 2, \ldots, j - 1$ do // $s$ is the split of the substring

      for all $(w_1, v_1, A_1) \in P[k, s]$ do

        for all productions of type $(v, D_{inh}, A_{set}) \xrightarrow{w} (v_1, D_{uni\,1}, A_{req\,1})(v_2, D_{uni\,2}, A_{req\,2})$ do

          if $(A_{req\,1} \subseteq A_1 \wedge (\forall D_i \in D_{uni\,1})\, A_1 \cap A(D_i) \neq \emptyset)$ then

            for all $(w_2, v_2, A_2) \in P[k + s, j - s]$ do

              if $(A_{req\,2} \subseteq A_2 \wedge (\forall D_i \in D_{uni\,2})\, A_2 \cap A(D_i) \neq \emptyset)$ and $((\forall D_i \in (D_{uni\,1} \cap D_{uni\,2}))\, A_1 \cap A_2 \cap A(D_i) \neq \emptyset)$ then

                let $t = (w \cdot w_1 \cdot w_2,\; v,\; A_{set} \cup ((A_1 \cup \bar{A}(D_{uni\,1})) \cap (A_2 \cup \bar{A}(D_{uni\,2})) \cap A(D_{inh})))$

                if $t$ is not dominated by any element of $P[k, j]$, then add $t$ to $P[k, j]$ and remove elements dominated by $t$

    Let list $L$ := all elements of $P[k, j]$

    for all $(w_1, v_1, A_1) \in L$ do // process productions of rank 1

      for all productions of type $(v, D_{inh}, A_{set}) \xrightarrow{w} (v_1, D_{uni}, A_{req})$ do

        if $(A_{req} \subseteq A_1 \wedge (\forall D_i \in D_{uni})\, A_1 \cap A(D_i) \neq \emptyset)$ then

          let $t = (w \cdot w_1,\; v,\; A_{set} \cup (A_1 \cap A(D_{inh})))$

          if $t$ is not dominated by any element of $L$, then append $t$ to $L$.

    Add elements of $L$ to $P[k, j]$.

If $P[1, n]$ does not contain any triple $(w, S, A_S)$, where $S$ is the starting symbol of the grammar, parsing is impossible. If it does, select the triple with the maximum weight $w$ among them and reconstruct all productions that are required to produce string $t_1 t_2 \ldots t_n$.

4 EXPERIMENTS

The algorithm for parsing sentences was implemented in the UkrParser open source software project (https://github.com/mdavydov/UkrParser). This project contains classes for morphology analysis and a sentence parser. The morphology for the Ukrainian language is implemented in the com.langproc.UkrainianISpellMorphology and com.langproc.UkrainianGrammarlyMorphology classes. The first class is based on the open source project iSpell-uk (https://sourceforge.net/projects/ispell-uk/) by Andriy Rysin, and the second is based on the Ukrainian morphology database kindly provided by Mariana Romanyshyn from Grammarly. The algorithm for parsing sentences is implemented in the com.langproc.APCFGParser class, and productions for parsing Ukrainian sentences are placed in the com.langproc.APCFGUkrainian class.
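For readers who want to experiment with the procedure from Section 3 without the Java code base, the parsing loop can be sketched in Python. This is a simplified illustration and not the code of com.langproc.APCFGParser: the toy domain, the tuple layout of productions, the 0-based table indices, and all function names are assumptions, and the sketch returns only the best weight instead of reconstructing the productions. Lexical rules are written as rank-1 productions over terminals, and the agreement test is applied to the unified affix set, which corresponds to requirement R1.

```python
from collections import defaultdict

DOMAINS = {"NUMBER": frozenset({"sg", "pl"})}  # hypothetical toy domain
ALL = frozenset().union(*DOMAINS.values())
EMPTY = frozenset()


def A(names):
    """A(D'): union of the affixes of the named domains."""
    return frozenset().union(*(DOMAINS[d] for d in names))


def dominates(a, b):
    return a[0] >= b[0] and a[1] == b[1] and b[2] <= a[2]


def add(cell, t):
    """Insert t unless it is dominated; drop the items that t dominates."""
    if not any(dominates(e, t) for e in cell):
        cell[:] = [e for e in cell if not dominates(t, e)] + [t]


def agrees(affixes, names):
    """Every named domain keeps at least one affix."""
    return all(affixes & DOMAINS[d] for d in names)


def parse(terminals, binary, unary, start):
    """Best weight of a `start` item covering all terminals, or None."""
    n = len(terminals)
    P = defaultdict(list)  # P[k, j]: items for the span of length j at start k
    for k, t in enumerate(terminals):
        P[k, 1].append((1.0, t, EMPTY))
    for j in range(1, n + 1):
        for k in range(n - j + 1):
            cell = P[k, j]
            for s in range(1, j):  # split point of the span
                for w1, v1, a1 in P[k, s]:
                    for w2, v2, a2 in P[k + s, j - s]:
                        for w, v, d_inh, a_set, (u1, d1, r1), (u2, d2, r2) in binary:
                            if (u1, u2) != (v1, v2) or not (r1 <= a1 and r2 <= a2):
                                continue
                            common = (a1 | (ALL - A(d1))) & (a2 | (ALL - A(d2)))
                            if agrees(common, d1 | d2):
                                add(cell, (w * w1 * w2, v, a_set | (common & A(d_inh))))
            i = 0
            while i < len(cell):  # rank-1 productions; the cell may grow
                w1, v1, a1 = cell[i]
                for w, v, d_inh, a_set, (u1, d1, r1) in unary:
                    if u1 == v1 and r1 <= a1 and agrees(a1, d1):
                        t = (w * w1, v, a_set | (a1 & A(d_inh)))
                        if not any(dominates(e, t) for e in cell):
                            cell.append(t)
                i += 1
    weights = [w for w, v, _ in P[0, n] if v == start]
    return max(weights) if weights else None


NUM = frozenset({"NUMBER"})
UNARY = [  # toy English lexicon written as rank-1 productions
    (1.0, "N", EMPTY, frozenset({"sg"}), ("cat", EMPTY, EMPTY)),
    (1.0, "N", EMPTY, frozenset({"pl"}), ("cats", EMPTY, EMPTY)),
    (1.0, "V", EMPTY, frozenset({"sg"}), ("sleeps", EMPTY, EMPTY)),
    (1.0, "V", EMPTY, frozenset({"pl"}), ("sleep", EMPTY, EMPTY)),
]
BINARY = [(1.0, "S", EMPTY, EMPTY, ("N", NUM, EMPTY), ("V", NUM, EMPTY))]
```

With this toy grammar, `parse(["cats", "sleep"], BINARY, UNARY, "S")` finds a sentence, while `parse(["cat", "sleep"], BINARY, UNARY, "S")` returns `None` because the noun and the verb do not agree in number.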

The computational efficiency of the developed algorithm was tested on a database of 500 sentences from the "Fata Morgana" story by Michael Kotsyubynsky. The average sentence parsing time as a function of the sentence length is depicted in Fig. 1. These results were obtained on a computer with a 2.4 GHz Intel Core i5 CPU. The growth of the parsing time turned out to be almost linear, notwithstanding the worst-case cubic estimate provided in Section 3.

5 RESULTS

The developed approach for mixed semantic and syntactic sentence parsing was used for parsing and translation of the annotated Ukrainian Sign language and the Ukrainian Spoken language [8]. There, the translation based on the parser that used productions generated from ontologies outperformed the parser that used only syntactic productions by 25 percentage points (90% of correct translations compared to 65% obtained when using only syntactical productions).

An example of parsing sentence "Моя донька ходить у дитячий садок" (My daughter attends nursery school) by means of the developed method is shown in Figure 2.

In this example the following rules were added from subject area "Education":

NG(персона)[=] -> NG(донька)[=];

VP(ходити-відвідувати)[=] -> VP(ходити)[=];

NG(дошкільний-заклад)[=] 1.1-> adj(дитячий)[=] AN(садок)[=];

DS(ходити-відвідувати)[=] 1.1-> NP(персона)[=] VP(ходити-відвідувати)[=] V DNP(дошкільний-заклад)[c4];

where NG stands for Noun Group, NP - Noun Phrase, VP - Verb Phrase, adj - adjective, AN - annotated noun, DS - Declarative Sentence, V - preposition of place, "=" means default grammar attributes for the current phrase, and c4 stands for "Casus 4" (the accusative case). In the rules, "персона" means "person", "донька" means "daughter", "ходити" means "to go", "відвідувати" means "to attend", "дошкільний заклад" means "preschool institution", and "дитячий садок" means "nursery school". The weights of ontology-driven rules are intentionally made greater than 1 to supersede default syntactical rules.
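The effect of weights greater than 1 can be shown with a small Python sketch. The two candidate derivations below are hypothetical; only the value 1.1 is taken from the rules above. Because the weight of a derivation is the product of the weights of its productions, a derivation that uses ontology-driven rules outranks a purely syntactical one built from default weights.

```python
from math import prod


def derivation_weight(weights):
    """Weight of a derivation: the product of its production weights."""
    return prod(weights)


# Hypothetical candidates for the same sentence.
candidates = {
    "syntactic only": [1.0, 1.0, 1.0],
    "ontology-driven": [1.0, 1.1, 1.1],
}
best = max(candidates, key=lambda name: derivation_weight(candidates[name]))
```

Here `best` is `"ontology-driven"`, since its weight is slightly above 1.2 while the syntactic derivation has weight 1.0.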

6 DISCUSSION

The experimental results on the database of Ukrainian sentences show a significant speed-up in comparison with well-known context-free grammar parsers. This result was achieved by using a compact form of production with syntactical and semantical attributes. In comparison with Stanford Parser, the average sentence parsing time was about 10 times shorter.

CONCLUSIONS

This article demonstrated an efficient algorithm for parsing sentences by means of weighted affix context-free grammar with semantical attributes. The developed algorithm is based on the normal form of the "template productions" that were introduced. The algorithm has worst-case cubic complexity in the sentence length, but its running time turned out to be almost linear on the real test data.

The obtained sentence parsing trees are more semantically rich than the parsing trees obtained by means of regular syntactic parser. Additional computational cost for that is not very high because only hypernyms of words that are present in the sentence and corresponding expressions are included into the grammar.

Future research will be focused on optimal weight assignment and automatic extraction of productions that are specific to a particular subject area.

ACKNOWLEDGEMENTS

The work was supported by the state budget scientific research project of Lviv Polytechnic National University and Ministry of Education and Science of Ukraine "Mathematical and algorithmic modeling of translation processes from gestures into text for specialized computer systems" (state registration number 0111U001222).

Figure 1 - Average sentence parsing time in milliseconds as a function of the sentence length (horizontal axis: number of words in sentences)

Figure 2 - The result of parsing of the Ukrainian sentence "Моя донька ходить у дитячий садок" (My daughter attends nursery school)

REFERENCES

1. ConceptOnto: An upper ontology based on ConceptNet / [E. Najmi, K. Hashmi, Z. Malik et al.] // Computer Systems and Applications (AICCSA), 2014 IEEE/ACS 11th International Conference on Computer Systems and Applications (AICCSA), November 10-13, Doha, Qatar. - 2014. - P. 366-372. DOI: 10.1109/AICCSA.2014.7073222.

2. Eddy S. R. RNA sequence analysis using covariance models / S. R. Eddy, R. Durbin // Nucleic Acids Research. - 1994. - Vol. 22, № 11. - P. 2079-2088.

3. Koster C. H. A. Affix Grammars for natural languages / C. H. A. Koster // In: Attribute Grammars, Applications and Systems, International Summer School SAGA, Lecture Notes in Computer Science, Prague, Czechoslovakia. - 1991. - Vol. 545. - P. 469-484.

4. Smith N. A. Weighted and Probabilistic Context-Free Grammars Are Equally Expressive / N. A. Smith, M. Johnson // Computational Linguistics. - 2007. - Vol. 33, № 4. - P. 477-491. DOI: 10.1162/coli.2007.

5. Chomsky N. Three models for the description of language / N. Chomsky // IRE Transactions on Information Theory. - 1956. - № 2 (3). - P. 113-124. DOI: 10.1109/TIT.1956.1056813.

6. Oostdijk N. An Extended Affix Grammar for the English Noun Phrase / N. Oostdijk // In: Jan Aarts and Wim Meijs (eds), Corpus Linguistics. Recent Developments in the Use of Computer Corpora in English Language Research, Amsterdam: Rodopi. - 1984.

7. Smith T. C. Probabilistic Unification Grammars / T. C. Smith, J. G. Cleary // In: Australasian Natural Language Processing Summer Workshop. - 1997. - P. 25-32.

8. Lozynska O. Information technology for Ukrainian Sign Language translation based on ontologies / O. Lozynska, M. Davydov // ECONTECHMOD: an international quarterly journal on economics of technology and modelling processes. - 2015. - Vol. 04, No. 2. - P. 13-18.

Article was submitted 22.09.2017. After revision 23.10.2017.

Давидов М. В1, Лозинська О. В.2, Паачник В. В.3

'Канд. техн. наук, доцент кафедри «1нформацшш системи та мережЬ», Нацюнальний ушверситет «Львiвська полггехшка», Львiв, Украша

2Канд. техн. наук, асистент кафедри «1нформацшш системи та мережЬ», Нацюнальний ушверситет «Львiвська полггехшка», Львiв, Украша

3Д-р техн. наук, професор, професор кафедри «1нформацшш системи та мережЬ», Нацюнальний ушверситет «Львiвська полггехн-жа», Львiв, Украша

ЕФЕКТИВНИЙ АЛГОРИТМ ДЛЯ СИНТАКСИЧНОГО АНАЛ1ЗУ РЕЧЕНЬ З ВИКОРИСТАННЯМ СЕМАНТИЧНО ПО-ЗНАЧЕНИХ ЗВАЖЕНИХ АФ1КСНИХ КОНТЕКСТНО-В1ЛЬНИХ ГРАМАТИК

Актуальтсть. Розглядаеться задача шдвищення ефективносп афжсних граматик над скшченною граткою (AGFL). AGFL - це контекстно-вшьна граматика з гнучкими i компактними формами для розбору текспв на природних мовах.

1

7

2

8

3

Мета роботи. Метою роботи е шдвищення ефективност розбору речень за допомогою модифжаци AGFL, яка додае семантичш атрибути в продукци граматики i вводить нову форму продукцш шд назвою «шаблонна продукщя». Ця модифжащя допомагае зменшити кшьюсть продукцiй, необхiдних для опису мови, i дозволяе зменшити обчислювальну складнiсть алгоритму синтаксичного аналiзу.

Method. A mathematical model of the template production is developed, and a theorem is proved stating that a normal form of template productions exists and that the normalization procedure yields an equivalent grammar. The normal form is used to increase the efficiency of parsing Ukrainian sentences. Template productions make it possible to describe ontology-based rules in a short and computationally efficient form. The normal form of template productions is studied, and an effective algorithm for parsing sentences is proposed. The worst-case computational complexity of the proposed algorithm is $O(n^3 \cdot m_p^3 \cdot m_r)$, where $n$ is the length of the input string of terminals, $m_p$ is the maximum number of combinations of symbols and attributes that can produce the same string of terminals, and $m_r$ is the maximum number of productions that have the same starting non-terminal symbol in the right part. Parsing time turned out to be an almost linear function of the number of words in a sentence when parsing the test database of sentences from Ukrainian fiction.

Results. The developed method has been implemented in the UkrParser software, which is available as open source on GitHub.

Conclusions. The developed algorithm was tested on a database of Ukrainian sentences and demonstrated a parsing speed ten times higher than that of the Stanford Parser. Future research may focus on developing grammatically augmented ontologies for a wider set of subject domains, which should improve the results of semantic analysis of sentences.

Keywords: weighted affix context-free grammar, semantic parsing, ontology-driven sentence parsing, template production.
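To make the worst-case bound quoted in the abstract more concrete, the following is a minimal sketch of a generic weighted CYK-style chart parser. It is not the authors' algorithm and not UkrParser code: it ignores attributes and template productions, assumes binary productions with additive weights, and all names (`weighted_cyk`, `lexicon`, `rules`, `by_left`) are hypothetical. It only illustrates where a cubic dependence on the sentence length and a factor similar to $m_r$ (productions sharing the same first right-hand symbol) can come from.

```python
from collections import defaultdict


def weighted_cyk(words, lexicon, rules, start="S"):
    """Return the best (largest) total weight of a parse rooted at `start`,
    or None if the sentence cannot be derived.

    lexicon: dict mapping a word to a list of (nonterminal, weight)
    rules:   list of binary productions (head, left, right, weight)
    """
    n = len(words)
    # best[(i, j)][A] is the best weight of deriving words[i:j] from A
    best = defaultdict(dict)
    for i, word in enumerate(words):
        cell = best[(i, i + 1)]
        for symbol, weight in lexicon.get(word, []):
            cell[symbol] = max(weight, cell.get(symbol, float("-inf")))

    # Index productions by the first symbol of the right part; the longest
    # list in this index plays a role similar to m_r in the bound.
    by_left = defaultdict(list)
    for head, left, right, weight in rules:
        by_left[left].append((head, right, weight))

    # Three nested loops over span length, start and split point: O(n^3)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = best[(i, j)]
            for k in range(i + 1, j):
                left_cell = best.get((i, k), {})
                right_cell = best.get((k, j), {})
                for left, left_weight in left_cell.items():
                    for head, right, rule_weight in by_left.get(left, []):
                        if right in right_cell:
                            total = left_weight + right_cell[right] + rule_weight
                            if total > cell.get(head, float("-inf")):
                                cell[head] = total
    return best.get((0, n), {}).get(start)


# Toy grammar (an assumption for illustration only)
lexicon = {"she": [("NP", 0.0)], "reads": [("V", 0.0)], "books": [("NP", 0.0)]}
rules = [("S", "NP", "VP", 1.0), ("VP", "V", "NP", 2.0)]
print(weighted_cyk(["she", "reads", "books"], lexicon, rules))
```

In this sketch each chart cell stores one entry per nonterminal; in an attributed grammar a cell would instead hold up to $m_p$ symbol-attribute combinations, which is where the additional factor in the stated bound would enter.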

Davydov M. V.1, Lozynska O. V.2, Pasichnyk V. V.3

1PhD, Associate Professor of the Information Systems and Networks Department, Lviv Polytechnic National University, Lviv, Ukraine

2PhD, Assistant of the Information Systems and Networks Department, Lviv Polytechnic National University, Lviv, Ukraine

3Dr. Sc., Professor, Professor of the Information Systems and Networks Department, Lviv Polytechnic National University, Lviv, Ukraine

EFFECTIVE ALGORITHM FOR PARSING SENTENCES USING SEMANTICALLY ATTRIBUTED WEIGHTED AFFIX CONTEXT-FREE GRAMMARS

Context. The problem of increasing the efficiency of affix grammars over a finite lattice (AGFL) is considered. AGFL is a context-free grammar with a flexible and compact notation for productions, used for parsing natural-language texts.

Objective. The goal of the work is to increase the efficiency of sentence parsing by means of a modification of AGFL that adds semantic attributes to grammar productions and introduces a new form of production called the «template production». This modification reduces the number of productions needed to describe a language and lowers the computational complexity of the parsing algorithm.

Method. A mathematical model of the template production is developed, and a theorem is proved stating that a normal form of template productions exists and that the normalization procedure yields an equivalent grammar. The normal form is used to increase the efficiency of parsing Ukrainian sentences. Template productions make it possible to describe ontology-based rules in a concise and computationally efficient form. The normal form of template productions is studied, and an effective algorithm for parsing sentences is proposed. The worst-case computational complexity of the proposed algorithm is $O(n^3 \cdot m_p^3 \cdot m_r)$, where $n$ is the length of the input string of terminals, $m_p$ is the maximum number of combinations of symbols and attributes that can produce the same string of terminals, and $m_r$ is the maximum number of productions that have the same starting non-terminal symbol in the right part. Parsing time turned out to be an almost linear function of the number of words in a sentence when parsing the test database of sentences from Ukrainian fiction.

Results. The developed method has been implemented in the UkrParser software, which is available as open source on GitHub.

Conclusions. The developed algorithm was tested on a database of Ukrainian sentences and demonstrated a parsing speed ten times higher than that of the Stanford Parser. Future research may focus on developing grammatically augmented ontologies for a wider set of subject domains, which should improve the results of semantic analysis of sentences.

Keywords: weighted context-free affix grammar, semantic parsing of sentences, ontology-based sentence parsing, template production.

REFERENCES

1. Najmi E., Hashmi K., Malik Z., Rezgui A., Khan H. U. ConceptOnto: An upper ontology based on ConceptNet, 2014 IEEE/ACS 11th International Conference on Computer Systems and Applications (AICCSA), November 10-13, Doha, Qatar, 2014, pp. 366-372. DOI: 10.1109/AICCSA.2014.7073222.

2. Eddy S. R., Durbin R. RNA sequence analysis using covariance models, Nucleic Acids Research, 1994, Vol. 22, No. 11, pp. 2079-2088.

3. Koster C. H. A. Affix Grammars for natural languages, In: Attribute Grammars, Applications and Systems, International Summer School SAGA, Lecture Notes in Computer Science, Prague, Czechoslovakia, 1991, Vol. 545, pp. 469-484.

4. Smith N. A., Johnson M. Weighted and Probabilistic Context-Free Grammars Are Equally Expressive, Computational Linguistics, 2007, Vol. 33, No. 4, pp. 477-491. DOI: 10.1162/coli.2007.

5. Chomsky N. Three models for the description of language, IRE Transactions on Information Theory, No. 2 (3), 1956, pp. 113-124. DOI: 10.1109/TIT.1956.1056813.

6. Oostdijk N. An Extended Affix Grammar for the English Noun Phrase, In: Jan Aarts and Wim Meijs (eds), Corpus Linguistics. Recent Developments in the Use of Computer Corpora in English Language Research, Amsterdam, Rodopi, 1984.

7. Smith T. C., Cleary J. G. Probabilistic Unification Grammars, In Australasian Natural Language Processing Summer Workshop, 1997, pp. 25-32.

8. Lozynska O., Davydov M. Information technology for Ukrainian Sign Language translation based on ontologies, ECONTECHMOD: an international quarterly journal on economics of technology and modelling processes, 2015, Vol. 04, No. 2, pp. 13-18.
