Scientific article on the topic 'SYNTACTIC MODELS FOR PARSING OF UZBEK CORPUS'

SYNTACTIC MODELS FOR PARSING OF UZBEK CORPUS: text of a scientific article in the specialty "Linguistics and Literary Studies"

CC BY
Keywords
syntactic markup / relationship / grammatical classification / word combination / nominal and verbal / adjoinment / government / agreement

Abstract of the scientific article on linguistics and literary studies. Author: Abdurakhmonova Nilufar Zaynobiddin

The article deals with the syntactic markup of an Uzbek corpus. In syntactic tagging, the linguistic models of word combinations divide into two main types: nominal and verbal. The words in each model are connected in one of three syntactic ways: agreement, government, or adjoinment. This paper analyzes syntactic relations and models in order to create a metalanguage for NLP. Syntactic parsing is a crucial stage among the different types of parsing methods in NLP. It helps to identify the type of a sentence and the word combinations that represent the grammatical relations between words. Although languages differ in their grammatical features, almost all languages follow common linguistic rules. Uzbek is an agglutinative language with free constituent order in its syntax. Our investigations show that the morphological aspect of word forms plays an essential role in identifying and composing syntactic relations in Uzbek. Morphological and lexical information can also solve some of the problems connected with syntactic parsing. The article presents some of the main points concerning the stages of parsing in CoNLL-U format, based on Uzbek corpus analysis.




PHILOLOGICAL SCIENCES

SYNTACTIC MODELS FOR PARSING OF UZBEK CORPUS Abdurakhmonova N.Z. Email: Abdurakhmonova6103@scientifictext.ru

Abdurakhmonova Nilufar Zaynobiddin kizi - PhD, Associate Professor, DEPARTMENT OF APPLIED LINGUISTICS AND LINGUODIDACTICS, TASHKENT STATE UNIVERSITY OF UZBEK LANGUAGE AND LITERATURE NAMED AFTER ALISHER NAVOI, TASHKENT, REPUBLIC OF UZBEKISTAN



UDC 81'33

Introduction

Syntax is one of the linguistic properties that natural language processing has to model. It has two main units: the word combination and the sentence. Syntactic parsing is a crucial technology for many applications of natural language processing: machine translation, question answering, information retrieval, sentiment analysis, and corpus linguistics. Consequently, building the structure of the text and of its word combinations plays an essential role in identifying the position of the parts of speech. Each language has its own linguistic peculiarities according to its typological class, for example inflectional or agglutinative, and its own ontological classification of parts of speech.

A word combination is a combination of words. Words denote things and substances, qualities, attributes, and actions. These are interconnected within a word combination and cannot be treated independently of each other. The syntax of the word combination concerns the capacity of words to adjoin one another, described in terms of connection types and schemata (forms); the components and forms of a word combination are closely associated with morphology. The word combination is studied as a part of the sentence, and postfixes are considered a morphological-syntactic category that joins its components to each other.

A word combination comprises semantic and grammatical relations between at least two words. One component acts as the head (governor) and the other as the dependent. The components interact on the basis of semantic and syntactic rules. A word combination serves as a nominative means of the language: through its headword it names things, substances, qualities, attributes, and actions in their interconnection.

Although grammatical features can identify the functions of words in a sentence, agglutinative languages have no rigid word order, because the parts of speech can be placed freely. In Uzbek, for example, the components can change position, and the focused part comes immediately before the predicate:

Men bugun avtobusda universitetga boraman.

Men bugun universitetga avtobusda boraman.

Men universitetga avtobusda bugun boraman.

Predicting the role of the parts of speech in Turkic languages seems difficult because of their free placement in the text. However, each language has its own formal model that can be used for parsing, and this parsing includes several linguistic stages.
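The point about free word order can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the suffix-to-role table, the word lists, and the function name `guess_roles` are hypothetical and are not part of the model described in this article. The sketch assigns a role to each word from its ending alone and shows that the three orderings above yield the same assignment.

```python
# Toy illustration: roles are guessed from word endings, not from position.
# The tables below are assumptions made for this sketch only.
PRONOUNS = {"men"}                  # "I"
TIME_ADVERBS = {"bugun"}            # "today"
SUFFIX_ROLES = [
    ("ga", "goal (dative -ga)"),
    ("da", "means or location (locative -da)"),
]


def guess_roles(sentence):
    """Return a {word: role} mapping that ignores word order,
    except that the final word is taken to be the predicate."""
    words = sentence.rstrip(".").lower().split()
    roles = {}
    for index, word in enumerate(words):
        if index == len(words) - 1:
            roles[word] = "predicate"
        elif word in PRONOUNS:
            roles[word] = "subject"
        elif word in TIME_ADVERBS:
            roles[word] = "time"
        else:
            for suffix, role in SUFFIX_ROLES:
                if word.endswith(suffix):
                    roles[word] = role
                    break
            else:
                roles[word] = "unknown"
    return roles


SENTENCES = [
    "Men bugun avtobusda universitetga boraman.",
    "Men bugun universitetga avtobusda boraman.",
    "Men universitetga avtobusda bugun boraman.",
]

analyses = [guess_roles(s) for s in SENTENCES]
for sentence, analysis in zip(SENTENCES, analyses):
    print(sentence, analysis)
```

Because the roles come from the morphology, the three dictionaries compare equal even though the words appear in different positions. A real system would need a morphological analyzer instead of plain suffix matching.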

II. Related work

Syntactic models in agglutinative languages are closely bound up with morphological features, and the works [2, 3, 4, 5] focus on this connection. Three areas of work are essential for metadata to perform its functions: semantics to define the meaning of data, syntax to specify the data binding structure, and vocabulary to control the language [1, 2002].

Agreement is based on formal correspondence in person and number between the governor and the subordinate member of a syntactic group. In Uzbek it takes a distinctive form in the model [NOUN+NOUN+Case] [Noun+POSS]: talabalarning hammasi. In addition, the predicate agrees with the subject in person and number: Men o'quvchiman (Uzbek).

Government is expressed through case markers and particles, according to whether the headword is nominal or verbal: ukam uchun kitob, estalikka sovg'a, bog'dagi gullar (Uzbek).

Adjoinment joins words through their common grammatical function and meaning, without any change in their morphological forms: qalin o'rmon, oq paxta, buyuk tarix.
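The three connection types can be sketched as a simple decision rule in Python. This is a minimal illustration and not the author's implementation: the suffix lists, the single postposition, and the function name `classify_relation` are assumptions, and plain suffix matching will misfire on words that merely happen to end in the same letters. Only the genitive-possessive pattern of agreement is covered; subject-predicate agreement is left out.

```python
# Hypothetical rule of thumb for a two- or three-word phrase whose first word
# is the dependent. All tables are assumptions made for this sketch.
GENITIVE_SUFFIX = "ning"
CASE_SUFFIXES = ("ga", "ka", "dan", "da", "ni")  # dative (two variants), ablative, locative, accusative
POSTPOSITIONS = {"uchun"}


def classify_relation(phrase):
    """Guess 'agreement', 'government' or 'adjoinment' from the dependent's form."""
    words = phrase.lower().split()
    dependent = words[0]
    if dependent.endswith(GENITIVE_SUFFIX):
        return "agreement"        # genitive dependent, possessive-marked head
    if len(words) > 2 and words[1] in POSTPOSITIONS:
        return "government"       # dependent linked through a postposition
    if dependent.endswith(CASE_SUFFIXES):
        return "government"       # dependent carries a case marker
    return "adjoinment"           # no morphological change on the dependent


for example in ["talabalarning hammasi", "ukam uchun kitob",
                "estalikka sovg'a", "qalin o'rmon", "oq paxta"]:
    print(example, "->", classify_relation(example))
```

The order of the checks matters only in that the genitive test comes first. The examples are taken from the definitions above.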

Syntactic relations in Uzbek word combinations are of two types: nominal and verbal. In a nominal word combination, a nominal (adjective, noun, numeral, or pronoun) is the governor. In our work we used universal POS tagging for both morphological and syntactic analysis (Table 1).

Table 1. Universal POS for Uzbek corpus

| Tag | Name (English) | Uzbek example |
|---|---|---|
| WC | Word combination | ukam uchun sotib olmoq |
| COLC | Collocation | o'z yog'iga qovurilmoq |
| FP/FCOLC | Free phrase / Free collocation | xat yozmoq, kuchli iroda, ukamning kitobi |
| NP | Noun Phrase | bolalarning hammasi, intizomda birinchi, xushbo'y hid |
| NA | Noun Adjoinment | ona vatan, bebaho sovg'a |
| NG | Noun Government | ukam uchun sovg'a, senga mukofot |
| NCS | Noun Collateral subordination | ukamning xati |
| VP | Verb Phrase | baland uchmoq, kulib gapirmoq |
| AGRM | Agreement | u o'quvchi, men keldim |
| SLP | Singular personal pronouns | Men talabaman |
| PPL | Plural personal pronouns | Ular talabalar |

To study the probability of word combinations, it is useful to model the parts of speech and distribute them into groups of syntactic relations. If we take the classes and subclasses of words as an example, we can see that Uzbek has several types of word-combination models:

Nominal adjoinment:

1. Noun+Noun=> temir uskuna

2. ADJ+Noun=> qulay imkoniyat

3. P+Noun=>hamma ishtirokchilar

4. Num.+Noun=> birinchi kun

5. Gerund+Noun=> o'qiyotgan qiz

6. Infinitive+ Noun=>nishonlash kuni

7. ADV+Noun=> sekin harakat

8. (Noun+dagi)+ Noun=> devordagi rasm

9. (Infinitive+dagi)+ Noun=> ishlashdagi g'ayrat

10. (ADV+dagi)+ Noun=> yuqoridagi qavat

11. |P+|ADV +Gerund+Noun=> (kimgadir) (sekin) o'qib berayotgan qiz

12. Noun|P{ni, ga, da, dan}+Gerund+Noun=> maktabga ketayotgan qiz

13. ADJ+Gerund|Past participle+Noun=> yaxshi o'qigan bola

14. ADV+Gerund+Noun=>tez kelgan lahza

15. (Noun+day|dek)+ADJ=> oyday oppoq

16. (Noun+dagi)+ADJ=> sinfdagi a'lochi

17. ADJ+Num.=> mo'jizaviy yetti

18. (Noun+dagi)+Num.=> rasmdagi bir

19. Noun+Infinitive=>kitob o'qish

20. ADJ+Infinitive=> qulay joylashish

21. ADV+Infinitive=> tez yeyish

Verbal adjoinment:

1. ADJ+Verb=> yaxshi o'qimoq

2. ADV+Verb=> astoydil o'qimoq

3. Gerund+Verb=> kulib gapirmoq

Nominal government:

1. Noun+ dan+Noun=> Andijondan kelish

2. Noun+dan ham|dan ko'ra+ADJ+roq=> onadan mehribon

3. Noun+dan+Infinitive=> ustozdan so'rash

4. Gerund+{Noun}dan+Infinitive=> bilgandan so'rash

5. Gerund+dan+ADJ=> ko'rgandan gumon <=> ADJ+{Prep.}+Gerund=> doubtful of seeing

6. Noun | P + dan + Num.=>hammadan birinchi

7. Noun | P + dan +ADJ=>hammadan ustun

8. ADJ+lar+dan+Num.=> a'lochilardan ikkitasi

9. ADV + dan+ADV=> kechagidan erta

10. ADV + dan+Infinitive=>ko'pdan bilish

11. Num.+dan+Num.=>yuztadan bittasi

12. Noun+ga+Noun=> vatanga muhabbat

13. P+ga+Noun=> hammaga do'st

14. Gerund+ga+Noun=>o'qiyotganga omad

15. Infinitive+ga+Noun=>o'qishga mehr

16. Infinitive|Noun+ga+Infinitive=>o'qishga intilish

17. Noun+ga+ADV=>bayramga yaqin

18. Noun+da+Noun=>yozuvda xato

19. Noun+da+Num.=>tartibda birinchi

20. Noun|P+da+ADJ=>menda ko'p

21. ADJ+ni+Infinitive=> qahramonni eslash

22. Noun+ni+Infinitive=> farzandni sog'inish

Verbal government:

1. Noun+ga+Verb=>maktabga bormoq

2. Noun+ga+Infinitive=>daftarga yozmoq

3. Noun|P+dan+Verb=>universitetdan qaytmoq

4. Noun|Pronoun +ni+Verb=>hikoyani o'qimoq

5. Noun+ni+Gerund=> ishni bajarib

6. Noun+da+Verb=>maktabda o'qimoq

7. Noun+da+Gerund=>osmonda uchib kelayotgan
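The models listed above can be treated as patterns over part-of-speech tags and case markers. The following Python sketch encodes a handful of them and looks up the group of a hand-tagged phrase. It is an illustration only: the `Slot` type, the `MODELS` table, and `match_model` are hypothetical names, only a small subset of the models is included, and the input is assumed to be already segmented into word, POS tag, and case marker.

```python
from typing import NamedTuple, Optional


class Slot(NamedTuple):
    pos: str                      # part-of-speech tag of the word
    suffix: Optional[str] = None  # case marker carried by the word, if any


# A small, hand-picked subset of the models above (assumption of this sketch).
MODELS = {
    "nominal adjoinment": [
        (Slot("NOUN"), Slot("NOUN")),          # temir uskuna
        (Slot("ADJ"), Slot("NOUN")),           # qulay imkoniyat
        (Slot("ADV"), Slot("NOUN")),           # sekin harakat
    ],
    "verbal adjoinment": [
        (Slot("ADJ"), Slot("VERB")),           # yaxshi o'qimoq
        (Slot("ADV"), Slot("VERB")),           # astoydil o'qimoq
    ],
    "nominal government": [
        (Slot("NOUN", "ga"), Slot("NOUN")),    # vatanga muhabbat
        (Slot("NOUN", "da"), Slot("NOUN")),    # yozuvda xato
    ],
    "verbal government": [
        (Slot("NOUN", "ga"), Slot("VERB")),    # maktabga bormoq
        (Slot("NOUN", "dan"), Slot("VERB")),   # universitetdan qaytmoq
        (Slot("NOUN", "ni"), Slot("VERB")),    # hikoyani o'qimoq
    ],
}


def match_model(tokens):
    """tokens: list of (word, pos, case_suffix_or_None). Returns the group name or None."""
    shape = tuple(Slot(pos, suffix) for _word, pos, suffix in tokens)
    for group, patterns in MODELS.items():
        if shape in patterns:
            return group
    return None


print(match_model([("qulay", "ADJ", None), ("imkoniyat", "NOUN", None)]))
print(match_model([("maktabga", "NOUN", "ga"), ("bormoq", "VERB", None)]))
```

Keeping the case marker as an explicit field avoids guessing it from the spelling, so a noun without a marker and a noun with `ga` never match the same slot.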

Taking these syntactic structures of word combinations into consideration, we analyzed sentences through their parts of speech.

METHODOLOGY OF PARSING

Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages.

The corpus consists of a hand-built selection of Uzbek fiction, annotated with metadata by genre. Here grammatical categories are crucial for representing the features of the parts of speech.


We applied a Turkish model to analyze the texts in CoNLL-U format. There are sharp distinctions between Turkish and Uzbek structures, but thanks to human correction the tagging of grammatical features improved. An example sentence in CoNLL-U format:

# newpar
# sent_id = 266
# text = Shoir yigitga dil-dildan achinarkan, uni ilk bor uchratgan paytini esladi.
1	Shoir	Shoir	NOUN	Noun	Case=Nom|Number=Sing|Person=3	2	nmod	_	SpacesAfter=\r\n
2	yigitga	yigitga	NOUN	Noun	Case=Nom|Number=Sing|Person=3	3	nmod	_	SpacesAfter=\r\n
3	dil	dil	NOUN	Noun	Case=Nom|Number=Sing|Person=3	13	nsubj	_	SpaceAfter=No
4	-	-	PUNCT	Punc	_	13	punct	_	SpaceAfter=No
5	dildan	dil	NOUN	Noun	Case=Abl|Number=Sing|Person=3	6	obl	_	SpacesAfter=\r\n
6	achinarkan	achin	VERB	Verb	Aspect=Perf|Mood=Ind|Polarity=Pos|Tense=Pres|VerbForm=Part	13	acl	_	SpacesAfter=\r\n
7	,	,	PUNCT	Punc	_	13	punct	_	SpacesAfter=\r\n
8	uni	u	NOUN	Noun	Case=Acc|Number=Sing|Person=3	11	obj	_	SpacesAfter=\r\n
9	ilk	ilk	ADJ	Adj	_	10	amod	_	SpacesAfter=\r\n
10	bor	bor	NOUN	Noun	Case=Nom|Number=Sing|Person=3	11	obl	_	SpacesAfter=\r\n
11	uchratgan	uchrat	VERB	Verb	Aspect=Perf|Mood=Ind|Polarity=Pos|Tense=Pres|VerbForm=Part	12	acl	_	SpacesAfter=\r\n
12	paytini	payt	NOUN	Noun	Case=Acc|Number=Sing|Number[psor]=Sing|Person=3|Person[psor]=3	13	obj	_	SpacesAfter=\r\n
13	esladi	esla	VERB	Verb	Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Past	0	root	_	SpacesAfter=\r\n
14	.	.	PUNCT	Punc	_	13	punct	_	SpacesAfter=\r\n\r\n
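An annotation of this kind can be read with a few lines of Python. The sketch below is a minimal reader written for illustration, not the tool used for the corpus: `Token`, `parse_conllu_sentence`, `find_root`, and `dependents` are hypothetical names. It assumes ten columns per token line, no multiword-token ranges, and no spaces inside word forms. The sample is an excerpt (tokens 11 to 14) of the sentence above, chosen because all of its heads fall inside the excerpt.

```python
from typing import NamedTuple

# Real CoNLL-U files separate the ten columns with tabs; single spaces are
# used in this excerpt for readability, and split() below accepts both.
SAMPLE = r"""
# sent_id = 266
11 uchratgan uchrat VERB Verb Aspect=Perf|Mood=Ind|Polarity=Pos|Tense=Pres|VerbForm=Part 12 acl _ SpacesAfter=\r\n
12 paytini payt NOUN Noun Case=Acc|Number=Sing|Number[psor]=Sing|Person=3|Person[psor]=3 13 obj _ SpacesAfter=\r\n
13 esladi esla VERB Verb Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Past 0 root _ SpacesAfter=\r\n
14 . . PUNCT Punc _ 13 punct _ SpacesAfter=\r\n\r\n
"""


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    upos: str
    feats: dict
    head: int
    deprel: str


def parse_conllu_sentence(block):
    """Parse the token lines of one sentence; comment and blank lines are skipped."""
    tokens = []
    for line in block.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        columns = line.split()
        if len(columns) != 10:
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        ident, form, lemma, upos, _xpos, feats, head, deprel, _deps, _misc = columns
        feat_dict = {} if feats == "_" else dict(
            pair.split("=", 1) for pair in feats.split("|"))
        tokens.append(Token(int(ident), form, lemma, upos, feat_dict, int(head), deprel))
    return tokens


def find_root(tokens):
    """The root is the token whose HEAD column is 0."""
    return next(token for token in tokens if token.head == 0)


def dependents(tokens, head_id):
    """All tokens directly attached to the token with the given ID."""
    return [token for token in tokens if token.head == head_id]


tokens = parse_conllu_sentence(SAMPLE)
root = find_root(tokens)
print(root.form, [t.form for t in dependents(tokens, root.id)])
```

Storing the features as a dictionary makes it easy to query a single category, for example `tokens[1].feats["Case"]` for the noun paytini.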

Conclusion

Universal Dependencies is a productive tool for analyzing the syntactic structures of texts in related languages. Given the importance of syntactic parsing in corpus analysis, it offers a good opportunity to model a number of syntactic structures of the text. One of our conclusions is that manual improvement of the grammatical features of each sentence in the corpus can support disambiguation through the morphological component of the parts of speech rather than through grammar.
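One way to support such manual improvement is to flag tokens whose surface form and Case feature appear to disagree, so that a human annotator can look at them first. The Python sketch below shows the idea under stated assumptions: the suffix-to-case table and the function names are hypothetical, the check relies on naive suffix matching, and its output is only a list of candidates for review, not a correction. The rows are the noun tokens of the example sentence given in the methodology section.

```python
# Assumed mapping from noun endings to Universal Dependencies Case values.
SUFFIX_TO_CASE = [
    ("ning", "Gen"),
    ("dan", "Abl"),
    ("ga", "Dat"),
    ("da", "Loc"),
    ("ni", "Acc"),
]


def expected_case(form):
    """Guess a case from the ending of the word form, or None if no suffix matches."""
    lowered = form.lower()
    for suffix, case in SUFFIX_TO_CASE:
        if lowered.endswith(suffix):
            return case
    return None


def flag_for_review(rows):
    """rows: iterable of (form, upos, annotated_case).
    Returns (form, annotated_case, guessed_case) for nouns where the two differ."""
    flagged = []
    for form, upos, annotated in rows:
        if upos != "NOUN":
            continue
        guess = expected_case(form)
        if guess is not None and guess != annotated:
            flagged.append((form, annotated, guess))
    return flagged


# (FORM, UPOS, Case) of the tokens tagged NOUN in the example sentence.
EXAMPLE_ROWS = [
    ("Shoir", "NOUN", "Nom"),
    ("yigitga", "NOUN", "Nom"),
    ("dil", "NOUN", "Nom"),
    ("dildan", "NOUN", "Abl"),
    ("uni", "NOUN", "Acc"),
    ("bor", "NOUN", "Nom"),
    ("paytini", "NOUN", "Acc"),
]

flagged = flag_for_review(EXAMPLE_ROWS)
print(flagged)
```

In this sample the check singles out yigitga, which ends in the dative marker -ga but is annotated Case=Nom. Whether that annotation should change is a decision for the annotator.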

References

1. Duval E., Hodgins W., Sutton S., Weibel S. (2002). Metadata principles and practicalities. D-Lib Magazine. 8(4). http://www.dlib.org/dlib/april02/weibel/04weibel.html

2. Abdurakhmonova N., Aripov M. Uzbek ontology of Uzbek language as example of adjective / Turklang - 2018 6-international conference. Tashkent, 2018. P. 234-237.

3. Abdurakhmonova N., Tuliyev U. (2018). Morphological analysis by finite state transducer for Uzbek-English machine translation / Foreign Philology: Language, Literature, Education. № 3 (68). P. 59-66.

4. Khusainov A., Suleymanov D., Gilmullin R., Minsafina A., Kubedinova L., Abdurakhmonova N. First Results of the TurkLang-7 Project: Creating Russian-Turkic Parallel Corpora and MT Systems / CMLS 2020: Computational Models in Language and Speech. 2020/12/11. Vol. 2780. P. 90-101.

5. Aripov M., Razzoqova B., Sharibbek A., Abdurakhmonova N. Ontology of grammar rules as example of Noun the Uzbek language and Kazakh languages / VI international scientific conference "Modern problems of the applied mathematics and information technology. Al Khorezmiy-2018". 2018/9/13. P. 37-38.
