Research article "Dictionary block of the national corpuses of the Turkic languages" (specialty: Linguistics and Literary Studies)

CC BY



Text of the research paper "Dictionary block of the national corpuses of the Turkic languages"

Philological sciences - Mammadova Rana Huseyn gizi

Linguistics: dictionary block ...

UDC 81'33

DOI: 10.26140/bgz3-2019-0801-0025


DICTIONARY BLOCK OF THE NATIONAL CORPUSES OF THE TURKIC LANGUAGES

© 2019

Mammadova Rana, candidate for a PhD degree in philology, Institute of Linguistics named after I. Nasimi of the National Academy of Sciences of Azerbaijan (AZ1143, Azerbaijan, Baku, H. Dzhavid Ave. 31, e-mail: rena.memmedova.1991@inbox.ru)

Abstract. The aim of this article is to identify a suitable model for compiling the national corpus of the Azerbaijani language. Methods: Using the descriptive method, the article reports on the work being done to create national corpora of the Turkic languages; the author also relies on the descriptive-comparative method. Creating and improving national corpora of the modern Turkic languages is an important and topical issue for computational linguistics. The national corpora of the Turkic languages, created by leading research teams, differ in size, structure and environment of use. For the first time, the article surveys all existing national corpora of the Turkic languages, analyzes the methods used to compile them, and examines the existing corpora of Turkish, Bashkir and Kazakh. These projects are of great importance for the national corpus of the Azerbaijani language, which is at the development stage. The article also mentions the machine funds of the Turkic languages, which were created in different years in different Turkic-speaking countries. Conclusion: Having analyzed the materials on the topic, the author concludes that approaches proven for the world's leading countries and languages cannot be applied to the Turkic languages directly; certain modifications are needed. The practical significance of the article is that its main provisions and conclusions can be used in building the electronic dictionary block of the Azerbaijani language.

Keywords: national corpus, Turkic languages, electronic dictionary, optimal structure, features of corpuses, corpuses of Turkic languages

Baltic Humanitarian Journal. 2019. Vol. 8. No. 1(26)

Different approaches to compiling, structuring and deploying electronic dictionaries, one of the important components of national language corpora, are found not only across languages of different systems but also within a single language family. The idea of building a machine fund of the Turkic languages was put forward in the former Soviet Union in 1988, when the main directions of its creation were identified. At that time a working group of well-known scholars, with invited experts from St. Petersburg, Novosibirsk, Baku, Tashkent, Bishkek, Kazan, Ashgabat, Ufa, Nalchik, Cheboksary and Almaty, adopted a decision on creating the machine fund of the Turkic languages. Under this decision, lexicographic, grammatical, statistical-stylistic and historical-etymological information on the Turkic languages was to be collected and systematized on a comparative basis. Rules for presenting the actual form and actual meaning of each item of information were then to be drawn up. The Institute of Linguistics of the Academy of Sciences of Kazakhstan was recommended as the body to create the machine fund of the Turkic languages [17, 47].

At the first stage, priority was given to studying the structural and phonetic varieties of monosyllabic words, as a basis for investigating how grammatical word forms are built in the Turkic languages. For the longer term, a multifunctional Turkic Languages Machine Fund (TLMF) was envisaged that could model both the general Turkic language system and each individual language. The TLMF was to coordinate the collection and systematization of information on the lexicographic, grammatical and historical-etymological features of the Turkic languages on a comparative basis. At the same time, intensive work on this problem also began in other Turkic regions [3, 154].

The first corpus project for Turkish is the ODTU corpus, known as the Turkish Corpus, which was carried out under the direction of Bilge Say as one of the research projects on developing computer-based corpora. It contains only written texts produced since 1990; there are no samples of spoken language. It is an offline corpus of two million words, built by selecting samples from different text types, transferring them to electronic form and annotating them [6].

The other research corpus for Turkish is the Turkish National Corpus, "Türkçe Ulusal Derlemi". The project, supported by TÜBİTAK (Türkiye Bilimsel ve Teknolojik Araştırma Kurumu, the Scientific and Technological Research Council of Turkey), was prepared by researchers of the Linguistics Department of Mersin University. Work began in 2008 and the first version was released in 2012. It is a balanced, mixed (written and spoken), synchronic, general-purpose corpus of fifty million words, with 95% oral and 5% written samples from various fields covering 1990-2009 [7].

TS Corpus, prepared by Taner Sezer and consisting of 491 million words, went online in 2012. It is a general-purpose corpus annotated for word form, morpheme and root, which makes it very convenient for the user [5].

The Historical Corpus of Old Turkish and Karakhanid Turkish (Eski Türkçe ve Karahanlı Türkçesinin Tarihsel Derlemi, ETKTD) is a diachronic (historical) corpus of Turkish. It is an online corpus of 400-450 thousand words covering 600 years (7th-13th centuries), created by transferring written texts in Orkhon Turkish, Uighur Turkish and Karakhanid Turkish to electronic form and annotating them at the level of word combinations and syntax [8].

The last online corpus of written texts was produced under the direction of Prof. Marcel Erdal in 1999-2003: "Vorislamische Alttürkische Texte: Elektronisches Corpus" (VATEC), an electronic corpus of pre-Islamic Old Turkic texts. The corpus web page is written in German. The corpus contains texts from the Uighur Turkish period [9].

The Spoken Turkish Corpus, "Sözlü Türkçe Derlemi" (STD), supported by TÜBİTAK in 2008-2010, is an online corpus intended for tracking modern Turkish in electronic form and analyzing the database with linguistic methods. It is a 1-million-word corpus of Turkish speech recorded in face-to-face and other forms of communication. A test version was opened to users for searching in 2010, and a version of 400,000 words of transcribed speech was released towards the end of 2013 [10].

In 2012, linguists at Mersin University presented the demo version of the Turkish National Corpus (TUD), a project that had started in 2008.

Users can narrow or widen a search for the features they are interested in by combining parameters of the texts published in 1990-2010: medium (books, periodicals, miscellaneous published texts), nine subject fields (social sciences, art, trade, economy, thought and belief, world affairs, applied sciences, natural sciences, etc.), sex of the writer (male, female), type of authorship (multiple, corporate, single), target readership (children, young people, general) and so on [10].

When the word "oyuncak" is entered in the query interface, the top of the screen shows its frequency per million words, together with how many times it occurs (1200) and in how many different texts (450), out of the 4458 texts searched in total [16].
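The figures reported for "oyuncak" can be related to one another with simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the corpus size of 50 million words is taken from the description of the Turkish National Corpus earlier in the text, on the assumption that the query ran over the whole corpus.

```python
def per_million(hits, corpus_tokens):
    """Normalised frequency: occurrences per one million running words."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return hits * 1_000_000 / corpus_tokens


def text_coverage(texts_with_hit, texts_searched):
    """Share of the searched texts that contain the word at least once."""
    return texts_with_hit / texts_searched


# Counts quoted in the passage; the corpus size is an assumption (see above).
rate = per_million(1200, 50_000_000)
share = text_coverage(450, 4458)
print(f"{rate:.1f} per million words, found in {share:.1%} of the texts")
```

Reporting both values is useful because a word can be frequent overall yet concentrated in a handful of texts.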

The "menu" button on the right of the screen shows the distribution of the word by year of publication, text sample, field, derived text type, sex of the author, type of authorship, readership and type of reader. The "list" button arranges the words to the left and right of the keyword (oyuncak) in a column in alphabetical order, and finally gives the frequency of the words to the left and right of the keyword.
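The "list" view described here corresponds to what is usually called a keyword-in-context (KWIC) concordance. The following is a minimal sketch of the idea, not the corpus's actual implementation; the sample sentence and the function name `kwic` are invented for illustration.

```python
from collections import Counter


def kwic(tokens, keyword, window=1):
    """Return (left context, hit, right context) for every hit of keyword."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows


# Invented toy sentence, already tokenised and lower-cased.
tokens = "yeni oyuncak aldı ve eski oyuncak kırıldı".split()
rows = kwic(tokens, "oyuncak")

# Python sorts by code point; a real Turkish collation needs a locale-aware key.
by_left = sorted(rows, key=lambda r: r[0])
by_right = sorted(rows, key=lambda r: r[2])

# Frequency of the words immediately to the right of the keyword.
right_freq = Counter(r[2] for r in rows)
```

Sorting on the left or the right context groups recurring neighbours together, which is what makes collocations visible in such a list.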

TS Corpus, another online Turkish corpus, is a general-purpose, unbalanced corpus that helps the user considerably through annotation of part of speech, morpheme and root. Users enter the corpus with the user name and password created in the registration menu [8].

The corpus interface is in English. On the left of the main page there is a "corpus queries" group with the sections "standard query", "restricted query", "word lookup", "frequency lists" and "keywords".

If the word "yüz" (üz) is entered in the query interface as a simple query that ignores case, 4,497 results are returned in total. If the search is case-sensitive (simple query, case-sensitive), 41,656 results beginning with a lowercase letter and 4,413 beginning with a capital letter are returned. Because the corpus is annotated for part of speech, it is also possible to search by part of speech by entering the "label" codes in the query interface; entering "verb", for example, returns all results tagged as verbs. A shortcoming of this search is that nouns derived from verb roots are also returned as verbs.

When the root of a word (lemma) is entered in the query interface in the {KOK} form, both the bare root and its inflected and derived forms are returned. Searching in this way has two advantages. First, a query for "gönül" also finds forms such as "gönlüm" and "gönlün", in which the root loses its last vowel when a suffix is added. Second, it finds the forms in which the final sounds "p, ç, t, k" change to "b, c, d, ğ" when a suffix beginning with a vowel is added.
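The two advantages of a root query can be illustrated with a small sketch that generates the alternating stem shapes of a lemma. This is a hedged toy model and not the TS Corpus implementation: the function names are hypothetical, vowel loss is switched on by a per-word flag because it is lexically conditioned, and the softening rule over-generates, since not every root ending in p, ç, t or k softens.

```python
# Final-consonant softening before a vowel-initial suffix.
SOFTENING = {"p": "b", "ç": "c", "t": "d", "k": "ğ"}
VOWELS = set("aeıioöuü")


def stem_variants(lemma, drops_vowel=False):
    """Return the stem shapes under which a lemma may surface."""
    variants = {lemma}
    last = lemma[-1]
    if last in SOFTENING:
        # kitap -> kitab (as in kitabı)
        variants.add(lemma[:-1] + SOFTENING[last])
    if drops_vowel and len(lemma) >= 3 and lemma[-2] in VOWELS:
        # gönül -> gönl (as in gönlüm, gönlün)
        variants.add(lemma[:-2] + lemma[-1])
    return variants


def matches_lemma(word_form, lemma, drops_vowel=False):
    """True if the word form begins with any stem variant of the lemma."""
    return any(word_form.startswith(v)
               for v in stem_variants(lemma, drops_vowel))


print(matches_lemma("gönlüm", "gönül", drops_vowel=True))
print(matches_lemma("kitabı", "kitap"))
```

A plain string search for "gönül" or "kitap" would miss both of these forms, which is why annotation at the root level matters for an agglutinative language.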

The third online corpus discussed here is the Historical Corpus of Old Turkish and Karakhanid Turkish, Eski Türkçe ve Karahanlı Türkçesinin Tarihsel Derlemi (ETKTTD). It is an online historical corpus of 400-450 thousand words covering 600 years (7th-13th centuries), created by copying written texts in Orkhon Turkish, Uighur Turkish and Karakhanid Turkish to electronic form and annotating them at the level of word combinations and syntax. No user name or password is needed to enter the corpus.

On the right of the query interface there are buttons for the special symbols "d, h, e, g, k" and for the wildcard character (*).

Word searches can be restricted to all or any one of the periods "Karakhanid Turkish", "Uighur Turkish" and "Orkhon Turkish", and further by "name of the work (text)", "century" and "text type".

On the results page, the search word is shown in its syntactic context together with the text type, century, name of the text and period, all in a single layout. The number of the line of the text to which the sentence belongs is given to the left of the sentence in which the word occurs. With the help of the wildcard character, a word can be found together with its suffixed forms. For example, entering "ıgaç" with the wildcard in the query interface also returns the form "ıgaçka".
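A wildcard search that reports line numbers can be sketched as follows. The snippet is an assumption-laden illustration rather than the corpus's own search engine: the function names are hypothetical, the asterisk is taken to stand for any run of word characters, and the sample lines contain placeholder tokens (w1, w2, w3) around the passage's example forms, written here as ıgaç and ıgaçka.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a query in which '*' matches any run of word characters."""
    parts = (re.escape(p) for p in pattern.split("*"))
    return re.compile(r"\w*".join(parts))


def search(lines, pattern):
    """Return (line number, token) for every token matching the pattern."""
    rx = wildcard_to_regex(pattern)
    hits = []
    for number, line in enumerate(lines, start=1):
        for token in line.split():
            if rx.fullmatch(token):
                hits.append((number, token))
    return hits


# Placeholder lines; only the two example forms are meaningful.
lines = ["w1 ıgaçka w2", "ıgaç w3"]
print(search(lines, "ıgaç*"))
print(search(lines, "ıgaç"))
```

Without the wildcard only the bare form is matched; with it, the suffixed form on line 1 is returned as well.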

The last online corpus of written texts, "Vorislamische Alttürkische Texte: Elektronisches Corpus" (VATEC, an electronic corpus of pre-Islamic Old Turkic texts), was produced under the direction of Prof. Marcel Erdal in 1999-2003. The corpus web page is written in German.

The text location query form lets the user restrict a query by language, text and category, or search across all categories in the database.

The corpus location query form is the search tool for Old Turkic words in the VATEC database. When a word is searched for, adding the wildcard "*" to it also finds the Old Turkic words that begin with the search word. The morpheme combination query form makes it possible to query word combinations or morphemes in the Old Turkic texts, which are annotated at the level of word combinations and morphemes. The written-form query form makes it possible to search for an Old Turkic word in the writing systems used before Islam: the Göktürk (Runic), Uighur, Manichaean, Tibetan, Chinese, Syriac and Brahmi scripts.

Clicking the name of the text in which the word occurs on the results page shows the context in which the word is used.

When the creation of machine funds of the Turkic languages began, work on a national corpus of the Kazakh language was started by the Committee for Languages of the Ministry of Culture of the Republic of Kazakhstan.

The research of A.K. Jubanov on the statistical study of Kazakh texts deserves attention. Analysing the novel "Abay Joli" by the famous Kazakh writer M. Ayezov from a linguistic-statistical point of view, he found that about 466 thousand word forms are used in the novel. Of the derived verbs in the writer's language, 64% are of verbal origin, and 91% of these are formed with typical suffixes. Three or more typical suffixes can be attached to the same verb form, which can be regarded as one of the distinctive features of Kazakh. Kazakh linguists have also prepared many statistical dictionaries (frequency, reverse frequency and so on), which collect information about the language that is recorded nowhere else. Naturally, this useful information should be reflected in the national corpus of the Kazakh language, in such a way that the user can retrieve it without difficulty.

Work on an illustrative text fund of the Kazakh language began in the first years. Electronic versions of the 20 volumes of the famous Kazakh writer M. O. Ayezov were prepared, and alphabetical frequency dictionaries were compiled; the final versions were completed for each volume in parallel. Next to the frequency of each word form, the dictionary gives the page and line on which it occurs in the book. A useful feature of this work is that the linguist can obtain any word form in its context, on screen or in printed form, which yields much more information about the writer's language. The 20 volumes of M. O. Ayezov contain approximately 3 million word forms.

Information about each word form can be obtained both for each volume separately and for all 20 volumes together.
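A frequency dictionary that also records where each word form occurs can be modelled as an index from word form to a list of (volume, page, line) locations. The sketch below is one possible layout under stated assumptions: the data structure, the function names and the placeholder tokens are invented, and the original project's storage format is not described in the text.

```python
from collections import defaultdict


def build_index(volumes):
    """volumes maps a volume number to a list of pages; a page is a list of lines."""
    index = defaultdict(list)
    for vol, pages in volumes.items():
        for page_no, page in enumerate(pages, start=1):
            for line_no, line in enumerate(page, start=1):
                for form in line.lower().split():
                    index[form].append((vol, page_no, line_no))
    return index


def frequency(index, form, volume=None):
    """Frequency of a word form in one volume, or in all volumes together."""
    locations = index.get(form, [])
    if volume is None:
        return len(locations)
    return sum(1 for vol, _, _ in locations if vol == volume)


def alphabetical_frequency_list(index):
    """(word form, frequency) pairs in alphabetical (code point) order."""
    return sorted((form, len(locs)) for form, locs in index.items())


# Placeholder data: two tiny "volumes" with one page each.
volumes = {1: [["a b", "b c"]], 2: [["b"]]}
index = build_index(volumes)
print(frequency(index, "b"), frequency(index, "b", volume=1))
print(index["b"])
```

Keeping the locations rather than only the counts is what allows a word form to be shown in its context and reported per volume or for the whole collection.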

After the electronic version of the 10-volume explanatory dictionary of the Kazakh language was prepared, every dictionary entry was annotated as a linguistic unit; the annotation reflects its lexical meaning, morphological structure and syntactic function.

A special dictionary block is planned for the national corpus of the Kazakh language. It includes the statistical dictionary based on the 20 volumes of M. O. Ayezov and the 10-volume explanatory dictionary, as well as a grammatical dictionary, frequency dictionaries for various genres and other types of dictionary. Electronic versions of all the terminological dictionaries are also planned.

The academic dictionary and grammar fund is regarded as one of the important components of the national corpus of the Kazakh language. This fund consists of sub-corpora in which historical-etymological, dialectological, onomastic, grammatical and lexical rules and dictionaries are organized according to a defined scheme of relations.

Like national corpora of other languages, the national corpus of the Kazakh language is planned to have rich software that makes full linguistic analysis possible.

This includes automatic morphological, syntactic and semantic analysis. The system that carries out the full analysis performs the function of a linguistic processor.

Kazakh linguists believe that many research teams should be brought into the work of creating the national corpus and that world practice in this field should be taken into account [3, 152-153].

Interesting works on theoretical issues related to the various directions and the structure of corpus linguistics were written in that period [1], [4].

Work on the creation of the machine fund of the Bashkir language began in 2003.

The machine fund of the Bashkir language was intended especially for language teachers, instructors and students, in particular in higher education. A special information base was set up in the fund, and interfaces and software were created for querying and populating it. The term "interface" is used here in the sense of "connection, link, point of contact, method of communication". The interface of a computer program may stay fixed, or it may be modified without changing the principles by which the program interacts with other objects; in windowed programs, for example, the interface is the same. Thus "interface" refers to the means by which the user communicates with different devices.

The machine fund of the Bashkir language consists of sub-funds that bring together 7 information bases. The main catalogue contains 100,000 root and derived words and reflects the necessary information about the lexical system of the language; it is intended to cover all layers of the language. More than 50 attributes are recorded for every lexical unit. The main catalogue records the part of speech of the word, its origin, its style, whether it belongs to a dialect or to the literary language, whether it is a historicism, archaism or neologism, whether it is finite or non-finite, common or proper, and so on. The main catalogue is linked to the other bases of the fund's sub-funds, which makes it possible to obtain additional information when needed. The lexicographic sub-fund holds about 500,000 dictionary entries of the modern Bashkir language. It includes academic and educational dictionaries: monolingual, bilingual, multilingual, frequency, terminological, phraseological, synonym and reference dictionaries, as well as onomastic dictionaries of place names (streets, cities, districts, etc.).

The practical phonetics sub-fund describes the articulatory features of the vowels and consonants of Bashkir. It contains a phonetic dictionary of 8,000 units. This material can be used by those learning Bashkir on their own.

To provide information about the written literary language of the Bashkirs, the fund also contains two further catalogues, together covering more than 2,000 manuscripts and early printed books. The catalogues describe the manuscripts and early printed books and record the following: title (with a Russian translation), transliteration of the title, author of the work, name of the author (in transliteration), information about the copyist of the manuscript, year of copying, size, format (number of lines per page), features, annotation, finder, language (Arabic, Old Turkic, Ottoman and so on), palaeography, place of storage, shelf mark, etc.

The dialectological sub-fund is made up of 3 independent bases: lexical, cartographic and textual.

The grammatical sub-fund contains information on academic grammars, an algorithmic description of the inflection system of Bashkir, and a statistical base of morphemes.

According to its authors, the machine fund of the Bashkir language has been under construction since 2011 at the Laboratory of Linguistics and Information Technology of the Institute of History, Language and Literature, Ufa Scientific Centre of the Russian Academy of Sciences. The fund includes texts printed from the 20s of the XIX century to the present day in the four main styles of Bashkir: fiction, journalism, scientific-educational and official-business.

The experience and materials gained in creating the machine fund of the Bashkir language are used in preparing the corpus. These include the algorithms for automatic analysis and synthesis, the lexicographic base, etc.

The morphological analysis algorithms, which reduce word forms in the corpus to their dictionary units, are regarded as the main components.

At present, electronic versions of 579 texts by 63 authors, dating from the beginning of the XX century to the present, have been prepared for the corpus, with 927,7754 word forms in total. The texts were edited to conform to the new orthography of the Bashkir language adopted in 1981 [2, 54-58].

Extralinguistic markup lets the user limit the search area by specifying concrete parameters, so that the needed information can be obtained in a short time. The extralinguistic information records data about the author (surname, first name, patronymic, date of birth, sex, etc.), the text (title, date of creation, genre, content, form of publication: book, journal, electronic text, etc.), the name of the source and the year of publication. Such extralinguistic features allow the user to search by these attributes (here "sign" is used in the sense of "marker").
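Restricting a search by extralinguistic markup amounts to filtering text records on their metadata before any word query is run. The following is a minimal sketch under stated assumptions: the record type `TextMeta`, its field names and the sample records are hypothetical and cover only a few of the attributes listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMeta:
    """A few of the extralinguistic attributes of one text (illustrative only)."""
    title: str
    author: str
    author_sex: str
    genre: str
    year: int


def select(texts, **criteria):
    """Keep the texts whose metadata equal every given criterion."""
    return [t for t in texts
            if all(getattr(t, key) == value for key, value in criteria.items())]


# Invented sample records.
texts = [
    TextMeta("T1", "A1", "f", "fiction", 1995),
    TextMeta("T2", "A2", "m", "scientific", 2005),
    TextMeta("T3", "A1", "f", "scientific", 2010),
]

subcorpus = select(texts, genre="scientific", author_sex="f")
print([t.title for t in subcorpus])
```

Each additional criterion narrows the sub-corpus further, which is how such markup shortens the time needed to reach the relevant texts.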

Morphological markup lets users select particular word forms by concrete morphological features. For every word form, the morphological tags and the lexeme it belongs to are shown: the initial form of the word, its part of speech, and the features of its grammatical categories.

Semantic markup reflects the broad thematic classes of lexemes and the features of their word forms.

Work has now begun on sub-corpora of the national corpus of the Bashkir language for periodicals (newspapers, journals), folklore texts, and official-business and scientific texts [12, 54-58].

The preparatory work described above in the field of national corpora of the Turkic languages can be regarded as the initial stage in creating the national corpus of the Azerbaijani language.

The work of the "Dilmanc" project in the field of NLP, which started at the beginning of 2003, deserves special mention. "Dilmanc" is the first machine translation system in Azerbaijan. Within the project, work on a formal grammar of Azerbaijani was started, and algorithms for formal morphological, syntactic and semantic analysis of the source language and for sentence synthesis in the target language were developed. The system can translate large texts and can also read out the correct pronunciation of translated samples and individual words.

The program "Dilmanc imla" can be downloaded free of charge to Android and iOS phones. A coordinator program must then be downloaded from the project site to the computer. Once the given code has been entered on the phone, speech dictated into the phone is automatically converted to text on the computer, so there is no need to type on the keyboard.

The Turkic languages are considered one of the largest language groups. According to some sources, there are more than 50 Turkic languages, about 40 of which are in use today, while 15 are already considered dead languages.

Research into and resolution of the problem of reconstructing the ancient Turkic language were planned for the initial stage of creating the machine fund of the Turkic languages (the national corpus of the Turkic languages). For this reason, it was considered essential for the national corpus to collect information such as structural-phonetic data, lists of morphemes, schemes of syntactic relations, and a grammatical thesaurus of affixes for the different types of monosyllabic roots of the Turkic languages.

Thus, the first stage envisaged the creation of a linguistic bank of the Turkic languages that would make it possible to study the problems of common Turkic and cross-lingual phonetic, lexical, morphological, syntactic and semantic analysis.

At present, research on the creation of NLP systems in Azerbaijan is carried out within the framework of the "Dilmanc" project. Much work has already been completed within the project, and work continues to this day [11].

Special mention should be made of the bilingual parallel text corpuses prepared within the framework of the project, which cover all styles of the language, and of the large monolingual corpuses for individual languages. Of these, the English-Azerbaijani bilingual corpus consists of 2 million sentences, the Turkish-Azerbaijani corpus of 277 thousand sentences, the Russian-Azerbaijani bilingual corpus of 4.5 million sentences, the Azerbaijani monolingual corpus of 60 million sentences and the Turkish monolingual corpus of 322 million sentences.
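To give a sense of scale, the figures above can be added up in a few lines. This is only an arithmetic sketch: the sentence counts are those reported in the text, while the dictionary keys and variable names are chosen here for illustration.

```python
# Sentence counts as reported in the text for the "Dilmanc" corpuses.
corpus_sizes = {
    "English-Azerbaijani": 2_000_000,
    "Turkish-Azerbaijani": 277_000,
    "Russian-Azerbaijani": 4_500_000,
    "Azerbaijani": 60_000_000,
    "Turkish": 322_000_000,
}

total = sum(corpus_sizes.values())
# Share of each corpus in the overall number of sentences.
shares = {name: size / total for name, size in corpus_sizes.items()}

# Print the corpuses from largest to smallest, then the total.
for name, size in sorted(corpus_sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22}{size:>13,}  {shares[name]:6.1%}")
print(f"{'Total':<22}{total:>13,}")
```

Taken together, the five corpuses amount to 388,777,000 sentences, most of them in the Turkish monolingual corpus.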

The "Poliglot" dictionary system, which has won the approval of users, should also be noted here. The program was prepared within the framework of the development of information and communication technologies in the Republic of Azerbaijan. The project includes German-Azerbaijani, English-Azerbaijani, Russian-Azerbaijani and French-Azerbaijani dictionaries, as well as Azerbaijani-German, Azerbaijani-English, Azerbaijani-Russian and Azerbaijani-French dictionaries. A spell-checking system for the Azerbaijani language was also prepared within the project and made available to users [12].

In addition, the information base of the Azerbaijani language was prepared in 2017. The base contains sections on the Azerbaijani language, official documents, Azerbaijani linguists (Doctors of Philosophy, Doctors of Sciences, corresponding members and full members) and a linguistics library (dictionaries, abstracts, monographs and manuals) [13].

The corpus of electronic dictionaries of the Azerbaijani language was created in 2018 with the financial support of the Science Development Foundation under the President of the Republic of Azerbaijan (EIF-KETPL-2-2015-1(25)). The corpus includes an orthographic dictionary, an explanatory dictionary, a dictionary of abbreviations (foreign-language), and dictionaries of male and female names. When any word is searched, all of its entries across the dictionaries are displayed [14].
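A search of this kind, in which a single query is run against every dictionary in the block, might be sketched as follows. The function name, the dictionary keys and the sample entries are hypothetical placeholders and are not taken from the resource cited in [14].

```python
def search(dictionaries, word):
    """Return {dictionary_name: entry} for every dictionary that contains the word.

    Lookup is by exact match. Case-insensitive search for Azerbaijani would need
    locale-aware handling of the dotted and dotless letters (İ/i, I/ı).
    """
    return {name: entries[word]
            for name, entries in dictionaries.items()
            if word in entries}


# Placeholder entries, for illustration only.
dictionaries = {
    "orthographic": {"kitab": "kitab"},
    "explanatory": {"kitab": "placeholder gloss: 'book'"},
    "abbreviations": {"NLP": "Natural Language Processing"},
    "male_names": {"Əli": "male given name"},
    "female_names": {"Aygün": "female given name"},
}

hits = search(dictionaries, "kitab")
for name, entry in hits.items():
    print(f"{name}: {entry}")
```

Keeping each dictionary as a separate mapping means a further dictionary can be added to the block without changing the search routine.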

Analogous research validated for the leading countries and languages of the world cannot be applied as it stands to the Turkic languages. However, the scientific idea behind it can be applied to all the world's languages with certain modifications.

Earlier research on the various areas of a language can, and indeed should, be used in national language corpuses, since the most important function of a corpus is, as is well known, to present certain information about a specific language to the user. Put simply, methods, rules and procedures must be prepared that allow the user to access particular aspects of any language. Thus, it is not the aspects of the language themselves that are used directly, but the forms of linguistic information about the language that are presented to the user.

REFERENCES:

1. Баранов А.Н. Корпусная лингвистика //Баранов А.Н. Введение в прикладную лингвистику: Учебн.-метод. Пособие. — СПб., 2005. 48 с.

Baltic Humanitarian Journal. 2019. Т. 8. № 1(26)

2. Бускунбаева Л.А., Сиразитдинов З.А. О проблемах национального корпуса башкирского языка. Материалы. «Современное казахское языкознание: актуальные вопросы прикладной лингвистики». Алматы, 2012, с. 54-55.

3. Жубанов А.К. Казахское языкознание: прикладная лингвистика. Алматы, «КИЕ», 2012, 696 с.

4. Захаров В.П. Корпусная лингвистика. Учебно-методическое пособие. Санкт-Петербург, Санкт-Петербургский государственный университет, 2005, 48 с.

5. Electronic resource: http://www.tscorpus.com/tr (Access date: 03.01.2019)

6. Electronic resource: http://lib.metu.edu.tr/tr/odtu-tez-koleksiy-onu-sorgulama-sayfasi (Access date: 03.01.2019)

7. Electronic resource: https://www.tubitak.gov.tr/ (Access date: 03.01.2019)

8. Electronic resource: https://tscorpus.com/ (Access date: 03.01.2019)

9. Electronic resource: http://derlem.cu.edu.tr/ (Access date: 23.12.2018)

10. Electronic resource: http://www.dam.org.tr/index.php/tr/derlem-ler/66-soezlue-tuerkce-derlemi (Access date: 03.01.2019)

11. Electronic resource: www.dilmanc.az (Access date: 03.01.2019)

12. Electronic resource: www.poliqlot.az (Access date: 03.01.2019)

13. Electronic resource: http://azerbaycandili.az/Home/Index (Access date: 03.01.2019)

14. Electronic resource: http://korpus.azerbaycandili.az/ (Access date: 06.01.2019)

15. Əliquliyev R., Şükürlü S., Kazımova S. Elmi fəaliyyətdə istifadə olunan əsas terminlər. Bakı, "İnformasiya Texnologiyaları", 2009, 201 s.

16. Karaoglu S. The Relationship Between The Structure And SocioCulturel Crime Case: The Case Of Malatya. Electronic resource: http://der-gi.kmu.edu.tr/userfiles/file/Mayis20142/30m.pdf (Access date: 18.12.2018)

17. Mahmudov M. Kompüter dilçiliyi. Bakı, "Elm və təhsil", 2013, 356 s.

Received by the editors 04.02.2019. Accepted for publication 27.02.2019.
