Scientific article on the topic 'THE IMPORTANCE OF LINGUISTIC MODELS IN THE DEVELOPMENT OF LANGUAGE BASES'

THE IMPORTANCE OF LINGUISTIC MODELS IN THE DEVELOPMENT OF LANGUAGE BASES. Text of a scientific article in the specialty "Linguistics and literary studies"

CC BY
Keywords
CORPUS / SPELLING MODULE / MORPHOLOGICAL MODULE / LINGUISTIC MODULE / WORD-COMBINATION MODULES / WORD ALGORITHM / FORMULA ALGORITHM / TABULAR ALGORITHM / GRAPHICAL ALGORITHM

Abstract of a scientific article on linguistics and literary studies. Authors: Toirova G., Hamroeva N.

The article discusses the transformation of language into the language of the Internet and computer technology, mathematical linguistics and its continuation, the formation and development of computer linguistics, and in particular the question of modeling natural languages for artificial intelligence. The Uzbek National Corpus plays an important role in enhancing the international status of the Uzbek language. The work carried out in the field of computer linguistics plays an important role in resolving existing problems in the Uzbek language. The question of dividing the special tags used for marking texts and their components into linguistic and extralinguistic ones is studied in particular. The coding requirements for important text information are defined. It is argued that specific linguistic model forms should be developed for the marking of each word group. The text marking format and the coding requirements for important text information are studied, and the existing corpus marking standards are taken into account. Given that there is currently no system for automatic text processing and searching on the basis of different features of the text, it is noted that markup is the main task in creating a corpus. The article analyzes the linguistic module, as one of the independent components of linguistic program code, and the algorithm and its types. The need for algorithms of phonological, morphological and spelling rules for the formation of the lexical and grammatical code is scientifically substantiated. The importance of such linguistic modules as phonology, morphology and spelling in the formation of the linguistic base of the national corpus of the Uzbek language is emphasized.



PHILOLOGICAL SCIENCES

THE IMPORTANCE OF LINGUISTIC MODELS IN THE DEVELOPMENT OF LANGUAGE BASES

Toirova G.

Bukhara State University, PhD in philosophy, associate professor

Hamroeva N.

Bukhara State University, teacher of the Department of Preschool Education


Introduction.

It is no secret that today's growth in developing countries is due to many factors, including a commitment to innovation and to the timely implementation of advanced technologies. Innovation is, in fact, the key to growth. As a consequence of new developments in research, new words are adopted into the language from external sources, and the scale of their use increases daily. In particular, we can see that Uzbek computer linguistics is being enriched by words borrowed from international computer linguistics. Let us take the term "module" as an example.

This term is used in the field of informatics: "1) module: a program file; 2) module: an object that makes up the code; 3) module: a set of computer cooling systems; 4) MOD: a music file format" [13]; in mathematics: "1) absolute height; 2) modulus of a vector; 3) modulus of automorphism; 4) the coefficient of conversion of a logarithm in one system to a logarithm in another system, as well as the absolute value of a magnitude" [9]; and in the field of mechanics: "1) Young's modulus; 2) modulus of elasticity; 3) displacement modulus" [14]. Today, "a module is a complete functional part of a program; modular teaching is modern education, i.e. step-by-step teaching according to the level of knowledge" [3, 12].

The term "linguistic module" plays an important role in the field of computer linguistics, for example in the conversion of natural language into a machine language, i.e. the development of ways to process text by computer. To this end, linguistic programs are being created in various languages. The linguistic module is an integral part of these linguistic programs. For example, the lexical module covers the dictionary layer (words); the grammatical module edits symbols, punctuation, letters and other characters; the spelling module applies the spelling rules; the morphological module performs the analysis of words (from word form to lexeme) and the synthesis process (formation of word forms from lexemes); and the syntactic module handles the supersyntactic unit and the phenomenon of interconnection.

Analysis of the relevant literature. In her research, M. Abzalova notes: "In order to obtain realistic results in the development of a linguistic framework of word classes, first of all, the affixes that form them and their combinations are attached to words and are the best way to reach the linguistic base." We recommend using the following linguistic modules suggested by M. Abzalova in the formation of the National Corpus of the Uzbek Language:

"The affixes added to the key words in the modulation of the noun category are defined as follows:

affix of affiliation: q_a = -niki;
affix of place: uj = -dagi;
affix of limiting: ch_q[3] = {-gacha, -kacha, -qacha};
affix of plural: Pl_a = -lar;
case affixes (with variants): k_a[7] = {-ning, -ni, -ga, -ka, -qa, -da, -dan};
possessive affixes: e_a[9] = {-m, -im, -ng, -ing, -lari, -miz, -imiz, -ngiz, -ingiz};
noun-forming affix: sh_y = -lik;
1st type affix of person-number category: sh_s1 = {-man, -san, -miz, -siz; -simiz, -sisiz};
affixes: -mi, -chi, -gina, -kina, -qina, -dir, -u, -yu, -da, -a, -ya.

The following examples can be given of the module for attaching the given affixes to the base (A = base, N = derivative):

1. N = A + q_a: bolaniki = bola + niki
2. N = A + uj: boladagi = bola + dagi
3. N = A + ch_q[3]: bolagacha = bola + gacha
4. N = A + Pl_a: bolalar = bola + lar
5. N = A + k_a[7]: bolaning = bola + ning
6. N = A + e_a[6]: bolam = bola + m
7. N = A + Pl_a + e_a[6]: bolalarim = bola + lar + im
8. N = A + e_a[6] + k_a[7]: bolamga = bola + m + ga
9. N = A + Pl_a + e_a[6] + k_a[7]: bolalarimga = bola + Pl_a: lar + e_a[6]: im + k_a[7]: ga
10. N = A + e_a[6] + uj: bolamdagi = bola + m + dagi.

The modulation continues in this order" [2].
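The attachment rules quoted above can be read as patterns of the form "base + affix class + affix class". The following Python fragment is a minimal sketch of that reading. The names `AFFIXES` and `segment` and the backtracking strategy are illustrative assumptions, not part of M. Abzalova's program, and only some of the affix classes from the quotation are included.

```python
# Affix classes copied from the quotation; the dictionary layout is an assumption.
AFFIXES = {
    "q_a": ["niki"],
    "uj": ["dagi"],
    "ch_q": ["gacha", "kacha", "qacha"],
    "Pl_a": ["lar"],
    "k_a": ["ning", "ni", "ga", "ka", "qa", "da", "dan"],
    "e_a": ["m", "im", "ng", "ing", "lari", "miz", "imiz", "ngiz", "ingiz"],
}


def segment(word, base, formula):
    """Split `word` into `base` plus one affix per class named in `formula`.

    Returns the list of parts, or None if the word does not fit the formula.
    """
    if not word.startswith(base):
        return None

    def match(rest, classes):
        # No classes left: succeed only if the whole word has been consumed.
        if not classes:
            return [] if rest == "" else None
        # Try every variant of the next class, backtracking on failure.
        for affix in AFFIXES[classes[0]]:
            if rest.startswith(affix):
                tail = match(rest[len(affix):], classes[1:])
                if tail is not None:
                    return [affix] + tail
        return None

    parts = match(word[len(base):], list(formula))
    return None if parts is None else [base] + parts


print(segment("bolalarimga", "bola", ["Pl_a", "e_a", "k_a"]))
```

Backtracking is used rather than a greedy match because some variants are prefixes of others (for example -da and -dan), so the first variant that fits is not always the right one.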

In the process of creating a national corpus of the Uzbek language, an optimized version of M. Abzalova's approach is being used. An algorithm of phonological, morphological and orthographic rules must be established in order to form the lexical-grammatical code in the module of linguistic norms for Uzbek word combinations.

Methodology of research. What is an algorithm? [6] An algorithm is a clear rule (program) for the execution of actions in a certain order, used to solve problems of a particular type. It is one of the basic concepts of cybernetics and mathematics. In the Middle Ages, the rule for performing the four arithmetic operations in the decimal number system was called an algorithm. [15] The computer, with its computing power, is fast, clean, accurate and at the same time "completely incomprehensible" [7]. It is a mistake to think that the computer invents something on its own when we use it to solve problems: a clear and complete instruction is needed for the computer to work. An algorithm is a rigidly set order of the actions needed to produce the final result. This may sound strange, but we constantly encounter algorithms in real life. An example is the use of a payphone, which involves a sequence of actions required for a successful phone call. The instructions for home appliances tell us, in a short and understandable way, what to do in one situation or another, and so determine the algorithm of our actions. According to historians and mathematicians, [21] the word "algorithm" is derived from the name of our great ancestor Abu Abdullah Muhammad ibn Musa al-Khwarizmi, and his famous book "Kitab al-jabr wa al-muqabala" gave rise to another popular term, "algebra". It is fair to say that the algorithm is the basis for producing the instructions that control computer-assisted activities. We cannot, however, transfer our record of an algorithm directly to the computer, because it is written in a language that only people understand. For a computer to understand an algorithm, it is translated into a machine language; algorithms written in a machine language are called programs or computer programs.

Important features of any algorithm are: accuracy, meaning that each step has a definite value; discreteness, meaning that the process of solving the problem can be divided into several simple steps (execution steps) so as not to cause difficulties for the computer or the person; and publicity, or usefulness, meaning that the actions of the algorithm come to an end and make it possible to obtain the desired result from the initial data in the final steps [20].

In practice, there are the following types of algorithms: linear, in which actions are carried out sequentially without any conditions being checked; branching, in which the instructions to be executed depend on whether given conditions hold; and cyclic, in which individual processes or groups of processes are repeated. Algorithms can be written down verbally, as formulas, as tables or graphically.
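The three types can be illustrated with a small Python sketch. The function names and the toy task (normalizing a word and guessing its number from the ending -lar) are assumptions chosen only for illustration. Checking the ending alone is not a reliable test of plurality in Uzbek.

```python
def linear(word):
    # Linear: the steps run one after another, no condition is checked.
    lowered = word.lower()
    stripped = lowered.strip()
    return stripped


def branching(word):
    # Branching: which instruction runs depends on a condition.
    if word.endswith("lar"):
        return "plural"
    return "singular"


def cyclic(words):
    # Cyclic: the same group of steps is repeated for every item.
    counts = {}
    for w in words:
        label = branching(linear(w))
        counts[label] = counts.get(label, 0) + 1
    return counts


print(cyclic(["bola", "Bolalar ", "kitoblar"]))
```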

Information serves as the raw material that computers process, just as metal ore is the raw material of metallurgical production. However, for processing to be effective, any raw material must undergo initial preparation. First, we collect information about the event we are interested in; then we systematize and classify this information. Next, we build a model that represents the given event. The model represents an event using a special mathematical apparatus, graphics and diagrams, and is structured to show the characteristics and key aspects of the situation. Both mathematical and simulation modeling are available. Mathematical modeling is the application of mathematical tools to the study and expression of an event. An exact mathematical model makes it possible to observe and analyze the state of an object. Simulation modeling, used mainly in industry, makes it possible to perform a series of tests, using computer equipment and special software, on devices that do not yet exist in reality. Its application accelerates production, since the design and research process is shortened and the number of errors and their cost are reduced. For example, Boeing abandoned its long-standing practice of developing full-scale cabin models to plan the position of passenger seats and replaced them with computer models. This saved millions of dollars and reduced the time needed to produce new aircraft parts. Once the model is built, the next step is to create an algorithm that matches it; problems are then solved by means of algorithms. An algorithm for solving a problem, written in a computer language (machine code) as a series of commands, is called a machine program. A command of a machine program is an elementary machine instruction that is executed automatically, without additional instructions or explanations.

Programming is both a theoretical and a practical activity. The process of translating an algorithm into a machine language is called compiling. The first step in "humanizing" machine language was to create programs that convert symbolic names to machine code. Then programs for converting arithmetic expressions were created, and finally, in 1958, the translator for Fortran, a widely used programming language, came into being. Since then, many programming languages have been developed.

A computer processes information under the control of the commands of a machine program, using different data in the process. The data used are divided into: 1. Input: data entered into the computer and used as the conditions of the problem. 2. Current, or internal: data used to store and process information within the program. 3. Output: data generated by the program as a result of processing, such as text, graphics or video, which can be displayed. It is therefore always important to create an algorithm for the creation of the national corpus of the Uzbek language, since the algorithm is what controls the computer's work.

The national corpus of the Uzbek language must cover the lexical units that exist in Uzbek, such as synonyms, antonyms, homonyms, assimilated words and hierarchies of words; it must be possible to analyze automatically the morphological structure of a word, its formation, its meaning and its morphological features. In other words, in the process of compiling, lemmatizing and marking the corpus, it must be possible, on the basis of individual searches, to find and interpret the words of the corpus in its texts. To do this, the algorithm and linguistic modeling mentioned above must be carried out. For the automatic analysis of the morphological characteristics of words, it is necessary to use parts of M. Abzalova's research "Linguistic modules of the program for editing and analyzing texts in the Uzbek language" [2], A. Eshmominov's research "Synonymous database of the Uzbek national corpus" [17], Sh. Khamroeva's research "Linguistic bases for the creation of the author's corpus of the Uzbek language" [18], and N. Abdurahmanova's research "Linguistic support for the program for the translation of English texts into Uzbek" [1].

"Dictionary of synonyms of Uzbek language", "Explanatory dictionary of Uzbek words", "Dictionary of obsolete words of Uzbek language", "Dictionary of synonyms of Uzbek language", "Dictionary of Uzbek words", "Dictionary of synonyms of Uzbek language", "Dictionary of contradictory words of the Uzbek language", "Dictionary of word classification of the Uzbek language", "Educational etymological dictionary of the Uzbek language" and "Educational toponymic dictionary of the Uzbek language" can serve as linguistic support. Such dictionaries must first be reworked: the headwords are lemmatized, their series are delimited according to the nature of the words, and the members of each lemma series are linked to one another. Only then can the revised dictionary form the basis of the software for the programmer.

In the final stage, texts prepared with meta-markup and morphological markup undergo several more automatic transformations. The following programs, written in Perl, are used:

1) The converter converts the working markup format to the final format. It converts the morphological analysis given in parentheses to the correct format <w lex =.....gr =....>. It also checks for some spelling errors in order to further improve the quality of the search engine, converts names into Latin script, adds missing characters and identifies different forms of the verb;

2) Semantic markup program (Semmarkup). The program adds basic semantic features to words using a special semantic dictionary. This makes semantic search in the corpus much easier. The semantic dictionary is formalized as a table: the first column contains a lexeme or a phrase, and the remaining columns contain semantic features. After the program compares the morphological features of the word with the dictionary and finds a match, it copies the semantic features into the sem attribute of the <w> tag. With polysemous words, however, certain errors may occur in the semantic search;

3) Statistical programs (Gramstat, Metastat). These programs are designed to collect statistics on the distribution of grammatical features and meta text attributes in texts. This makes it possible to find errors in the markup quickly. The Gramstat program reports the distribution for each part of the morphological analysis separately (lexeme, word group, and grammatical features of the lexeme and of the word form).
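The converter and the grammatical statistics described above can be sketched as follows. The original tools are written in Perl and their working format is not given in the article, so this Python fragment is only an illustration under stated assumptions: the working format `word(lemma=tag1,tag2)`, the tag names `N`, `pl`, `V`, `past` and the function names `convert` and `gram_stats` are all hypothetical.

```python
import re
from collections import Counter
from xml.sax.saxutils import escape, quoteattr

# Hypothetical working format: word(lemma=tag1,tag2)
ANALYSIS = re.compile(r"(\w+)\(([^=()]+)=([^()]*)\)")


def convert(line):
    """Rewrite every word(lemma=tags) group as a <w lex=... gr=...> element."""
    def repl(m):
        word, lex, gr = m.groups()
        return f"<w lex={quoteattr(lex)} gr={quoteattr(gr)}>{escape(word)}</w>"
    return ANALYSIS.sub(repl, line)


def gram_stats(line):
    """Count how often each grammatical tag occurs in the working text."""
    counts = Counter()
    for m in ANALYSIS.finditer(line):
        counts.update(tag for tag in m.group(3).split(",") if tag)
    return counts


working = "bolalar(bola=N,pl) keldi(kel=V,past)"
print(convert(working))
print(gram_stats(working))
```

A tag that occurs only once or twice in a large corpus is a natural candidate for a typing error, which is how such statistics help to find mistakes in the markup.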

The above technology helps automate complex processes in the preparation of corpus texts. Some operations (cleaning of texts, removal of homonymy, meta-markup) are not automated at all, but a number of service tools have been developed for them, which makes the work much easier. From the start, the data encoding was deliberately kept simple so that the additional marks would not interfere with editing the text. The complex final markup format is produced automatically at the last stage.

The Russian National Corpus, the Corpus of Contemporary American English, the Oxford English Corpus and the Czech National Corpus are among the corpora that have been established worldwide. Uzbekistan, however, has not yet created such a linguistic foundation. Although Ziyonet currently has an electronic library, it has no system for processing texts automatically or for searching on the basis of different features of the text. It is not intended for vocabulary or language learning, and the text cannot be heard aloud. In the national corpus program and its database, a system for automatic text processing and for searching by several features is being established. Rarely used words, phrases and combinations become very easy to find, and their usage and spelling easy to check. The learner is able to hear the text aloud. This opens up possibilities for targeted education. A key task for the corpus is marking, or annotation (linguistic analysis). Marking means assigning special tags to texts and their components; the tags are divided into linguistic and extralinguistic ones. Currently, there are the following types of markup: morphological, semantic, syntactic, anaphoric, prosodic, discourse and others [11]. Extralinguistic markup comprises marks that reflect the formatting of the text (chapter, paragraph, section, etc.) and marks that carry information about its author.

Analysis and results. Most modern markup languages are based on SGML / XML, in which a marked-up text comprises two parallel data layers: visible (the text itself) and hidden (the tags, or markup) [11]. The hidden part of the information is placed inside the text but enclosed in special markers <...>, which separate it from the visible text. Unlike external methods of annotation (e.g. comments), the markup is always incorporated into the text and is an integral part of it. Deeper levels of structural analysis are used in some corpora. In particular, some small corpora are annotated on the basis of a complete syntactic analysis. Such corpora are usually called deeply annotated or syntactic corpora. For example, a syntactic markup is itself like a large tree. We know that manual analysis of texts is an expensive and time-consuming task. Currently, various software analysis tools are openly (directly) accessible on Russian and foreign sites. They are divided into stand-alone (independent) programs and websites. It should be noted that in recent years developers have focused on web applications. These systems have several advantages: a single document can be analyzed (marked) by multiple users at once; no additional software has to be installed apart from the browser; access rights can be restricted; and the marking process can be monitored. In particular, let us pay attention to the process of analyzing a text from the story "Speech" by A. Qahhor. The text is as follows: "You don't love me, you're not happy with our marriage, I've been waiting until this hour, this minute, you haven't said a word, it's been a year since we put our heads on a pillow ...

The speaker really forgot about it, but he was talking."

The text mentioned above is distinguished by the following features:

Table 1

№ Type according to the sentence structure
1. [simple sentence] <cr>, </cr>
2. [uyushgan gap] <yr>, </yr>
3. [complex sentence] <kt>, </kt>

№ Type of sentence according to the purpose of expression
1. [declarative sentence (darak gap)] <gr>
2. [interrogative sentence (so'roq gap)] <cr>
3. [imperative sentence (buyruq gap)] <6r>

№ Depending on whether or not the subject is represented in the linguistic construction of the sentence
1. [sentence with a subject (egali gap)] <E+>
2. [subjectless sentence (egasiz gap)] <E->
   [indefinite-personal sentence (shaxsi noma'lum gap)] <m.H.r>
   [nominative sentence (atov gap)] <a.r>
   [semantically and functionally formed sentence (semantik-funksional shakllangan gap)] <c.$m.r>

№ According to the participation of the primary and secondary parts of the sentence
1. [unextended sentence (yig'iq gap)] <nr>
2. [extended sentence (yoyiq gap)] <ër>

№ According to the presence of parts that do not make a grammatical connection with the sentence
1. [address (undalma)] <y>, </y>
2. [parenthetical insertion (kiritma)] <K>, </K>

The morphological marking system includes word form, lemma and tag. A word form is a morphological unit in a selected text. The first step in marking a word is to lemmatize it, that is, to bring out the lexeme form of the word. The most difficult step in marking inflected languages is lemmatization, that is, attaching the lexeme form of a word to the word as a tag, because in inflected languages the grammatical meaning of the word is fused with its stem. Unlike in inflected languages, lemmatization in an agglutinative language is much easier [4]. Initially, the analysis options for word forms are given in the form of a list, and the annotator selects the correct option or edits an existing one. The editor makes it easy to navigate the text and make global changes. Thus, the marking application works in a familiar environment and makes effective use of all the features of this editor. For the purpose of visual separation, different elements of the text are shown in different colors and styles. In particular:

- the analysis markup and the command variants are formatted as hidden text and are usually not visible in normal mode;

- word forms are shown in different colors depending on the number of analysis options: zero, one or more.

The part of the word without grammatical forms is the same as the stem, or base lemma. The lemma is marked in the form <*>. If the lemma in all the word categories is based on this principle, that is, the principle that "the root part of the word is equal to the lemma", the verb lemma in the verb group is given in the form of the imperative mood. In dictionary entries, the verb is given in the form of an action name: <go>. However, this form is not appropriate for the corpus, because a search in the corpus text looks for the <bar> form, not the <go> form of the word. The verb lemma is therefore given as <taught>, not <be>, shown as <blind>, received as <received> [17]. The marking process requires writing 5 to 10, sometimes even more, morphological tags (comments) for each word.
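The idea of two parallel layers, the visible text and the hidden markup that carries the word form, lemma and tag, can be illustrated with Python's standard `xml.etree.ElementTree` module. This is only a sketch: the sample sentence, the tag values `N,pl` and `V,past`, and the function name `split_layers` are assumptions, while the `<w lex=... gr=...>` shape follows the format mentioned in the article.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up sentence in the <w lex=... gr=...> style.
SAMPLE = (
    '<s><w lex="bola" gr="N,pl">bolalar</w> '
    '<w lex="kel" gr="V,past">keldi</w>.</s>'
)


def split_layers(xml_text):
    """Return (visible text, list of (word form, lemma, tags))."""
    root = ET.fromstring(xml_text)  # raises ParseError on malformed markup
    visible = "".join(root.itertext())
    hidden = [(w.text, w.get("lex"), w.get("gr")) for w in root.iter("w")]
    return visible, hidden


visible, hidden = split_layers(SAMPLE)
print(visible)
print(hidden)
```

Because the markup is embedded in the text, removing the tags recovers the visible layer, and the parser rejects markup that is not well formed.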

The main advantage of SGML / XML over other markup languages (TeX, RTF) is its strict syntax of markup commands: a clear distinction between attributes and elements, a clear indication of element boundaries, self-documentation, and automatic verification that the markup is syntactically correct.

The most authoritative standards for corpus data encoding are TEI (Text Encoding Initiative) [5], XCES (XML Corpus Encoding Standard) [8] and EAGLES (Expert Advisory Group on Language Engineering Standards) [10]. In particular, TEI is recognized as a well-developed standard that defines rules for representing different types of texts and elements of textual information, with particular emphasis on structure, headings, type of text (prose, poetry, drama), pages, quotations, footnotes or references (notes, comments), corrections, tables, formulas, special characters, linguistic annotations, etc. A special section of the standard is devoted to the rules for encoding corpora. Although TEI is not specifically tailored for corpus applications, it is often used in them in conjunction with similar standards, for example in the British National Corpus (BNC), the Czech National Corpus, the Hungarian National Corpus, etc. The XCES standard is an extended application of TEI, designed solely for corpora and intended to define labels specific to a corpus.

But when we studied the universal TEI and XCES standards in detail, we found that they were too complex, redundant and inconvenient for the mass markup of texts. The full provisions of the TEI are very broad and not always justified, and it is therefore quite difficult to comply with all the requirements of this standard. The format is not compact, and the size of the content usually increases. The format also loses clarity: for example, it is suggested that meta-attributes be written as text inside a tag, so that errors occur when the markup is removed to return the text to its original state.

It is also possible to restrict oneself to a TEI application by rejecting "redundant" tags. A minimum set of tags is selected from the TEI to represent the corpus: <text> for the text, <p> for the paragraph, <s> for the sentence and <w> for the word, with the morphological analysis written as a <w ana = ...> attribute. However, such a representation does not fully comply with the corpus markup standard. It is reminiscent of a simplified version of HTML.
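The reduced tag set described above can be sketched with the standard `xml.etree.ElementTree` module, using <p> for paragraphs and <s> for sentences as in TEI. The function name `build_text`, the nested-list input and the `lemma;tags` layout of the `ana` value are assumptions for illustration; the result is not a conformant TEI document (it has no header or namespace).

```python
import xml.etree.ElementTree as ET


def build_text(paragraphs):
    """Build <text><p><s><w ana="...">word</w>...</s></p></text>.

    `paragraphs` is a list of paragraphs, each a list of sentences,
    each a list of (word, analysis) pairs.
    """
    text = ET.Element("text")
    for para in paragraphs:
        p = ET.SubElement(text, "p")
        for sent in para:
            s = ET.SubElement(p, "s")
            for word, ana in sent:
                w = ET.SubElement(s, "w", {"ana": ana})
                w.text = word
                w.tail = " "  # keep words separated in the visible layer
    return ET.tostring(text, encoding="unicode")


doc = build_text([[[("bolalar", "bola;N,pl"), ("keldi", "kel;V,past")]]])
print(doc)
```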

The complexity of XML formats is not the main problem; the major problem is the almost complete lack of widely available programs for preparing, processing, indexing and searching such data. Linguists have only relatively simple programs available to them: XML parsers, editors, converters and linear search programs are widely used. Such a set of programs is not enough for a corpus of millions of words. Of course, internal tasks such as preparing and marking up the corpus can be solved with the help of specially written converters, macros and other tools.

The data representation format of the corpus is developed on the basis of existing coding standards (TEI, XCES). HTML belongs to the SGML / XML family, is the most common format, and can be used in many applications [19]. Today, search engines are able to understand the semantics and structure of HTML tags.

HTML is a very simple format that imposes minimal requirements in terms of content and markup size, and in practice many of its commands need not be used. It is a very convenient and compact format for manual editing and visual perception. The standard itself has no tags for language units, but HTML allows non-standard tags to be used, and this problem is resolved through a special setup (correction) of the search server.

The corpus format is a subset of HTML, with some special tags added for linguistic units. This format specifies the coding requirements for important text information and includes:

1) meta text attributes;

2) text structure elements (title, paragraph, poems, footnotes or references at the bottom of the page (notes, comments) and tables);

3) linguistic units (sentences, words);

4) lexical information (grammatical, semantic signs);

5) text formatting parameters, special characters, etc. [20].

Meta text attributes are written into texts at different points, so steps 2 and 3 can be done in parallel or in any order. But the file name of each text must be fixed and recorded. No actions such as renaming a link or a file are performed afterwards, as they could disrupt the operation of the entire system. For storing metadata, simple Excel spreadsheets with a predefined structure are used, with the first column containing the name of the file (with a clearly specified path) and the other columns containing the metadata attributes and processing information. This makes it possible to use Excel's built-in tools effectively, for example for search, filtering, analysis and data processing (lists, auto-filling, statistics), and makes searching much easier. The tables must be stored in a text format that Excel can read. This allows the file to be opened not only in Excel but also in other spreadsheet programs, and increases processing efficiency.
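A meta-table of this kind, stored as text with the file name in the first column, could look like the sketch below. The article does not specify the text format, so tab-separated values are an assumption, as are the column names, the file paths and the function names.

```python
import csv
import io


def write_meta_table(rows, fieldnames):
    """Serialize metadata rows as tab-separated text (file name first)."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def read_meta_table(text):
    """Parse the tab-separated text back into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def filter_rows(rows, **conditions):
    """Keep the rows whose attributes equal all the given values."""
    return [r for r in rows if all(r.get(k) == v for k, v in conditions.items())]


# Hypothetical paths and attribute names.
ROWS = [
    {"file": "texts/story_01.html", "author": "A. Qahhor", "genre": "story"},
    {"file": "texts/poem_01.html", "author": "unknown", "genre": "poem"},
]
table = write_meta_table(ROWS, ["file", "author", "genre"])
print(table)
print(filter_rows(read_meta_table(table), genre="story"))
```

A plain tab-separated file can be opened by Excel and by other spreadsheet programs, and it is equally easy to process with scripts.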

Theoretically, metadata can be stored separately from each text, but according to the HTML rules the data must be stored in the file header so that the Yandex-server can index it. When metadata are stored separately, there is always the problem of keeping the meta-tables and the texts synchronized with each other.

Suggestions. The following programs are used when metadata are stored separately:

1) Metas creates the meta-tables by collecting the meta text attributes from the file headers. The tables can then be modified manually in Excel. At the initial processing stage, some metadata, such as the author's name, title and date of creation, can be added to the text. At the final stage, the Metas.bat program collects all attributes and completes the verification phase.

2) Meta.txt takes the meta text attributes from the modified meta-tables and transfers them to the existing text. This program checks that the file is available and updates its header. In the tables, multiple values of an attribute are separated by a " " symbol. When they are transferred to the text, each value appears as a separate attribute. Metadata attributes can therefore move freely between the text and the meta-tables. Meta-markup, in turn, can be carried out iteratively, with several cycles of verification.

3) MetaTest checks the correctness of the meta-table. The attribute values in the table are compared with those listed in the templates. The program marks incorrect values with a "#" character so that they can be checked and corrected manually.

All of the above programs are written in Perl.
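The collection and checking steps can be illustrated with a short sketch. It is written in Python for readability and is not a port of the Perl programs. It assumes that attributes are stored as `<meta name="..." content="...">` elements in the HTML header, that `|` separates multiple values (the actual separator symbol is not preserved in the text), and that the templates are simple sets of allowed values. The names `MetaCollector`, `collect_meta`, `check_row` and `TEMPLATES` are hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.attributes = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr_map = dict(attrs)
            name = attr_map.get("name")
            content = attr_map.get("content")
            if name and content is not None:
                self.attributes[name] = content


def collect_meta(path, html_text):
    """Build one meta-table row: the file path first, then the header attributes."""
    parser = MetaCollector()
    parser.feed(html_text)
    return {"path": path, **parser.attributes}


# Hypothetical templates: allowed values for each controlled attribute.
TEMPLATES = {"genre": {"fiction", "news", "science"}}

SEPARATOR = "|"  # assumed; the source does not preserve the real symbol


def check_row(row):
    """Return a copy of the row with values outside the templates prefixed by '#'."""
    checked = {}
    for name, value in row.items():
        allowed = TEMPLATES.get(name)
        if allowed is None:
            checked[name] = value  # free-text attribute, nothing to check
            continue
        parts = [part.strip() for part in value.split(SEPARATOR)]
        checked[name] = SEPARATOR.join(
            part if part in allowed else "#" + part for part in parts
        )
    return checked


# Invented sample document, used only for illustration.
sample = (
    '<html><head><title>Sample</title>'
    '<meta name="author" content="Author A">'
    '<meta name="genre" content="fiction|poetry">'
    '</head><body>Text</body></html>'
)
row = collect_meta("texts/0001.html", sample)
checked = check_row(row)
print(checked["genre"])
```

Marking a bad value in place, rather than rejecting the whole row, keeps the table editable in a spreadsheet, so an editor can search for "#" and correct entries by hand.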

At the final stage of processing, texts that already carry meta-markup and morphological markup undergo several more automatic transformations. The converter checks for certain markup errors and, to further improve search quality, converts the morphological analysis given in parentheses into the proper format <w lex =.....gr=....>.
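A converter of this kind might look like the following sketch. The working notation is not specified in the text, so the input form `word(lemma=grammar)` is purely an assumption, as are the grammatical labels, the whitespace tokenization and the function name `convert_line`. Only the target `<w lex=... gr=...>` shape comes from the description above.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed working notation: form(lemma=grammar), e.g. kitoblar(kitob=N,pl).
ANALYSIS = re.compile(r"(?P<form>[^\s()]+)\((?P<lex>[^=()]+)=(?P<gr>[^()]*)\)")


def convert_line(line):
    """Convert inline analyses to <w> elements and report malformed tokens."""
    converted = []
    errors = []
    for token in line.split():
        match = ANALYSIS.fullmatch(token)
        if match:
            converted.append("<w lex=%s gr=%s>%s</w>" % (
                quoteattr(match.group("lex")),
                quoteattr(match.group("gr")),
                escape(match.group("form")),
            ))
        else:
            # A stray parenthesis signals a broken analysis: keep the token
            # unchanged and report it for manual correction.
            if "(" in token or ")" in token:
                errors.append(token)
            converted.append(escape(token))
    return " ".join(converted), errors


markup, problems = convert_line("kitoblar(kitob=N,pl) yangi uy(uy=N")
print(markup)
print(problems)
```

Returning the list of suspicious tokens alongside the converted text matches the idea that the converter both transforms the markup and checks it for errors.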


The semantic markup program adds basic semantic features to words using a special semantic dictionary, which greatly facilitates semantic search in the corpus. The semantic dictionary is formalized as a table: the first column contains a lexeme or phrase, and the remaining columns contain its semantic features. When the program finds a dictionary entry that matches the word's morphological features, it copies the semantic features into the sem attribute of the <w> tag. For polysemous words, however, this can lead to various errors in semantic search.
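The lookup step can be sketched as follows. The dictionary contents, the feature names and the rule that the first item of `gr` is the part of speech are all assumptions made for illustration, and `add_semantics` is a hypothetical name. The example uses the Uzbek homonym yoz ("summer" as a noun, "write" as a verb stem) to show why the morphological features are compared before the semantic features are copied.

```python
import re

# Hypothetical semantic dictionary: (lexeme, part of speech) -> semantic features.
SEM_DICT = {
    ("kitob", "N"): ["concrete", "artifact"],
    ("yoz", "N"): ["time", "season"],
    ("yoz", "V"): ["action"],
}

# Matches opening tags of the form <w lex="..." gr="...">.
W_TAG = re.compile(r'<w lex="(?P<lex>[^"]*)" gr="(?P<gr>[^"]*)">')


def add_semantics(text):
    """Add a sem attribute to every <w> tag whose lexeme and POS are in the dictionary."""

    def annotate(match):
        lex = match.group("lex")
        gr = match.group("gr")
        pos = gr.split(",")[0]  # assumption: the first grammatical label is the POS
        features = SEM_DICT.get((lex, pos))
        if not features:
            return match.group(0)  # no entry: leave the tag unchanged
        return '<w lex="%s" gr="%s" sem="%s">' % (lex, gr, " ".join(features))

    return W_TAG.sub(annotate, text)


tagged = add_semantics(
    '<w lex="yoz" gr="N,sg">yoz</w> <w lex="yoz" gr="V,past">yozdi</w>'
)
print(tagged)
```

Keying the dictionary on the lexeme together with its part of speech separates homonyms of different word classes. It does not separate several senses of one word within the same class, which is where the search errors mentioned above would arise.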

The technology described above helps to automate complex operations in preparing texts for the corpus. Some operations cannot be automated at all (cleaning texts, resolving homonymy, meta-markup), but a set of service tools has been developed that makes them much easier. From the very beginning, the data are encoded in a deliberately simple format, and the complex final markup format is generated automatically at the last stage.

In conclusion, linguistic modeling plays an exceptionally important role in forming the linguistic base of the national corpus. It is therefore necessary to create algorithms that serve as the basis for explicit instructions executed by the computer. In developing a morphological markup algorithm, it is important to design specific linguistic model forms for marking each word group.

Raising the international status of the Uzbek language, bringing it to the level of a world language of communication, supporting the learning and teaching of Uzbek abroad, and expanding and refining the capabilities of the national language all depend directly on the national corpus. The practical significance of this work is therefore a key factor in the language's development and survival.

References

1. Abduraxmonova N.Z. Linguistic support of the program for translating English texts into Uzbek (on the example of simple sentences): Doctor of Philosophy (PhD) il dis. aftoref. - Tashkent, 2018.

2. Abjalova M. Linguistic modules of the program of editing and analyzing texts in the Uzbek language (for the program of editing texts in official and scientific style): Doctor of Philosophy (PhD). dis. -Fergana, 2019. - P.22.

3. Avliyokulov N.X. Technology of modular teaching of professional sciences. - T.: Yangi asr avlodi, 2004. -106 p Stepanov A.N. 6.3. Archiving of file objects // Informatics: basic course: for students of humanities specialties of universities. - Peter, 2010. -719 p.

4. Vanyushkin A. S., Grashchenko L. A. Assessment of algorithms for the selection of key words: tools and resources // New information technologies in automated systems. - 2017. - № 20. - S.. 95-102.

5. Zakharov V.P. Corpus Linguistics: Uchebno-metod. posobie. - SPb., 2005. - 48 p.

6. Kasyanov V. N., Kasyanova E.V. Introduction to programming. - http://pco.iis.nsk.su/ICP

7. Kasyanova E.V. Yazyk programming Zonnon for platforms .NET // Programmnye sredstva i matematicheskie osnovy informatiki. - Novosibirsk: ISI SO RAN, 2004. - P.189-205.

8. Kutuzov A.B. Corpus linguistics. - (Electronic resource): License Creative commons Attribution Share-Alike 3.0 Unported (Electronic resource) -//lab314.brsu.by/kmp-lite/kmp-video/CL/CorporeLingva.pdf

9. Manturov O.V. and dr. Explanatory dictionary of mathematical terms. -M .: Prosveshchenie, 1965. -509 p.

10. Melchuk I.A. Poryadok slov pri avtomaticheskom sinteze russkogo slova (predvaritelntie soobshcheniya) // Nauchno -texnicheskaya informatsiya. 1985, №12. -S.12-36.

11. Nedoshivina E.V. Programs for working with corpus texts: a review of the main corpus managers. Uchebno-metodicheskoe posobie. - St. Petersburg. -2006. 26 p.

12. Safarova R.G. and b. Classification of pedagogical technologies used in the process of modular teaching in general secondary schools. / Methodical manual. - T.: State Scientific Publishing House "National Encyclopedia of Uzbekistan", 2016. -176 p.

13. Stepanov A.N. 6.3. Archiving of file objects // Informatics: basic course: for students of humanities specialties of universities. - Peter, 2010. - 719 p.

14. Explanatory dictionary on theoretical mechanics. -M .: MFTI. 2007. - 68 p.

15. Toirova G. About the technological process of creating a national corps. // Foreign languages in Uzbekistan. Electronic scientific-methodical journal. -Tashkent. 2020, № 2 (31), -B.57- 64. https://journal.fledu.uz/uz/ 2-31-2020

16. National encyclopedia of Uzbekistan. 5 volumes. Volume 1 - Tashkent: State Scientific

Publishing House of the National Encyclopedia of Uzbekistan, -2006. - B.201.

17. Eshmo'minov A. Dictionary of synonyms of the National Corps of the Uzbek language: Doctor of Philosophy (PhD) in Ph.D. aftoref. - Karshi, 2019.

18. Hamroeva Sh. Linguistic bases of creation of the author's corpus of the Uzbek language: Doctor of Philosophy (PhD) in philology. aftoref. -Karshi, 2018. - 52 p.

19. Leech G. The State of Art in Corpus Linguistics // English Corpus Linguistics / Aimer K., Altenberg K. (eds.) - London, 1991. - P. 8-29.

20. Fries Ch.C. The structure of English. An introduction to the construction of English sentences. -L., 1969. - S.98.

21. Zemanek H. Lecture Notes in Computer Sciece 122 (1981), 1-81 [elek.res.] Http://elganzua 124.github.io/ taocp / OEBPS / Text / ch01.html
