Developing the lexical search engine’s architecture for the national corpora of the Chuvash language using Java

Zheltov V.; Zheltov P.; Gubanov A.; Skvortsov A.; Gorshkov Yu.


Zheltov V.,

Ph.D., Professor, Head of Department, Department of Computer Technology, I.N. Ulyanov's Chuvash State University

Zheltov P.,

Ph.D., Associate Professor, Department of Computer Technology, I.N. Ulyanov's Chuvash State University

Gubanov A.,

Ph.D., Professor, Department of Russian as a Foreign Language, I.N. Ulyanov's Chuvash State University

Skvortsov A.,

Assistant, Department of Computer Technology, I.N. Ulyanov's Chuvash State University

Gorshkov Yu.,

Senior Lecturer, Department of Computer Technology, I.N. Ulyanov's Chuvash State University

Abstract

The article describes the architecture and functionality of a search engine for the national corpus of the Chuvash language, taking into account the requirements and peculiarities of such a corpus. A search engine is a special program for analysing a national corpus by means of different queries. The search engine comprises the following applications: a data collection application, an indexing and search application, and a source data structuring application. It was written in Java and runs as a desktop application that can be installed on the computer of a linguist-researcher. Its advantages are a free license and relatively high operating speed. Its drawback is the need to install additional software. For this reason the architecture requires some changes, and the developers are currently working on these issues and on extending the functionality of the search engine, especially the morphological analyzer of the Chuvash language, where many problems remain unresolved.

Keywords: search engine, text corpora, text layout, query, indexing.

At present, national text corpora are being created to preserve the textual and lexical wealth of national languages. They are large structured electronic repositories of texts that support quick search on several language levels: morphemic, morphological, syntactic, textual and semantic.

Quick search in such corpora is performed by search engines.

A search engine is a special program for analysing a national corpus by means of different queries.

The main task of the search engine is to let researchers collect literary texts in an automated data store, study them from various points of view, and use the texts or the results of their analysis in their research.

Accordingly, the following formal objectives arise:

(1) collecting and indexing literary texts;

(2) searching the literary texts;

(3) analysing the literary texts that were found;

(4) visualizing the literary texts that were found.

We assume that a system for collecting and structuring literary texts is necessary for the subsequent retrospective search for the words of a user's query in the sentences of literary works. The automated system is based on methods of text analysis. The system loads the data from the literary works, structures them, applies the text analysis methods, and then returns the analysis results to the user.

Taking these principles into account, we designed the search engine architecture for the national corpus of the Chuvash language and implemented the search engine itself.

The search engine was written in Java and currently runs as a desktop application.

The search engine is a system consisting of the following applications:

(1) data collection application;

(2) indexing and search application;

(3) source data structuring application.

Let's turn to the description of the developed components of the search engine.

1. Data collection application

The objective of data collection is to gather the texts of the literary works that the user is going to study.

The data collection application we propose consists of two parts: a client and a server. The client part is a web interface through which the user can upload the file of a literary work to the server together with all associated information (the author, the title of the work, etc.), browse all the works uploaded to the server, and remove works from the server. The server part handles incoming user requests and returns the results. To store the works, the server part uses the Google AppEngine platform (see https://developers.google.com/appengine/). In addition to the text of the work and the attributes specified by the user (the author, the title, etc.), the server automatically saves the upload time, the size, type and name of the file, and whether the work has already been processed by the structuring application.
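The upload, browse and remove operations and the automatically stored attributes can be illustrated with a minimal sketch. It is an assumption-laden illustration, not the authors' implementation: the class names `UploadedWork` and `WorkStore`, the field names, and the choice of the file name as the storage key are all hypothetical, and a plain in-memory map stands in for the Google AppEngine storage.

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical metadata record for one uploaded literary work (all names are illustrative).
final class UploadedWork {
    final String author;       // supplied by the user
    final String title;        // supplied by the user
    final String fileName;     // saved automatically
    final String fileType;     // derived from the file extension
    final long fileSizeBytes;  // derived from the uploaded content
    final Instant uploadedAt;  // time of upload
    final boolean structured;  // true once the structuring application has processed the work
    final String text;

    UploadedWork(String author, String title, String fileName, String fileType,
                 long fileSizeBytes, Instant uploadedAt, boolean structured, String text) {
        this.author = author;
        this.title = title;
        this.fileName = fileName;
        this.fileType = fileType;
        this.fileSizeBytes = fileSizeBytes;
        this.uploadedAt = uploadedAt;
        this.structured = structured;
        this.text = text;
    }

    // Returns a copy with a different "structured" flag; the record itself is immutable.
    UploadedWork withStructured(boolean value) {
        return new UploadedWork(author, title, fileName, fileType,
                fileSizeBytes, uploadedAt, value, text);
    }
}

// In-memory stand-in for the server-side storage (assumption: keyed by file name).
final class WorkStore {
    private final Map<String, UploadedWork> works = new LinkedHashMap<>();

    // Upload: combines user-supplied attributes with automatically derived ones.
    UploadedWork upload(String author, String title, String fileName, String text) {
        int dot = fileName.lastIndexOf('.');
        String type = dot >= 0 ? fileName.substring(dot + 1) : "";
        long size = text.getBytes(StandardCharsets.UTF_8).length;
        UploadedWork work = new UploadedWork(author, title, fileName, type,
                size, Instant.now(), false, text);
        works.put(fileName, work);
        return work;
    }

    // Browse all uploaded works in upload order.
    List<UploadedWork> list() {
        return new ArrayList<>(works.values());
    }

    // Remove a work; returns false if no such file was stored.
    boolean remove(String fileName) {
        return works.remove(fileName) != null;
    }

    // Called after the structuring application has processed the work.
    void markStructured(String fileName) {
        works.computeIfPresent(fileName, (name, work) -> work.withStructured(true));
    }
}
```

In this sketch a second upload with the same file name overwrites the first; a real server would more likely assign its own identifiers.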

2. Indexing and search application

Working with large amounts of data requires considerable computing power and disk space. The number of digital literary texts is large, so searching for the required sentences without preliminary text processing takes substantial computing resources and time. That is why the system indexes the sentences of the literary texts. Indexing provides the ability to add, remove and update documents in the data store (1 document = 1 sentence), which is later used for full-text information retrieval. This process is performed by a component called the indexer. Modern search systems use and constantly improve their own indexing algorithms, which are closed. There are many free indexing systems, but they often have relatively modest characteristics. The other component is the search component, which accepts a query and, by processing the data store, selects the data that match the query. In addition, it can calculate extra parameters for the search results (ranking, the degree of correspondence to the query, etc.).
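The "1 document = 1 sentence" idea with add, remove, update and search operations can be sketched with a small in-memory inverted index. This is only an illustration under stated assumptions: the class `SentenceIndex` and its methods are hypothetical, the sentence splitter is a naive punctuation-based rule rather than the text analysis method used in the system, and no ranking is computed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical inverted index in which every indexed document is one sentence.
final class SentenceIndex {
    // term -> ids of the sentences that contain it (sorted)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // sentence id -> sentence text
    private final Map<Integer, String> sentences = new HashMap<>();

    // Naive splitter (assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    static List<String> splitSentences(String text) {
        List<String> result = new ArrayList<>();
        for (String part : text.trim().split("(?<=[.!?])\\s+")) {
            if (!part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }

    // Lower-cases the sentence and splits it on anything that is not a letter or digit.
    private static List<String> tokens(String sentence) {
        List<String> result = new ArrayList<>();
        for (String t : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Adds a sentence; an existing sentence with the same id is replaced.
    void add(int id, String sentence) {
        remove(id);
        sentences.put(id, sentence);
        for (String t : tokens(sentence)) {
            postings.computeIfAbsent(t, k -> new TreeSet<>()).add(id);
        }
    }

    // Removes a sentence and all of its postings; unknown ids are ignored.
    void remove(int id) {
        String old = sentences.remove(id);
        if (old == null) {
            return;
        }
        for (String t : tokens(old)) {
            Set<Integer> ids = postings.get(t);
            if (ids != null) {
                ids.remove(id);
                if (ids.isEmpty()) {
                    postings.remove(t);
                }
            }
        }
    }

    // Update is remove followed by add.
    void update(int id, String sentence) {
        add(id, sentence);
    }

    // Returns the sentences containing the word, ordered by sentence id.
    List<String> search(String word) {
        List<String> hits = new ArrayList<>();
        Set<Integer> ids = postings.getOrDefault(word.toLowerCase(Locale.ROOT),
                Collections.<Integer>emptySet());
        for (int id : ids) {
            hits.add(sentences.get(id));
        }
        return hits;
    }
}
```

A lookup touches only the posting list of the query word instead of scanning every sentence, which is the reason for indexing before searching.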

In the data collection application, the Lucene indexing system (http://lucene.apache.org) is used because:

• the full range of Lucene's features is available through a Java API (the data collection system is written in Java);

• Lucene is designed to be embedded in other applications;

• Lucene builds a non-monolithic index and makes it possible to work with it.
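What a non-monolithic index means can be shown with a conceptual sketch: the index is a list of independent segments, each built from one batch of sentences, a query is answered by combining the results of all segments, and segments can later be merged. The classes `Segment` and `SegmentedIndex` are hypothetical and do not reproduce Lucene's API or its on-disk format; they only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// One segment: a small inverted index that is never modified after it is built.
final class Segment {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    private Segment() {
    }

    // Builds a segment from a batch of (sentence id -> sentence text) pairs.
    static Segment fromBatch(Map<Integer, String> batch) {
        Segment segment = new Segment();
        for (Map.Entry<Integer, String> e : batch.entrySet()) {
            String lower = e.getValue().toLowerCase(Locale.ROOT);
            for (String term : lower.split("[^\\p{L}\\p{Nd}]+")) {
                if (!term.isEmpty()) {
                    segment.postings.computeIfAbsent(term, k -> new TreeSet<>()).add(e.getKey());
                }
            }
        }
        return segment;
    }

    // Builds one new segment that contains the postings of all given segments.
    static Segment merge(List<Segment> parts) {
        Segment merged = new Segment();
        for (Segment part : parts) {
            for (Map.Entry<String, Set<Integer>> e : part.postings.entrySet()) {
                merged.postings.computeIfAbsent(e.getKey(), k -> new TreeSet<>())
                        .addAll(e.getValue());
            }
        }
        return merged;
    }

    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.<Integer>emptySet());
    }
}

// The whole index is a list of segments; new data never rewrites existing segments.
final class SegmentedIndex {
    private final List<Segment> segments = new ArrayList<>();

    // Indexing a new batch only appends a segment.
    void addBatch(Map<Integer, String> batch) {
        segments.add(Segment.fromBatch(batch));
    }

    // A query is the union of the per-segment results.
    Set<Integer> search(String term) {
        Set<Integer> hits = new TreeSet<>();
        for (Segment segment : segments) {
            hits.addAll(segment.lookup(term));
        }
        return hits;
    }

    // Collapses all segments into one; search results stay the same.
    void mergeAll() {
        Segment merged = Segment.merge(segments);
        segments.clear();
        segments.add(merged);
    }

    int segmentCount() {
        return segments.size();
    }
}
```

The design choice illustrated here is that adding texts is cheap because existing segments are left untouched, at the cost of consulting several segments per query until they are merged.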

[Figure 1: screenshot of the index folder (workspace3.7, TextAnalyser, data, index; 58 items) in Windows Explorer, showing numbered groups of index files such as _2.fdt, _2.fdx, _2.fnm, _2.frq and _2.nrm, together with the segments files.]

Figure 1. Index files in Explorer (used in demo application)

3. Application of structuring source data

The data received by the collection system are initially "raw", and it is impossible to work with them in this form. Before analysing the collected data, it is necessary to structure them. This raises two questions: (1) how should the data be structured, and where should one begin? (2) what degree of structure is necessary?

To answer these questions, we fix the requirements that the resulting structure must meet:

• the logical entities must be distinguished;

• the relationships between these entities must be established;

• access to the logical entities must be sufficiently fast;

• redundancy in data storage must be avoided to save memory.

These requirements lead to the use of a relational database, which in turn requires a relational database management system (DBMS). Our text analysis system uses the MySQL DBMS, which meets the following requirements: (1) free distribution; (2) operation on Windows; (3) availability of drivers for Java applications; (4) the ability to be used as an SQL server. MySQL is distributed under the GNU General Public License (GPL). The main advantages of MySQL are security, speed, reliability, portability to other platforms, and low demands on computing resources.

However, a DBMS alone is not enough to meet the target requirements. It is necessary to develop logical and physical data models for our problem domain, using the theory of normal forms.

The logical entities stored in the database are the author, the literary text and the sentence. The author is the person who creates literary texts. Each literary text can be associated with several authors, and each author can be associated with several literary texts (many-to-many relationship). A literary text consists of sentences. Each sentence is associated with one literary text (many-to-one relationship).

Below is the list of the main objects of the physical model of the database. For each table, the name is followed by a brief description and then by the rows of fields: the name of the field, the type of the field and its brief description.

AUTHOR

Author

Field             Type      Description
id (key)          Integer
name              String    Name
surname           String    Surname
patronymic        String    Middle name

WORK

Literary work

Field             Type      Description
id (key)          Integer
date of creation  Date      Date of creation
url               String    URL of the work (where it was downloaded from)
title             String    Title of the work
text              String    Text of the work (used for caching if the text is small)

SENTENCE

Sentence

Field             Type      Description
id (key)          Integer
text              String    Text of the sentence (used for caching if the text is small)
work_id           Integer   Literary work (foreign key to the work)
prev_snc_id       Integer   Link to the preceding sentence in the literary text (foreign key)
next_snc_id       Integer   Link to the following sentence in the literary text (foreign key)

AUTH_WORK

Defines the many-to-many relationship between the authors and the literary works

Field             Type      Description
id (key)          Integer
auth_id           Integer   Author (foreign key to the author of the work)
work_id           Integer   Literary work (foreign key to the work)

Once the lexical database with its tables (the physical data model), which reflects our logical entities (the logical data model), has been created, it must be filled with data. The structuring application works according to the following algorithm:

Step 1. Iterate over all the uploaded works. Perform the following steps for each work.

Step 2. Fill in the table of works, WORK.

Step 3. Fill in the table of authors, AUTH.

Step 4. Fill in the author-to-work table, AUTH_WORK.

Step 5. The text of the work is divided into sentences by the text analysis method. For each sentence: (1) the sentence is written to the SENTENCE table; (2) an index document is built from the sentence (the sentence is indexed), and the "ID" field of the document is set equal to the "id" field of the corresponding row of the SENTENCE table.

Thus, the following applications of the search engine were developed:

1) an application for collecting literary texts;

2) an application for indexing literary texts;

3) an application for structuring the literary works and the database.

Conclusions

Consequently, at this stage the lexical base of the language was developed. It can be supplemented with literary works, which are structured automatically.

The publication was made in the scope of the scientific project №15-04-00532 supported by the Russian Foundation for Humanities (RFH).

