Research article: 'Some elaboration methods for written and spoken multilingual databases'

Specialty: Linguistics and literary studies

CC BY
Keywords
LINGUISTIC DATABASES / WRITTEN LANGUAGE DATABASES (WLDB) / SPOKEN LANGUAGE DATABASES (SLDB) / TERMINOLOGY DATABASES / NANOTECHNOLOGY / CLOUD TECHNOLOGY

Abstract of the research article in linguistics and literary studies. Authors: Potapova Rodmonga, Potapov Vsevolod

This paper presents some elaboration methods for written language databases (WLDBs) and spoken language databases (SLDBs) in Russia. It considers the development of written and spoken databases from a historical point of view and with regard to new trends in techniques, themes, and methods of investigation and annotation. For WLDBs, the terminology of written text data is investigated for Russian and English in the field of modern nanotechnology. For multilingual SLDBs, the authors explore the application of acoustic methods of segmentation and transcription of the speech flow. The verbal inventory of the WLDBs and SLDBs includes information items in Russian and English (for WLDBs) and, for the first time, in some other languages of the former USSR (for SLDBs), with a view to using these databases with cloud storage technology.


SOME METHODS FOR DEVELOPING MULTILINGUAL LINGUISTIC DATABASES FOR SPOKEN AND WRITTEN LANGUAGE

The article covers some methods for developing linguistic databases for spoken and written language that are being pursued by research teams in Russia. The problem area is presented both in historical perspective and from the standpoint of current trends in research methods and annotation techniques. The formation of written language databases is examined through a study of Russian-English terminological correspondences in the field of modern nanotechnology. For multilingual spoken language databases, the authors consider acoustic methods of automated segmentation and transcription of the speech flow. The verbal content of the written and spoken language databases includes units of Russian and English and, for the spoken language databases, of some languages of the countries of the former USSR, with a view to integrating these databases with modern cloud technologies.


Vestnik Moskovskogo universiteta. Seriya 9. Filologiya (Moscow University Bulletin. Series 9. Philology). 2019. No. 3

Rodmonga Potapova, Vsevolod Potapov

SOME ELABORATION METHODS FOR WRITTEN AND SPOKEN MULTILINGUAL DATABASES¹

Moscow State Linguistic University

38 Ostozhenka, Moscow, 119034

Lomonosov Moscow State University

1 Leninskie Gory, Moscow, 119991


Key words: written language databases (WLDB); spoken language databases (SLDB); terminology databases; nanotechnology; cloud technology.

Introduction

Written and spoken language databases (WLDBs and SLDBs, respectively) are a versatile, many-sided field of applied, experimental, and mathematical linguistics that provides the resource base for creating data arrays, collections, and databases from natural language texts. WLDBs and SLDBs now drive the development of such important areas as recognition and understanding of texts in automated and control systems; creation of automated systems for converting written texts into spoken ones and vice versa; and development of terminological dictionaries and computer corpora for various languages and dialects of the world. Thus, WLDBs and SLDBs "serve" a number of branches in the theory and practice of verbal communication based on digital technologies. WLDBs and SLDBs began to form in the second half of the 20th century. Since then, dozens of text data banks have been created, primarily for English and later for other European languages and language pairs, and hundreds of dictionaries have been compiled from written text corpora. Most WLDB products from the early stages of this development in Russia are corpora of written texts.

Rodmonga Potapova: Prof. Dr., Institute of Applied and Mathematical Linguistics, Moscow State Linguistic University.

Vsevolod Potapov: Prof. Dr., Senior Researcher, Faculty of Philology, Lomonosov Moscow State University.

¹ This research is supported by the Russian Science Foundation, Project No. 18-18-00477 (head of the research project: Dr. habil., Prof. Rodmonga Potapova).

In Russia, the first attempt to create a large language corpus was the Computer Fund of the Russian Language. Its purpose was to form a representative corpus and sub-corpora of various texts of modern Russian, together with special software to annotate them [Andryushchenko, 1989; Potapova, Bobrov, 2015]. The Fund currently serves internal tasks of the Russian Language Institute (RLI) of the Russian Academy of Sciences (RAS), such as maintenance of the Russian dialectological atlas and creation of automatic concordances for texts of Russian folklore, political texts, Old Russian texts of the 11th-17th centuries, etc. Each task requires a separate software package. The structure of the Computer Fund of the Russian Language included a representative number of dictionaries [Arefyev, Panchenko, Lukanin, Lesota, Romanov, 2015; The collection "Corpus linguistics in Russia", 2003]. These projects focus on the modern standard of Russian across its various genres and styles. The Association "Russian National Corpus" was established; it includes a large group of linguists from Moscow, St. Petersburg, Novosibirsk, and other research centers of Russia [Arefyev, Panchenko, Lukanin, Lesota, Romanov, 2015; The collection "Corpus linguistics in Russia", 2003; Melchuk, 1999; Sharov, 2003; Russian National Corpus: www.ruscorpora.ru]. The text corpora (TC) for the above projects are annotated with the following parameters: personalities, genre, style, source date, scientific field, etc.

These WLDBs are provided with special linguistic information: morphological, syntactic, semantic, pragmatic, etc. [Sharov, 2003].

Spoken language databases (SLDBs) of Russian, with accompanying speech resources, have been actively developed in Russia since the second half of the 20th century [Krivnova, Zakharov, Strokin, 2001; Potapova, Potapov, Abramov, Khitina, Bobrov, Maslov, 2011]. Creating SLDBs is a priority today because of their relevance to automated speech recognition and understanding, identification and verification of a speaker by voice and speech, acmeological profiling of individuals and social groups [Bogdanova-Beglarian, Martynenko, Sherstinova, 2015; Bolshakova, Potapova, Lobanov, 2013; Glavatskin, Platonova, Rogozhina et al., 2015; Krivnova, Zakharov, Strokin, 2001; Popli, Kumar, 2017; Potapova, 2009; 2011; Potapova, Potapov, 2016; Potapova, Potapov, Rakhimberdiev, 2011], speech synthesis, speech communication, interpretation, etc.

In today's world of rapidly evolving information technologies, when "smart" homes with voice recognition devices have become a reality, "building material" for such systems is needed. Among these "building blocks" are big data in spoken language [Bogdanova-Beglarian, Martynenko, Sherstinova, 2015; Glavatskin, Platonova, Rogozhina et al., 2015; Juzová, Tihelka, Matousek, 2016; Potapova, 2014; Schuller, 2017; Shirokova, Platonova, Smolina, et al., 2015; Valenta, Smidl, 2015]. The formation of representative SLDBs is one of the conditions for successfully solving applied problems.

The investigations mentioned above focus primarily on Russian texts and discourse; only some of them deal with multilingual written and spoken material [Potapova, 2009; 2014]. This brief overview of existing WLDBs and SLDBs in Russia reveals a lack of multilingual verbal data for modern technology applications, e.g. automatic text processing, speech analysis and synthesis, speech and language recognition, speaker verification and identification, and the use of cloud services as a new way of organizing big data infrastructure [Bolshakova, Potapova, Lobanov, 2013; Potapova, 2011; 2012; 2014; Potapova, Andreyev, Lednov, Rogozhina, Bobrov, 2012; Potapova, Bazhenova, Potapov, Bobrov, 2015; Potapova, Bobrov, 2015; Potapova, Efremenko, 2012; Potapova, Lebedev, Bobrov, 2011; Potapova, Potapov, 2016; Potapova, Potapov, Bobrov, 2011; Potapova, Potapov, Rakhimberdiev, 2011; Potapova, Sharov, Bobrov, 2011; Shevchenko, Pozdeeva, 2017].

Elaboration of Russian-English nanotechnology WLDBs

The present investigation is based on the creation of bilingual WLDBs of nanotechnology terms for English and Russian. Using a systematic approach and reusing existing data, a number of terms were extracted from the field of nanotechnology, including terms with the prefix 'nano-' [Ahlbom, Bridges, et al., 2007]. A major difficulty in elaborating WLDBs is the semantic non-coincidence of linguistic items (semes, words, phrases, sentences, etc.) across languages [Wen, Tan, Wang, Li, Gao, 2013]. Each term was provided with a definition and an example of usage in both languages. The first direction of the study is an attempt to systematize the collected terms by subsumption relations, to suggest a possible way of grouping nanoterms, and to point out the importance of lexicographic work of this kind. The main goal is to collect terms with the prefix 'nano-' in Russian and English and to find common ground between this term system and other term systems widely used in the language of modern science.

The terminology items that form the basis of WLDBs cover:

• the inventory of technical nano-items that designate a particular subject field (e.g., bio-, medical, electronic, chemical, etc. nanotechnology items);

• the linguistic categories and rules for using words and sentences in text;

• the concordances of nano-items in texts and sentences [Glaeser, 1993].
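The third item above, a concordance of nano-items, can be illustrated with a minimal keyword-in-context sketch. It is not the authors' tooling: the function name `nano_concordance`, the context width, and the sample sentence are assumptions made for illustration only.

```python
import re

# Matches tokens that start with the prefix in Latin or Cyrillic script,
# with an optional hyphen after the prefix (e.g. "nano-device").
NANO_PATTERN = re.compile(r"\b(?:nano|нано)-?\w+", flags=re.IGNORECASE)

def nano_concordance(text, width=30):
    """Return (left context, term, right context) for every nano-prefixed token."""
    hits = []
    for match in NANO_PATTERN.finditer(text):
        left = text[max(0, match.start() - width):match.start()].strip()
        right = text[match.end():match.end() + width].strip()
        hits.append((left, match.group(0), right))
    return hits

# Hypothetical sample sentence, not taken from the database itself.
sample = "A nanoshell coats the core. Each nanorobot carries a nanosensor."
for left, term, right in nano_concordance(sample, width=12):
    print(f"{left:>12} | {term} | {right}")
```

The same pattern handles Russian text because Python's `\w` and `\b` are Unicode-aware for `str` patterns.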

The research was motivated by one of the major problems of multilingual automatic text processing: word sense disambiguation. This problem is causally linked to the analytic and synthetic features of the syntax of the two languages. Nanoterms and their definitions are analyzed from a linguistic point of view to show that they conform to the principles of lexicology. Definitions are grouped according to the determiner used in them. It is also shown that the mechanisms of synonymy, homonymy, and polysemy hold within this term system. The research aims to show the advantages of bilingual WLDBs of nanoterms and their indispensability.

During the research, electronic media resources and scientific articles on nanotechnology in both English and Russian were studied. Using a combinational approach (combining information from various sources), a number of terms were extracted. Each term was provided with a definition and a contextual example of usage in both languages. It was essential that the definitions be accurate, complete, and non-contradictory, that the examples make the terms more transparent, and that ambiguities be avoided.

The main result of the study is that the majority of terms used in nanotechnology are broadly self-explanatory. Most of the concepts seen at the very small dimensions associated with nanotechnology are not new and can be described by the existing terminology used at larger scales. Many nanotechnology terms are based on commonly used words that name a science or a well-known device, and they do not conflict with the general meaning of such words.

In most cases the nature of a nanoterm is determined by the word that follows the prefix, and there is no need to change the meaning of any scientific term, such as metre or material, just because it is prefixed by 'nano-'.

After studying the given nanoterms, it was observed that the definitions use as a determiner either the same term without the prefix 'nano-' (bomb for nanobomb, shell for nanoshell, etc.), its synonym (machine for nanorobot, weapon for nanobomb), or a derivation (crystalline material for nanocrystal, optical engineering for nanooptics), or else introduce such words as nanoobject, nanodevice, nanostructure, molecular device, material, science, end item, development, and so on.
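The grouping of definitions by determiner described above can be sketched as a small classifier. This is only an illustration under stated assumptions: the lookup tables `SYNONYMS` and `DERIVATIONS` are hypothetical and contain nothing beyond the pairs quoted in the text, and the category labels are invented here.

```python
def strip_nano(term):
    """Return the term without a leading 'nano' prefix (and optional hyphen)."""
    lowered = term.lower()
    if lowered.startswith("nano"):
        return lowered[4:].lstrip("-")
    return lowered

# Hypothetical hand-made lookup tables; only pairs quoted in the text are listed.
SYNONYMS = {"nanorobot": {"machine"}, "nanobomb": {"weapon"}}
DERIVATIONS = {
    "nanocrystal": {"crystalline material"},
    "nanooptics": {"optical engineering"},
}

def determiner_type(term, determiner):
    """Classify how the determiner of a definition relates to the defined term."""
    term, determiner = term.lower(), determiner.lower()
    if determiner == strip_nano(term):
        return "base term"           # e.g. bomb for nanobomb
    if determiner in SYNONYMS.get(term, set()):
        return "synonym"             # e.g. weapon for nanobomb
    if determiner in DERIVATIONS.get(term, set()):
        return "derivation"          # e.g. crystalline material for nanocrystal
    return "generic determiner"      # e.g. nanodevice, nanostructure, material

print(determiner_type("nanoshell", "shell"))
print(determiner_type("nanosensor", "nanodevice"))
```

In practice the synonym and derivation tables would have to be compiled by a lexicographer; only the prefix-stripping case can be decided mechanically.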

The following linguistic phenomena were found during the research. Four terms (nanobomb, nanorobot, nanomachine, and nanobodies) are polysemous, each having two meanings that usually do not differ drastically from each other. Two cases of synonymy were found in Russian (наноробот / наномашина, наномотор / нанодвигатель). English provides more synonyms (e.g. nanorobot / nanobot / nanite / nanomachine, nanoshell / core-shell), which can be explained by the fact that nanotechnology is more widespread abroad than in Russia. One pair of homonyms was also found (наномашина as nanomachine vs. наномашина as nanocar).

Linguistic analysis of the lexicographic descriptions of the given nanoterms revealed cases of synonymy, polysemy, and homonymy, which were then analyzed. It was observed that definitions in the field of nanoscience have some typical features that could help to systematize and classify the terms and definitions in this area. The results show that for some nanoterms the meaning is logically explained by the conventional meaning of the term's constituent parts, while in other cases a descriptive noun is used to define the term. It also turned out that the language of nanoscience is not free of the linguistic phenomena usually seen in any other semantic field.

The findings above indicate the advantages of bilingual WLDBs in the field of nanotechnology. Such databases are essential for crossing disciplinary and language boundaries, especially in an age of rapidly developing computer technology and nanotechnology. As for the terminology itself, it is broadly self-explanatory. Most of the concepts seen at the very small dimensions associated with nanotechnology are not new and can be described by the existing terminology used at larger scales. It can therefore be concluded that, despite the growing number of nanoterms, which reflects the growing interest in nanotechnology itself, it is possible to structure and systematize the terms within this nascent term system.

This research shows the need to create a nanoterm database and suggests a number of actions and recommendations to be discussed and considered: lexical units are to be systematized from the terminological point of view; stylistic labels are to be assigned; examples of collocation are to be given; illustrations are to be added. The term nanotechnology is sometimes applied to microscopic technology more broadly. At the size scale of nanotechnology, quantum-based phenomena come into play, often with counterintuitive results. These nanoscale phenomena include quantum size effects and molecular forces such as van der Waals forces. Furthermore, the vastly increased surface-to-volume ratio opens new possibilities in surface science, such as catalysis.

The device density of modern computer components continues to grow exponentially, but fundamental electronic limitations prevent the trend described by Moore's law from continuing. Current estimates predict ten to fifteen years of continued improvement before economic costs grow exponentially. Nanotechnology is seen as the next logical step for continued advances in computational architecture. The term nanotechnology is often used interchangeably with molecular nanotechnology (also known as "MNT"), a hypothetical advanced form of nanotechnology that is expected to be developed far in the future, although estimates vary. The term nanoscience describes the interdisciplinary field of science devoted to the advancement of nanotechnology. The research thus demonstrates the clear need for WLDBs, which are now being created and expanded.

The requirements for a correct terminological database include unambiguity of terms, a solution to the problem of synonymy, strict definitions, well-defined concepts, the presence of term equivalents in other languages, and so on. The terminological WLDBs being elaborated at Moscow State Linguistic University may be converted into a specialized dictionary for machine translation systems. The necessary attributes of the composition of such a dictionary have been defined. Nanoelectronics is a branch of nanotechnology, a relatively new science that has not yet been sufficiently studied. The terminology of nanoelectronics is not yet settled because of the rapid development of this science.

A special linguistic analysis of terms, e.g. from the field of nanoelectronics, was carried out. The first step of the analysis was the formation of the so-called data-conceptual apparatus. The data-conceptual apparatus of any science with many branches is formed on the basis of the terminologies of other areas of human activity and is, in turn, closely connected with the conceptual condensation of the terminology. Forming the data-conceptual apparatus made it possible to determine the theoretical and practical fields of nanoelectronics suitable for selecting terms. As a result, the terminology was divided into several large sections, and it was convenient to study the terminology within each section. The second step of the analysis consisted in discovering the basic concepts to which the terms could be related. In fact, this is the first procedure towards ordering the data in the future database. At the third step, a specific classification of concepts was built. The fourth step may be regarded as the fundamental one: the data-terminological system of terms in the field of nanoelectronics was created [n = 10,500]. As a result, specific terminological families of words were formed, each with at least one basic concept. After this procedure the data in the database became ordered to some extent. At the fifth step, once the terms had been correlated with concepts, it became possible to determine the so-called archisemes. Differential semes, which make a term's meaning individual, are subordinated to archisemes; see the fragment in Table 1.

This investigation presents a fragment of the written language database (WLDB) analysis aimed at extracting semantic items (semes). "Word-for-word" translation of technology texts would be inadequate without the analysis of differential semes.

The analysis showed that the number and character of differential semes depend on the number of additional words that make a term complex. Analyzing differential semes also made it possible to classify the terms, distinguishing common terms from complex specific terms. One of the tasks was to find identical differential semes in the definitions of terms so as to make the data in the database still better ordered and organized.

Table 1

Extraction of the linguistic differential semes for Russian and English and definition of common ones

| Russian term and definition | English term and definition |
|---|---|
| НАНОАНТЕННА - УСТРОЙСТВО В ЭЛЕКТРОНИКЕ, представляющее собой нанометровые спирали, принимающие и отправляющие электромагнитные волны | NANOANTENNA - A DEVICE IN ELECTRONICS presenting nanometer helices sending and transmitting electromagnetic waves |
| АРМАТУРА - УСТРОЙСТВО В ЭЛЕКТРОТЕХНИКЕ, представляющее собой один из двух основных электрических компонентов электромеханического механизма - двигатель или генератор | FITTING - A DEVICE IN ELECTRICAL ENGINEERING, which is one of the two principal electrical components of an electromechanical machine - a motor or generator |
| АТОМНО-СИЛОВОЙ МИКРОСКОП - УСТРОЙСТВО В МОЛЕКУЛЯРНОЙ И АТОМНОЙ ФИЗИКЕ, представляющее собой сканирующий зондовый микроскоп высокого разрешения, измеряющегося в нанометрах, действие которого основано на возникновении Ван-дер-Ваальсовых сил при взаимодействии иглы зонда с поверхностью исследуемого образца | ATOMIC-FORCE MICROSCOPE - A DEVICE IN MOLECULAR AND ATOMIC PHYSICS, which is a very high-resolution type of scanning probe microscope, the resolution being measured in nanometers, the force of which is based upon the appearance of Van-der-Waals forces by the interaction between the probe needle and the surface of a researched pattern |
| НАНОСЕНСОР - НАНОУСТРОЙСТВО, представляющее собой двумерный фотонный кристалл и используемое для исследования наночастиц, а также способное детектировать отдельные вирусы | NANOSENSOR - A NANODEVICE, which is a two-dimensional photonic crystal used for the research of nanoparticles and for the detection of certain viruses |
| НАНОРЕПЛИКАТОР - НАНОУСТРОЙСТВО, представляющее собой молекулярный ассемблер, который из атомов окружающей среды собирает молекулы, размещающиеся так, чтобы образовалась определенная вещь | NANOREPLICATOR - A NANODEVICE presenting a molecular assembler constructing molecules from atoms of the environment, these molecules being arranged to form a certain object |
| НАНОТРАНЗИСТОР - ПОЛУПРОВОДНИКОВОЕ НАНОУСТРОЙСТВО переключения электрических сигналов или усилитель | NANOTRANSISTOR - A SEMICONDUCTING NANODEVICE that acts as an electronic signal switch or amplifier |
| АТОМНО-СИЛОВАЯ МИКРОСКОПИЯ - СОВОКУПНОСТЬ МЕТОДОВ изучения объектов с использованием микроскопа, основанное на возникновении Ван-дер-Ваальсовой силы притяжения между атомами, образующими острие, и атомами, расположенными на поверхности образца, к которому подводится зонд на расстояние в несколько ангстрем | ATOMIC FORCE MICROSCOPY - A TECHNOLOGY of the research of objects with the usage of a microscope, based on the appearance of Van der Waals forces between atoms, which form an edge, and atoms, which are located on the surface of a pattern, to which a probe is brought at a distance of several angstroms |
| АДРОН - ЭЛЕМЕНТАРНАЯ ЧАСТИЦА, подверженная сильному воздействию со стороны другой такой частицы и не являющаяся истинно элементарной | HADRON - AN ELEMENTARY PARTICLE, which is subjected to a strong influence from another similar particle and which is not truly elementary |
| АНИОН - ОДНОАТОМНАЯ ИЛИ МНОГОАТОМНАЯ ЭЛЕКТРИЧЕСКИ ЗАРЯЖЕННАЯ ЧАСТИЦА, представляющая собой отрицательно заряженный ион, образующийся, когда атом приобретает электроны в результате реакции | ANION - A MONOATOMIC OR POLYATOMIC ELECTRICALLY CHARGED PARTICLE presenting a negatively charged ion formed when an atom obtains electrons in a reaction |
| АТОМ - МИКРОСКОПИЧЕСКАЯ ЭЛЕКТРОНЕЙТРАЛЬНАЯ ЧАСТИЦА, представляющая собой наименьшую часть химического элемента, являющаяся носителем его свойств и состоящая из атомного ядра и окружающего его электронного облака, которое в свою очередь состоит из отрицательно заряженных электронов, тогда как ядро состоит из положительно заряженных протонов и электрически нейтральных нейтронов | ATOM - A MICROSCOPIC UNCHARGED PARTICLE, which is the smallest unit of a chemical element, which retains the chemical properties of this element and consists of an atomic nucleus and an electron cloud, which in its turn includes negatively charged electrons, while the nucleus includes positively charged protons and electrically neutral neutrons |
| ВАН-ДЕР-ВААЛЬСОВЫ ВЗАИМОДЕЙСТВИЯ - ФИЗИЧЕСКОЕ ЯВЛЕНИЕ возникновения притягивающих или отталкивающих сил взаимодействия между молекулами (или между частями одной и той же молекулы), определяющие формирование пространственной структуры биологических макромолекул и отличные от тех, которые возникают благодаря ковалентным связям или электростатическому взаимодействию между ионами | VAN DER WAALS FORCES - A PHYSICAL EFFECT of the appearance of the attractive or repulsive forces between molecules (or between parts of the same molecule), determining the formation of a space structure of biological macromolecules and being other than those due to covalent bonds or to the electrostatic interaction between ions |
| ТУННЕЛЬНЫЙ ЭФФЕКТ - ЯВЛЕНИЕ КВАНТОВОЙ ПРИРОДЫ, при котором происходит преодоление микрочастицей потенциального барьера в случае, когда её полная энергия меньше высоты барьера | TUNNEL EFFECT - A QUANTUM EFFECT, where micro particle surmounts a potential barrier, when its full energy is less than the barrier height |

A special stage of the analysis was the construction of consistent Russian and English definitions of each term with regard to all possible contexts. This was done either on the basis of the semantic information common to the Russian and English definitions or by joining different semantic elements from the definitions in the two languages.

To summarize the methods of the research, the whole analysis may be presented as a succession of the following procedures: formation of the data-conceptual apparatus; discovery of the concepts connected with terms; formation of a classification within the system of concepts; formation of the data-terminological system; determination of archisemes and finding or formation of identical differential semes; formation of consistent definitions. This investigation reflects the breadth of the data for the two languages and of the term and seme systems, including terminology for the pre- and post-editing tasks of machine translation.
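The later steps of this procedure (grouping terms into families by archiseme and looking for identical differential semes) can be sketched with a simple data structure. This is a minimal sketch, not the authors' database schema: the class `TermEntry`, both helper functions, and the way each definition is split into semes are illustrative assumptions loosely based on Table 1.

```python
from dataclasses import dataclass, field

@dataclass
class TermEntry:
    term: str
    archiseme: str                                   # generic concept heading the definition
    differential_semes: set = field(default_factory=set)

def group_by_archiseme(entries):
    """Build terminological families: archiseme -> list of entries."""
    families = {}
    for entry in entries:
        families.setdefault(entry.archiseme, []).append(entry)
    return families

def shared_semes(entries):
    """Differential semes that occur in every entry of a family."""
    seme_sets = [entry.differential_semes for entry in entries]
    return set.intersection(*seme_sets) if seme_sets else set()

# Illustrative entries; the seme segmentation is an assumption, not the authors' data.
entries = [
    TermEntry("nanoantenna", "device", {"nanometer scale", "electromagnetic waves"}),
    TermEntry("atomic-force microscope", "device", {"nanometer scale", "scanning probe"}),
    TermEntry("hadron", "particle", {"elementary", "not truly elementary"}),
    TermEntry("anion", "particle", {"electrically charged", "negative ion"}),
]

families = group_by_archiseme(entries)
for archiseme, members in families.items():
    print(archiseme, [m.term for m in members], shared_semes(members))
```

Keeping semes as sets makes the search for identical differential semes a plain set intersection, which is one possible way to make the data "better ordered" in the sense described above.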

Versatile SLDBs (spoken language databases) as one of the priority trends of modern speechology

The majority of automated systems built today for working with spoken language rely on speech databases in some way. In particular, SLDBs are used wherever probabilistic and statistical methods are applied to the analysis and synthesis of speech signals. First of all, the following should be mentioned: automatic systems for speech recognition and synthesis; identification and verification of a speaker by voice and speech; identification of the psychophysical and emotional state of a speaker from their speech; training systems, screening speech tests, etc. [Juzová, Tihelka, Matousek, 2016; Krivnova, Zakharov, Strokin, 2001; Popli, Kumar, 2017; Potapova, 2009; 2011].

SLDBs form the basis of automated systems that are designed to: collect and store speech messages in their spoken form; search and output recorded speech messages on request (e.g., automated systems for receiving speech messages in call centers, and systems for testing communication channels) [Potapova, 2014; Potapova, Efremenko, 2012; Potapova, Potapov, Abramov, Khitina, Bobrov, Maslov, 2011].

The first SLDBs appeared in the second half of the 20th century in the USA, where their development was funded primarily by the Department of Defense. With its support, the following corpora were created: the TI-DIGITS corpus for testing systems that recognize isolated digits and digit sequences; Road Rally for the analysis and identification of key words; the King Corpus for speaker identification systems, etc. [Potapova, 2009].

The Linguistic Data Consortium (LDC) offers speech corpora that together contain hundreds of hours of spoken language. The Oregon Technology Center (CSLU, Center for Spoken Language Understanding) collects, annotates, and distributes telephone speech corpora. The Center's activity is supported by industrial sponsors, and the collected corpora are available free of charge to universities around the world. The Center also has a multilingual corpus for assessing language identification algorithms, which consists of fragments of spontaneous speech in eleven languages. In 1995, the European Language Resources Association (ELRA) was established in Europe. It holds SLDBs for most official languages of the European Union: the British and Scottish variants of English, Dutch, Danish, Swedish, German, French, Italian, and Spanish, as well as several multilingual corpora. Thanks to the Copernicus program, ELRA now also distributes speech corpora for Eastern European languages (Polish, Bulgarian, Estonian, Romanian, and Hungarian). Speech corpora for Russian can also be found on the ELRA website [Potapova, 2009].

The RuSpeech corpus is a speech database (SLDB) containing fragments of continuous Russian speech with the corresponding text, phonetic transcription, and additional information on the speakers. Cognitive Technologies aimed to create a speaker-independent system for continuous speech recognition. At present, RuSpeech includes more than 50 thousand utterances, each with phonetic markup. To create the corpus, 220 speakers were invited; each uttered 250 sentences on average. RuSpeech contains about 50 hours of continuous speech, 15 GB in volume and recorded on more than 30 CDs, which exceeds the volume of the comparable English speech databases WSJ Speech and TIMIT. The speech interface consists of a system for dialogue scripts, text-to-speech synthesis, and a system for recognition of speech commands [Bogdanova-Beglarian, Martynenko, Sherstinova, 2015; Krivnova, Zakharov, Strokin, 2001; Potapova, 2009; 2011; Potapova, Efremenko, 2012; Potapova, Potapov, Abramov, Khitina, Bobrov, Maslov, 2011].

The Comparison of various acoustic databases makes it possible to formulate some mandatory requirements to a modern phonetic database designed for fundamental and applied researchers. SLDBs for applied researchers, particularly in the area of speech synthesis and recognition, shall provide solutions for the following tasks [Bolshakova, Potapova, Lobanov, 2013; Glavatskin, Platonova, Rogozhina et al., 2015; Popli, Kumar, 2017]:

- Adding audio standards to SLDBs: digitized speech records of standard speakers in various speech styles, from spontaneous speech and reading of texts based on speech transcriptions to reading a list of words. In other words, the SLDB shall include audio material with maximum variation in the representation of linguistic units (phonemes and intonation structures) in various conditions of human speech.

- Adding segment information and a detailed phonetic description of the included audio standards. Because methods of speech recognition and synthesis differ in their basic units, the description has to be detailed: the positions of the boundaries of sounds, intonation units, syllables and word forms, together with a detailed phonetic and phonemic transcription.

- Ensuring efficient SLDB querying, so that the required audio fragments can be found by their transcriptional descriptions and by the attributes specified in the description.
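The three requirements above (audio standards, segment information, querying by transcription and attributes) can be illustrated with a minimal in-memory sketch. This is not the structure of any database described in the paper. The class names, level names, file names and sample labels are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """One labelled stretch of a recording (times in seconds)."""
    level: str    # hypothetical level names: "phone", "syllable", "word"
    label: str    # transcription symbol or word form
    start: float
    end: float


@dataclass
class Recording:
    """One audio standard with speaker and speech-style attributes."""
    audio_path: str
    speaker_id: str
    style: str    # e.g. "spontaneous", "read", "word_list"
    segments: List[Segment] = field(default_factory=list)


def find_segments(corpus: List[Recording], level: str, label: str,
                  style: Optional[str] = None) -> List[Tuple[Recording, Segment]]:
    """Return (recording, segment) pairs matching a label at a given level,
    optionally restricted to one speech style."""
    hits = []
    for rec in corpus:
        if style is not None and rec.style != style:
            continue
        for seg in rec.segments:
            if seg.level == level and seg.label == label:
                hits.append((rec, seg))
    return hits


# Hypothetical sample data.
corpus = [
    Recording("rec_001.wav", "spk01", "read", [
        Segment("word", "dom", 0.00, 0.42),
        Segment("phone", "d", 0.00, 0.08),
        Segment("phone", "o", 0.08, 0.30),
        Segment("phone", "m", 0.30, 0.42),
    ]),
    Recording("rec_002.wav", "spk02", "spontaneous", [
        Segment("phone", "o", 0.10, 0.25),
    ]),
]

for rec, seg in find_segments(corpus, "phone", "o", style="read"):
    print(rec.audio_path, seg.start, seg.end)
```

Storing segment boundaries per level keeps the same recording usable for systems that work with different basic units, which is the motivation given in the second requirement.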

A phonetician creates "sound-letter" transition rules, and sounds are represented by the special alphabet SAMPA (Speech Assessment Methods Phonetic Alphabet). Speech databases created within the SpeechDat projects meet the following requirements: they cover phonetically representative words, commands, phrases, numbers, digits, digit sequences and utterances; represent various styles of pronunciation (commands, reading, spontaneous speech, etc.); capture the acoustic environment; and are suitable for the development and training of robust speech recognition systems, for voice control of robots, and for the analysis of voice and speech signals for forensic purposes [Potapova, Andreyev, Lednov, Lyutova, Bobrov, 2012; Potapova, Andreyev, Lednov, Rogozhina, Bobrov, 2012; Potapova, Bobrov, 2015].

Some examples of SLDBs are given below. The task was to form an SLDB of Arabic represented by audio-texts. The primary corpus of the database consists of fragments of Arabic speech that were to be segmented by native listeners at the pre-phrase and phonemic levels, depending on the perceptual-auditory specifications of the task. Recordings were made with the help of the programs Cool Edit 2000 and Real Player Plus 8.0. A large body of fuzzy data (recordings of various texts) was obtained from material on tapes (digitization conditions: 22050 Hz, 16 bit, mono). The texts included monologues, dialogues and polylogues performed by native speakers, both men and women. Other recordings were news reports taken from various Internet portals. The database was developed under the project "Multi-purpose multilingual corpus linguistics". The preliminary total volume of the SLDB audio material was 10 GB. It represented the voices of 80 speakers (men and women) with diverse variants of Arabic pronunciation. The SLDB included 2,070 pairs of files (audio material and text in spelling and transcription). Audio files were digitized in WAV format (Microsoft Wave). This representation makes it easy to search and match material, as well as to incorporate the information in any automated speech system, which corresponds to the task of forming a multipurpose SLDB. In some cases compressed audio files (e.g., WMA-8, FM Radio quality) were used as raw material and were subsequently converted to WAV format with the above parameters.
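The digitization conditions quoted above (22050 Hz, 16 bit, mono, WAV) lend themselves to an automatic check when files are added to a corpus. The following is a minimal sketch using Python's standard `wave` module. It is not part of the authors' toolchain, and the helper names are hypothetical. The demonstration files are generated in memory so that the example is self-contained.

```python
import io
import wave

# Target parameters quoted in the text: 22050 Hz, 16 bit (2 bytes), mono.
TARGET = {"framerate": 22050, "sampwidth": 2, "nchannels": 1}


def wav_matches_target(fileobj):
    """Return True if a WAV file (path or file object) has the target format."""
    with wave.open(fileobj, "rb") as wav:
        return (
            wav.getframerate() == TARGET["framerate"]
            and wav.getsampwidth() == TARGET["sampwidth"]
            and wav.getnchannels() == TARGET["nchannels"]
        )


def make_silent_wav(framerate, sampwidth, nchannels, seconds=0.1):
    """Build a short silent WAV in memory (for demonstration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(nchannels)
        wav.setsampwidth(sampwidth)
        wav.setframerate(framerate)
        n_frames = int(framerate * seconds)
        wav.writeframes(b"\x00" * (n_frames * sampwidth * nchannels))
    buf.seek(0)
    return buf


print(wav_matches_target(make_silent_wav(22050, 2, 1)))  # conforming file
print(wav_matches_target(make_silent_wav(44100, 2, 2)))  # stereo, 44.1 kHz
```

A check of this kind would flag material that still needs conversion, such as the compressed source files mentioned above once they have been decoded to WAV.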

The SLDB for Arabic was formed as follows:

- the audio material was repeatedly played in full to listeners who were specialists in Arabic and in experimental phonetics;

- each played audio file was transcribed using the transcription system adopted in the international information network (SAMPA for Arabic);

- the perceptual-auditory analysis and transcription used files compiled at the first stage of the research, containing authentic Arabic material spoken by native speakers (men and women);

- during the perceptual-auditory analysis some material was discarded because of noise in the speech signal;

- in parallel with the above segmentation, acoustic segmentation (using CoolEdit 1.0 and Sound Forge 4.5c software) and macro-textual segmentation (using the Microsoft Word processor) were performed;

- the resulting speech segments (n = 2070) were subjected to an additional perceptual-auditory analysis to confirm the correspondence between the acoustic and macro-textual information within each segment;

- the SLDB was formed;

- the segmented material (audio and text files) was recorded on optical media (CD-ROM);

- all the fragments included in the SLDB were transcribed using the international universal phonetic alphabet SAMPA (Sampa: http://www.phon.ucl.ac.uk/home/sampa/home.htm; Sampa: http://en.wikipedia.org/wiki/SAMPA) with indication of vowel length [Potapova, Potapov, Bazhenova, 2015].
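Since the transcriptions indicate vowel length, a small sketch can show how such marking might be read back from a transcription string. It assumes that symbols are separated by spaces, that length is marked with the SAMPA length mark ":", and that the vowel inventory is the three-vowel set given in the code. The storage format of the actual SLDB is not described in the paper, so these are illustrative assumptions, and the sample string is hypothetical.

```python
LENGTH_MARK = ":"          # SAMPA length mark
VOWELS = {"a", "i", "u"}   # assumed vowel inventory for this illustration


def vowel_lengths(transcription):
    """Return (vowel, is_long) pairs from a space-separated SAMPA-style string."""
    result = []
    for symbol in transcription.split():
        base = symbol.rstrip(LENGTH_MARK)
        if base in VOWELS:
            result.append((base, symbol.endswith(LENGTH_MARK)))
    return result


# Hypothetical sample transcription with one short and one long vowel.
print(vowel_lengths("k i t a: b"))   # [('i', False), ('a', True)]
```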

The transcriptional records also contain some information about the grammatical and morphological structure of words (articles that underwent phonetic assimilation are shown), as this information can be helpful in the future when the SLDB is used to study the phonetic variation of Arabic speech. Similar SLDBs were developed for French, Turkish, Lithuanian, Polish, Chechen, and some languages of ethnic minorities of Russia (e.g., [Potapova, Andreyev, Lednov, Lyutova, Bobrov, 2012; Potapova, Andreyev, Lednov, Rogozhina, Bobrov, 2012; Potapova, Lebedev, Bobrov, 2011; Potapova, Potapov, Abramov, Khitina, Bobrov, Maslov, 2011; Potapova, Potapov, Bobrov, 2011; Potapova, Sharov, Bobrov, 2011]).

The Arabic SLDB [Potapova, Lebedev, Bobrov, 2011] was used for development of the research Cloud Technology Stand-alone System [Potapova, Bazhenova, Potapov, Bobrov, 2015; Potapova, Potapov, Bazhenova, 2015]. The Windows Azure platform offers various data storage services that allow data to be placed in reliable, scalable storage in a cloud, and supports two types of data storage: Windows Azure Storage, which stores tables, big objects and queues, and SQL Azure, a full-function DB. Using SQL Azure as a base for SLDB formation provides the following essential advantages: simplicity of installation and deployment; no expenses on infrastructure maintenance; payment for actual resource consumption instead of the per-server or per-processor licenses required for MS SQL Server; wide availability, both from any location and through a wide range of supported technologies; high fault tolerance ensured by triple replication of data; management and recovery within the cloud; flexible scalability; support of a relational data model; interaction analogous to that with MS SQL Server, including use of T-SQL.

To solve the problem of replenishing the integrated relational Spoken Language Big Data (SLBD) with information from hierarchically organized storages, a program was developed that converts hierarchical SLBDs into integrated relational SLBDs. This program solved the following problems: forming a hierarchical data model for linguistic information from SLBDs; adding linguistic information for SLBDs in the form of hierarchical structures, using both predetermined structures and the possibility to specify any hierarchical structure using a subset of the XML document representation standard; converting hierarchically organized SLBDs into relational DBs.

The converting program converts hierarchically organized data describing information for SLBDs into SQL (Structured Query Language) operators. From the hierarchical structures in which information for SLBDs is originally described, it is possible to create a set of SQL operators that form the tables of a relational DB (Fig. 1).
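The general idea of such a conversion can be sketched as follows. This is an illustrative sketch only, not the authors' registered converter program. The XML element and attribute names, the table layout and the sample data are hypothetical, and Python's built-in `sqlite3` module stands in for SQL Azure so that the example runs locally.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical hierarchical description of two SLDB entries.
XML_DATA = """
<sldb>
  <recording id="1" speaker="spk01" file="rec_001.wav">
    <segment label="k" start="0.00" end="0.07"/>
    <segment label="i" start="0.07" end="0.15"/>
  </recording>
  <recording id="2" speaker="spk02" file="rec_002.wav">
    <segment label="a:" start="0.10" end="0.31"/>
  </recording>
</sldb>
"""

# One table per level of the hierarchy (assumed layout).
SCHEMA = [
    "CREATE TABLE recording (id INTEGER PRIMARY KEY, speaker TEXT, file TEXT)",
    "CREATE TABLE segment (recording_id INTEGER REFERENCES recording(id), "
    "label TEXT, start_s REAL, end_s REAL)",
]


def xml_to_sql(xml_text):
    """Flatten the tree into (statement, parameters) pairs.
    Each child row keeps its parent's id as a foreign key."""
    statements = [(stmt, ()) for stmt in SCHEMA]
    root = ET.fromstring(xml_text)
    for rec in root.findall("recording"):
        rec_id = int(rec.get("id"))
        statements.append((
            "INSERT INTO recording (id, speaker, file) VALUES (?, ?, ?)",
            (rec_id, rec.get("speaker"), rec.get("file")),
        ))
        for seg in rec.findall("segment"):
            statements.append((
                "INSERT INTO segment (recording_id, label, start_s, end_s) "
                "VALUES (?, ?, ?, ?)",
                (rec_id, seg.get("label"),
                 float(seg.get("start")), float(seg.get("end"))),
            ))
    return statements


conn = sqlite3.connect(":memory:")   # local stand-in for a cloud database
for stmt, params in xml_to_sql(XML_DATA):
    conn.execute(stmt, params)

rows = conn.execute(
    "SELECT r.file, s.label FROM segment s "
    "JOIN recording r ON r.id = s.recording_id WHERE s.label = ?",
    ("a:",),
).fetchall()
print(rows)
```

The design point the sketch illustrates is that parent-child nesting in the hierarchy becomes a foreign-key relation between tables, after which the data can be queried with ordinary relational joins.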

Further, after the conceptual data model of the integrated SLDB has been formed and a representative sample of test data has been created, the problem of data configuration arises. The research stand-alone system was tested by means of Microsoft Windows Azure, and the SQL Azure platform will make it possible to test the connection of the cloud infrastructure to the local SLDB center.

Figure 1. Scheme of the research stand [Potapova, Bazhenova, Potapov, Bobrov, 2015]

CONCLUSIONS

In conclusion, it should be emphasized that fuzzy language resources are a crucial component in the development and operation of various information systems that implement linguistic functions aimed at natural language processing in its various manifestations (for printed and handwritten texts, and for spoken language).

In the field of corpus linguistics, modern computer technologies accelerate and simplify the linguistic processing of large volumes of text. Such processing relies on text resources in a given language, held in electronic form and supplied with specific additional information about the properties of the linguistic material.

Most of the linguistic information in WLDBs and SLDBs is formed as hierarchical structures.

Linguistic information for WLDBs and SLDBs in the form of hierarchical structures can be created as XML data and displayed in the form of a tree. The converting program transforms hierarchically organized data describing information for speech databases into SQL (Structured Query Language) operators. On the basis of the hierarchical structures in which information for speech databases is originally described, it thus becomes possible to create a set of SQL operators that form the tables of a relational database.

This program provides solutions to the following problems: forming a hierarchical data model for the linguistic information of written and spoken databases; adding linguistic information for databases in the form of hierarchical structures, using predetermined structures as well as the possibility to specify any hierarchical structure on the basis of a subset of the XML document representation standard; transforming hierarchically organized speech databases into relational databases.

Further investigations can be planned not only in the domain of multilingual databases, but also in the domain of multimodal polycode (verbal, paraverbal, and non-verbal) databases, both structured and weakly structured big databases.

Thus, the analysis of fuzzy big data is a rapidly developing area of the Big Data industry, serving both scientific research and the solution of a number of applied problems.

References

1. Ahlbom, A., Bridges, J. et al. Opinion on the scientific aspects of the existing and proposed definitions relating to products of nanoscience and nanotechnologies. Opinion adopted by SCENIHR (Scientific Committee on Emerging and Newly Identified Health Risks) at the 21st plenary. Brussels, 2007.

2. Andryushchenko, V.M. Kontseptsiya i arhitektura mashinnogo fonda russkogo yazyka. [The concept and architecture of the Russian language computer foundation] Moskva: Nauka, 1989.

3. Arefyev, N., Panchenko, A., Lukanin, A.V., Lesota, O.O., Romanov, P.V. Evaluating three corpus-based semantic similarity systems for Russian. In: Dialog 28. 17. Moscow, 2015.

4. Bogdanova-Beglarian, N., Martynenko, G., Sherstinova, T. The "one day of speech" corpus: phonetic and syntactic studies of everyday spoken Russian. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNAI 9319. Springer International Publishing, 2015, pp. 429-437.

5. Bolshakova, M., Potapova, R., Lobanov, V. Baza dannyh nemetsko-russkogo intellektual'nogo elektronnogo slovarya po mehatronike i robototehnike. [Database of the German-Russian intellectual electronic dictionary on mechatronics and robotics] RIA certificate No 2013620720. Moskva, 2013

6. Glaeser, R. Linguistic features and genre profiles of scientific English. Leipziger Fachsprachen-Studien 9. Lang, Frankfurt/Main, 1993.


7. Glavatskin, I., Platonova, T., Rogozhina, V. et al. The multi-level approach to speech corpora annotation for automatic speech recognition. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNAI 9319. Springer International Publishing, 2015, pp. 438-445.

8. Juzova, M., Tihelka, D., Matousek, J. Designing high-coverage multi-level text corpus for non-professional-voice conservation. In: Ronzhin, A., Potapova, R., Nemeth G. (eds.) SPECOM 2016. LNAI 9811 Springer International Publishing, 2016, pp. 207-215.

9. Korpusnaya lingvistika v Rossii [Corpus linguistics in Russia] (compiled by E.V. Rahilina, S.A. Sharov). Scientific and Technical Information, ser. 2: Information Processes and Systems, N 5-6. Moskva, 2003.

10. Krivnova, O.F., Zakharov, L.M., Strokin, T.S. Rechevye korpusy (opyt razrabotki i ispol'zovanie) [Speech corpora (elaboration experience and applications)]. Dialog-2001. http://www.dialog-21.ru/digest/2001/articles/krivnova/ Moskva, 2001.

11. Melchuk, I.A. Opyt teorii lingvisticheskih modelej. Smysl-tekst: semantika, sintaksis. [Experience of the linguistic modeling theory. Sense-text: semantics, syntax] Moskva, 1999.

12. Popli, A., Kumar, A. Multimodal keyword search for multilingual and mixlingual speech corpus. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNAI 10458. Springer International Publishing, 2017, pp. 535-545.

13. Potapova, R. Multilingual spoken language databases in Russia. In: Potapova, R., Ronzhin, A. (eds.). Speech and computer 2011, Kazan, 2011, pp. 13-17.

14. Potapova, R. Osnovnye tendentsii razvitiya mnogoyazychnoj korpusnoj lingvistiki. [The main trends in the development of multilingual corpus linguistics] Speech technology, 2-3, 92-114, 93-112. Moskva, 2009.

15. Potapova, R. Phonetische Datenbasen als Grundlage der modernen Sprechtechnologien. In: Bose, I., Neuber, B. (Hg.) Sprechwissenschaft: Bestand, Prognose, Perspective, vol. 51. Lang, Frankfurt/Main, 2014, pp. 191-198.

16. Potapova, R., Andreyev, M., Lednov, D., Lyutova, D., Bobrov, N. Transkribirovannaya rechevaya baza dannyh dlya pol'skogo yazyka. [Transcribed speech database for the Polish language] Authorship certificate No. 2012620697. Moskva, 2012.

17. Potapova, R., Andreyev, M., Lednov, D., Rogozhina, V., Bobrov, N. Transkribirovannaya rechevaya baza dannyh dlya litovskogo yazyka. [Transcribed speech database for the Lithuanian language] Authorship certificate No. 2012620696. Moskva, 2012.

18. Potapova, R., Bazhenova, I., Potapov, V., Bobrov, N. Programma-konverter ierarhicheskih rechevyh baz dannyh v relyatsionnye integrirovannye rechevye bazy dannyh. [Program-converter of hierarchical speech databases into relational integrated speech databases] Authorship certificate No. 2015611362. Moskva, 2015.

19. Potapova, R., Bobrov, N. Versatile linguistic database annotation: practical issues and a new flexible approach. In: Fakotakis, N., Ronzhin, A., Potapova, R. (eds.). Speech and Computer 2015. Proceedings, vol. II. Patras, 2015, pp. 45-53.

20. Potapova, R., Efremenko, N. Terminologicheskaya baza dannyh po rechevedeniyu (nemetsko-russkie sootvetstviya). [Terminology database on speech studies (German-Russian equivalents)] Authorship certificate No. 2012620716. Moskva, 2012.

21. Potapova, R., Lebedev, V., Bobrov, N. Transkribirovannaya rechevaya baza dannyh dlya arabskogo yazyka. [Transcribed speech database for the Arabic language] Authorship certificate No. 2011620788. Moskva, 2011.

22. Potapova, R., Potapov, V. Polybasic attribution of social network discourse. In: Ronzhin, A., Potapova, R., Nemeth, G. (eds.). SPECOM 2016. LNAI 9811. Springer International Publishing, 2016, pp. 539-546.

23. Potapova, R., Potapov, V., Abramov, Yu., Khitina, M., Bobrov, N., Maslov, A. Ustno-rechevaya baza dannyh dlya russkogo yazyka. [Spoken language database for Russian] Authorship certificate No. 2011620790. Moskva, 2011.

24. Potapova, R., Potapov, V., Bazhenova, I. Development of the research cloud technology stand-alone system (regarding integrated speech databases). In: Fakotakis N., Ronzhin A., Potapova R. (eds.). Speech and Computer 2015. Proceedings, vol. II., pp. 1-7. Patras, 2015.

25. Potapova, R., Potapov, V., Bobrov, N. Transkribirovannaya rechevaya baza dannyh dlya chechenskogo yazyka. [Transcribed speech database for the Chechen language] Authorship certificate No. 2011620787. Moskva, 2011.

26. Potapova, R., Potapov, V., Rakhimberdiev, B. Baza dannyh, prednaznachennaya dlya diagnostiki fizicheskogo i emotsional'nogo sostoyaniya cheloveka s uchetom rechevyh i golosovyh harakteristik. [Database designed to diagnose the physical and emotional state of a person with regard to speech and voice characteristics] Authorship certificate No. 2011620789. Moskva, 2011.

27. Potapova, R., Sharov, M., Bobrov, N. Transkribirovannaya rechevaya baza dannyh dlya turetskogo yazyka. [Transcribed speech database for Turkish]. Authorship certificate No. 2011620786. Moskva, 2011.

28. Russian National Corpus: www.ruscorpora.ru

29. Sampa: http://en.wikipedia.org/wiki/SAMPA

30. Sampa: http://www.phon.ucl.ac.uk/home/sampa/home.htm

31. Schuller, B.W. Big data, deep learning — at the edge of X-ray speaker analysis. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNAI 10458. Springer International Publishing, 2017, pp. 20-34.

32. Sharov, S.A. Predstavitel'nyj korpus russkogo yazyka v kontekste mirovogo opyta. [Representative Russian language corpus in the context of worldwide experience] Nauchno-tehnicheskaja informatsiya [The Journal of Scientific and Technical Information], ser. 2: Information Processes and Systems, 6. Moskva, 2003, pp. 9-17.

33. Shevchenko, T., Pozdeeva, D. Canadian English word stress: a corpora-based study of national identity in a multilingual community. In: Karpov, A., Potapova, R., Mporas, I. (eds.) SPECOM 2017. LNAI 10458. Springer International Publishing, 2017, pp. 221-232.

34. Shirokova, A., Platonova, T., Smolina, A., et al. Russian corpora for speech technology applications: purposes and tools for build-up. In: Fakotakis N., Ronzhin A., Potapova R. (eds.). Speech and Computer 2015. Proceedings, vol. II. Patras, 2015, pp. 63-70.

35. Valenta, T., Smidl, L. WebTransc — a www interface for speech corpora Production and processing. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNAI, vol. 9319. Springer International Publishing, 2015, pp. 487-494.

36. Wen, K., Tan, S., Wang, J., Li, R., Gao, Y. A model based transformation paradigm for cross-language collaborations. Adv. Eng. Inform. 27 (1), 2013, pp. 27-37.

R.K. Potapova, V.V. Potapov

SOME METHODS FOR DEVELOPING MULTILINGUAL LINGUISTIC DATABASES WITH REGARD TO SPOKEN AND WRITTEN SPEECH²

Federal State Budgetary Educational Institution of Higher Education Moscow State Linguistic University, 38 Ostozhenka St., Moscow 119034

Federal State Budgetary Educational Institution of Higher Education Lomonosov Moscow State University, 1 Leninskie Gory, Moscow 119991

The article covers some methods for developing linguistic databases for spoken and written speech that are being pursued by research groups in Russia. The problem area is presented both in historical perspective and from the standpoint of current trends in research methods and annotation techniques. The formation of written language databases is considered using the example of a study of Russian-English terminological correspondences in the field of modern nanotechnologies. For multilingual spoken language databases, the authors consider acoustic methods of automated segmentation and transcription of the speech flow. The verbal content of the written and spoken language databases includes units of Russian and English and, for the spoken language databases, of some languages of the countries of the former USSR, with regard to the task of integrating these databases with modern cloud technologies.

Keywords: linguistic databases; spoken language databases; terminology databases; nanotechnology; cloud technology.

² The research is supported by the Russian Science Foundation, project No. 18-18-00477 (principal investigator: Dr. Sci. (Philology), Prof. R.K. Potapova).

Information about the authors: Potapova Rodmonga Kondratyevna, Doctor of Philology, Professor, Director of the Institute of Applied and Mathematical Linguistics, Moscow State Linguistic University (e-mail: RKPotapova@yandex.ru); Potapov Vsevolod Viktorovich, Doctor of Philology, Senior Researcher at the Faculty of Philology, Lomonosov Moscow State University (e-mail: [email protected]).

