
Generation of coherent text. An analysis of neural network mechanics. Mechanic one: the language model as a tool for working with language

Grinin Igor Leonidovich

Master's student, Department of Software of Automated Systems, Volgograd State Technical University (VSTU), frederickbrown@yandex.ru

This article is the first in a series of three devoted to analyzing the mechanics of neural network models for generating coherent text. It examines the first of the three mechanics of a coherent text generation model: the language model. The research method is a comparative analysis of several of the most frequently used language models, including both models that use neural network technologies and models that do not. A full analysis of operation was performed for each of the presented models. The study concludes with an experimental comparison of three language models, or natural language processing models, all three of which are frequently used in practice. During the experiment, a table was compiled containing the data obtained in the course of the research. For each model, its characteristics and capabilities were listed and assigned numerical values by expert assessment. Each of the capabilities and characteristics is described to make the evaluation as sound as possible. The work also yields a number of theoretical insights into working with text that may prove useful for various kinds of text data processing.
Keywords: text analysis, vector representation of words, programming, neural network training, language model, text embeddings

Introduction

This paper studies the operation of a model for generating coherent text. Since most forms of interaction, including voice assistants, robots, answering machines, text generators and so on, begin with composing text, text generation is one of the key problems in intelligent information processing technology.

At the moment there is a large amount of information in the scientific literature on each of the modules [1-4]. An analysis of the literature shows that the topic of NLP modeling is very popular: in total, about 100,000 articles have been published, with about 17,000 of them published in 2019 alone, which demonstrates the growing interest in the topic. However, there is a lack of research into the models as components of a single system.

This article is theoretical, but some of the practical results of language models have been shown in our previous work.

This article considers one of the three modules of this work: the language model.

The language model works with so-called "word embeddings": words, or parts of words, represented in a form the system can use for analysis.

A language model is a probability distribution over sequences of words from a dictionary.
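In standard notation (this formula is the conventional definition rather than one given in the original article), a language model assigns a probability to a word sequence by factoring it into next-word predictions:

\[ P(w_1, \dots, w_N) = \prod_{t=1}^{N} P(w_t \mid w_1, \dots, w_{t-1}) \]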

The quality of a language model is assessed using perplexity, an indicator of how effectively the model predicts the elements of a test collection (the lower the perplexity, the better the model).
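For reference, perplexity over a test sequence of N words is conventionally defined as follows (again, a standard definition rather than a formula from the article):

\[ \mathrm{PP}(W) = P(w_1, \dots, w_N)^{-1/N} = \exp\left( -\frac{1}{N} \sum_{t=1}^{N} \ln P(w_t \mid w_1, \dots, w_{t-1}) \right) \]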

An example of the simplest language model is presented in the figure below:


Figure 1. Main components of the language model

The figure shows the simplest language model, which accepts individual letters as input. From the composed letters it obtains the word and then calculates the probability distribution of the next word. The dark blue rectangle denotes the module that accepts the incoming set of letters and outputs its vector representation. The yellow rectangle shows the module that receives this representation of the word state vector and calculates the distribution of the next word. The last, green, layer determines the position of the word relative to all other words in the dictionary [1].
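To make the idea of predicting the next word concrete, below is a deliberately simplified, non-neural sketch: a count-based bigram model over a toy corpus. It mirrors only the final stage of the pipeline in Figure 1 (producing a probability distribution over next words); the corpus and the function name are illustrative assumptions, not part of the model described above.

from collections import Counter, defaultdict

# A toy corpus; in a real setting this would be a large text collection.
corpus = "the king rules the land . the queen rules the land .".split()

# Count bigrams: how often each word follows a given word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_distribution(word):
    """Return P(next word | word) estimated from raw bigram counts."""
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_distribution("the"))    # {'king': 0.25, 'land': 0.5, 'queen': 0.25}
print(next_word_distribution("rules"))  # {'the': 1.0}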

Symbol embeddings

The simplest way to represent a character as input data is direct, or one-hot, encoding. To implement this method, the entire incoming set of characters (for example, the alphabet) is represented as a binary array. For each character this array then contains a one at the position of that character in the set and zeros everywhere else [6]. Example 1:

onehot('a') = [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
onehot('c') = [0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

This is how the letters a and c of the Latin alphabet look in direct encoding. However, this method is too crude and consumes a lot of resources. Instead, a so-called "dense" representation of characters is used. The blue "CNN" block in the diagram of our model is responsible for translating the not-yet-processed characters into their vector representation and passing it on to the next module.
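A minimal sketch of this direct (one-hot) encoding for a 26-letter Latin alphabet; the helper name onehot is chosen here simply to match the example above:

import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def onehot(char):
    """Return a binary vector with a one at the character's position in the alphabet."""
    vec = np.zeros(len(ALPHABET), dtype=int)
    vec[ALPHABET.index(char)] = 1
    return vec

print(onehot("a"))  # [1 0 0 ... 0]
print(onehot("c"))  # [0 0 1 ... 0]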

Since the model's input has a dimensionality of 256 (non-ASCII characters are expanded into several bytes, each of which is encoded separately) and is projected onto a dimensionality of 16, an example character looks like this.

Example 2

array([ 1.10141766, -0.67602301,  0.69620615,  1.96468627,  0.84881932,
        0.88931531, -1.02173674,  0.72357982, -0.56537604,  0.09024946,
       -1.30529296, -0.54770935, -0.74167275,  0.76146501, -0.30620322,
        1.02123129], dtype=float32)

If we present this model in a space of dimension two, the pairs that had the smallest distance between them in the 16-dimensional representation will also be closer to each other there. This is shown in the figure below.

Figure 3. Representation of the model's symbols in two-dimensional space

In the figure we can see that all the digits lie in the same area and that the lowercase and capital letters are for the most part very close to each other. Special symbols and punctuation also form their own separate clusters. However, there is no definite pattern in the arrangement of the symbols in space, and this applies to all dimensions of the space: if a regularity is absent in the two-dimensional representation, it will not appear in the 16-dimensional representation either.
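As an illustrative sketch of such a dense character representation, the 256 x 16 embedding table below is randomly initialized purely to show the lookup mechanics; in the real model these values are learned during training.

import numpy as np

rng = np.random.default_rng(0)

# A dense embedding table: 256 possible byte values, each mapped to a 16-dimensional vector.
embedding_table = rng.normal(size=(256, 16)).astype(np.float32)

def embed(char):
    """Look up the dense 16-dimensional vector for a single byte value."""
    return embedding_table[ord(char) % 256]

a, c = embed("a"), embed("c")
print(a.shape)                 # (16,)
print(np.linalg.norm(a - c))   # distance between the two character embeddings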

Vector representation

Vector representations of words are the main way of working with text embeddings. There are many variations of these models, but in this article we will look at how vectorization works using the most popular model and then draw a small comparison with the others. It should be clarified that even with a large number of variations of language models and vectorization technologies, the basic principle of their operation remains very similar.

We will consider the operation of the vector representation using the most commonly used model, word2vec, as an example. An embedding vector is a set of features, or descriptions, and their values specific to a particular word. In the figure below we give the standard example from Google, the creators of this model.

Figure 4. Vector representation of words in word2vec

In the figure you can see that, in terms of their meanings, the words "man" and "woman" are closer to each other than either is to the word "king". At the same time, the word "king" can conditionally be represented as the sum of vectors such as "man" + "ruler", while the word "queen" is the same "ruler" vector combined with "woman". By simple logic, if in the word "king" we replace the "man" vector with "woman", we obtain "queen". That, put simply, is how the vector representation of words shown in Figure 5 works.
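Before turning to that figure, here is a minimal sketch of how such closeness is usually measured, with cosine similarity; the three-dimensional vectors are invented for illustration only and are not real word2vec values.

import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented toy vectors (real word2vec vectors have tens or hundreds of dimensions).
man   = np.array([0.90, 0.10, 0.20])
woman = np.array([0.85, 0.15, 0.25])
king  = np.array([0.90, 0.80, 0.10])

print(cosine(man, woman))  # noticeably higher...
print(cosine(man, king))   # ...than this value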


Figure 5. Vector representation of words in word2vec

It can be seen in the figure that the conditional "gender" component of the vector, marked in light blue for "king" and "man", has the same color value in the word "queen" as in the word "woman". However, this is an example in which, without knowing exactly what is encoded in each dimension, one can only make assumptions. In real examples the vector consists of many more components. Let's take the following as an example.

The following is the embedding for the word "king" (a vector trained on Wikipedia):

[ 0.50451, 0.68607, -0.59517, -0.022801, 0.60046, -0.13498, -0.08813, 0.47377, -0.61798, -0.31012, -0.076666, 1.493, -0.034189, -0.98173, 0.68229, 0.81722, -0.51874, -0.31503, -0.55809, 0.66421, 0.1961, -0.13495, -0.11476, -0.30344, 0.41177, -2.223, -1.0756, -1.0783, -0.34354, 0.33505, 1.9927, -0.04234, -0.64319, 0.71125, 0.49159, 0.16754, 0.34344, -0.25663, -0.8523, 0.1661, 0.40102, 1.1685, -1.0137, -0.21585, -0.15155, 0.78321, -0.91241, -1.6106, -0.64426, -0.51042 ]

As we can see, there are 50 values, but by themselves they mean little to human perception. For ease of presentation, let us visualize these embeddings:
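One possible way to produce such a visualization is sketched below with matplotlib; the color scale and layout are assumptions rather than the exact figure from the article.

import numpy as np
import matplotlib.pyplot as plt

# The 50-dimensional GloVe vector for "king" quoted above.
king = np.array([0.50451, 0.68607, -0.59517, -0.022801, 0.60046, -0.13498,
                 -0.08813, 0.47377, -0.61798, -0.31012, -0.076666, 1.493,
                 -0.034189, -0.98173, 0.68229, 0.81722, -0.51874, -0.31503,
                 -0.55809, 0.66421, 0.1961, -0.13495, -0.11476, -0.30344,
                 0.41177, -2.223, -1.0756, -1.0783, -0.34354, 0.33505,
                 1.9927, -0.04234, -0.64319, 0.71125, 0.49159, 0.16754,
                 0.34344, -0.25663, -0.8523, 0.1661, 0.40102, 1.1685,
                 -1.0137, -0.21585, -0.15155, 0.78321, -0.91241, -1.6106,
                 -0.64426, -0.51042])

# Draw the vector as a single row of colored cells: one cell per dimension.
plt.figure(figsize=(12, 1))
plt.imshow(king[np.newaxis, :], cmap="RdBu", aspect="auto")
plt.yticks([])
plt.colorbar()
plt.title('GloVe embedding of "king"')
plt.show()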



Figure 6. Visualization of the vector embedding

Next, consider the same word vectors as in our simplified example.

Figure 7. Visualization of the word vector embeddings

Here, as before, the words "man" and "woman" are closer to each other than to "king".

Let's move on to combining vectors. The results of addition, subtraction and all other operations on word vectors are called analogies. Analogies yield new word meanings depending on the operations performed. However, this does not mean that adding two vectors is guaranteed to give exactly the third vector we need; we obtain an approximate, very close value. It may even be identical, but the chance of that is quite small. The example below shows how this works.

king − man + woman ≈ queen
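As a sketch of how such an analogy can be computed in practice with the Gensim library (the file name of the pretrained vectors is an assumption; any file in word2vec text format would do):

from gensim.models import KeyedVectors

# Load pretrained vectors in word2vec text format (the file name is illustrative).
vectors = KeyedVectors.load_word2vec_format("glove.6B.50d.w2v.txt", binary=False)

# king - man + woman: positive vectors are added, negative vectors are subtracted.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# The nearest word is typically "queen", but it is only approximately equal to the resulting vector.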


Figure 8. Vector embeddings for king, man, woman, king − man + woman and queen

As can be seen in the figure, the result of adding and subtracting the vectors for "king", "man" and "woman" is not exactly equal to the representation of "queen". It is, however, the closest value among the roughly 400,000 embeddings in the dataset from which these values are taken. Building such analogies and searching for the word meanings closest to the resulting vector is handled by special libraries, for example Gensim [6]. The figure below shows a set of some other vectors for comparison and analysis of the vector embedding model.

Figure 9. Visualization of GloVe vector embeddings

In the figure, you can notice several features:

1. One red column runs through all the words. That is, these words are similar in this particular dimension (at the same time, it is not known what is encoded in it).

2. You can see the similarities between "woman" and "girl", similarly in the case of "man" and "boy".

3. "Boy" and "girl" are similar in some dimensions, but differ from "woman" and "male.".

4. There are clear dimensions where the "king" and "queen" are similar to each other and different from all the others.

Model Comparison

Using the description of a vector model as an example, we have demonstrated the basic principles of how embeddings are constructed and interact, as well as the mathematical foundations of language models.

As mentioned above, we conducted a small comparative analysis of popular language models. The most popular of them, word2vec, was used as the running example in this article. Two more models, GloVe and fastText, were chosen for comparison.

Briefly about each model:

word2vec. This model works on large corpora (samples) of texts, which makes it possible to determine the relationships of word forms to each other (for example, gender, as with "king" and "queen"). The vectors themselves are built with the help of these connections.

The main advantage of this model is its methodology, which serves as a basis for developing new models.

GloVe (Global Vectors)

This model was created at about the same time as the previous one, so parallels are very often drawn between them. Unlike word2vec, GloVe works with co-occurrence statistics: it minimizes a least-squares error and produces a word vector space with a meaningful substructure [7]. This allows one to work fairly accurately both with the vector substructure and with the vectors in general. Different word vectors can be linked, such as a language and all of its dialects.

The main advantage of this model is that it complements word2vec by adding the frequency of occurrence of words.

FastText.

This model, developed at Facebook, is, like many others, a continuation of word2vec. The main difference is that instead of whole words the embeddings are built from so-called n-grams, parts of those words. The most popular value of n is 3, which gives trigrams. The main advantage of this model is that it works well with rare or unknown words.
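A small sketch of the n-gram idea behind fastText: splitting a word into character trigrams, with the boundary markers "<" and ">" that fastText conventionally adds (the helper name is ours):

def char_ngrams(word, n=3):
    """Split a word into character n-grams with fastText-style boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("king"))  # ['<ki', 'kin', 'ing', 'ng>']
# The embedding of a rare or unknown word can then be built from the vectors of its n-grams.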

As the form of analysis, a comparative table was chosen, with a numerical description of the capabilities of each of the described models. In comparative experiments on the main capabilities of the models, we obtained the results presented in the table as expert assessments on a ten-point scale. Scores were given to each of the models:

Table 1. Expert assessment of the models' capabilities (ten-point scale)

Model | Simplicity of architecture | Learning speed | Semantic load of vectors | Work with rare occurrences | Sentence-level work | Ignoring co-occurrences
word2vec | 4 | 7 | 8 | 3 | 2 | 3
GloVe | 3 | 9 | 7 | 2 | 1 | 2
fastText | 5 | 8 | - | 8 | 2 | 2

Results and Conclusions

As can be seen from the table, our comparative analysis showed that no single model clearly dominates.

The best result was shown by the word2vec model, with an absolute value of 27 points. However, the differences between the models are small, in the region of 10% (11% and 7.5% of the maximum number of points). For example, the GloVe model showed the best result for the "learning speed" parameter (9 points) but has the worst result when working with rare occurrences (2 points). The word2vec model has the best average result, with the smallest deviations across the individual parameters. This can be explained by the fact that it is the basic model, developed earlier than the others. The high scores of the other models on individual indicators are explained by the fact that these models are refinements of word2vec for particular areas of work.
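The totals behind these figures can be reproduced from Table 1 with a small sketch (the missing fastText score for semantic load is simply excluded from its total, which is our assumption):

# Expert scores from Table 1; None marks the missing fastText value.
scores = {
    "word2vec": [4, 7, 8, 3, 2, 3],
    "GloVe":    [3, 9, 7, 2, 1, 2],
    "fastText": [5, 8, None, 8, 2, 2],
}

totals = {m: sum(v for v in vals if v is not None) for m, vals in scores.items()}
best = max(totals.values())
for model, total in totals.items():
    print(model, total, f"{100 * (best - total) / best:.1f}% behind the leader")
# word2vec totals 27 points; GloVe and fastText trail it by roughly 11% and 7.5%.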

Each model should be applied, in accordance with its advantages, in the context of analyzing and processing natural-language texts for language modeling problems. Unfortunately, this point is often overlooked, which leads to suboptimal results. This once again shows the importance of choosing the optimal language model.


References

1. Frankovsky M., Birknerová Z., Stefko R., Benková E. Implementing the concept of neurolinguistic programming related to sustainable human capital development // Sustainability. 2019. Vol. 11, No. 15. P. 4031.

2. Khashkovsky A.V. Neurolinguistic programming as an advertising tool // Marketing and Sales Director. 2015. No. 8. P. 63-68.

3. Morkovkin A.G., Popov A.A. Application of the universal language model fine-tuning method for the task of classification of intentions // Science. Technology. Innovation: collection of scientific papers, in 9 parts / ed. A.V. Gadyukina. 2019. P. 168-170.

4. Lobo J.L., Laña I., Del Ser J., Bilbao M.N., Kasabov N. Evolving Spiking Neural Networks for online learning over drifting data streams // Neural Networks. 2018. Vol. 108. P. 1-19.

5. Grinin I. Development, testing and comparison of models of sentimental analysis of short texts // Innovations and Investments. No. 6. P. 186-190.

6. The wonderful world of Word Embeddings: what are they and why are they needed? https://habr.com/ru/company/ods/blog/329410

7. Overview of the four popular NLP models. https://proglib.io/p/obzor-chetyreh-populyarnyh-nlp-modeley-2020-04-21
