Generation of coherent text. Analysis of neural network mechanics. Mechanic two: the learning model for working with a neural network
Grinin Igor Leonidovich
Master's student, Department of Software of Automated Systems, Volgograd State Technical University (VolgGTU), [email protected]
This article is the second in a series of three devoted to the mechanics of a model for generating coherent text with neural networks. It discusses the principles of training neural networks, various models that support deep learning, and some subtypes of individual deep learning models. The research methods are a comparative analysis of two of the largest neural networks, trained with different deep learning methodologies, an examination of their underlying learning models, and a detailed analysis of the principles of their operation. The result of the study is a scientific and experimental comparison of two different neural network learning models. In the course of the study, a table was compiled in which the parameters and characteristics of each model are described with estimates obtained both experimentally and from scientific statistics. The outcome is a comparison of the actual performance of two neural networks trained with different deep learning models against the figures provided by the developers in official documents. The corresponding conclusions are drawn from the results obtained. The study also yielded a number of theoretical insights into working with text that may be useful for various kinds of text data processing. Keywords: text analysis, vector representation of words, programming, neural network training, learning model, pretrained weights
Introduction
In today's world of constantly developing information technologies and artificial intelligence, the role of coherent text generation models is steadily growing.
At the moment, the scientific literature has accumulated a large amount of information about training models for neural networks [1-4]. An analysis of the eLibrary database shows that learning models are an extremely popular topic: about 40,000 articles have been published in total. However, there is a lack of research into these models as components of a single system, and, unfortunately, the question has not been studied systematically.
This article examines one module of text processing: the neural network training model.
Deep learning and text prediction are young branches of technology that require huge amounts of training data. Multi-layer neural networks are universal approximators, that is, they can be used to model almost any problem, and in theory there are no restrictions on achieving a successful result. In the real world, however, there are many limitations, ranging from an insufficient amount of data to finite computing power. For this reason, all of the largest neural networks are trained on clusters over long periods of time.
Most of these networks have millions of parameters. If we train such a network on a small dataset, this leads to overfitting: the network will only work on examples from the training data, or on examples that closely resemble them, and will not generalize (i.e. it will not work on new inputs). If, on the contrary, too much data is supplied, the network begins to treat every input parameter as an object to be processed, which also harms its operation.
Since our work is devoted to a text generation model, we will consider training approaches using texts as an example.
Low-data transfer learning (transfer learning, aka fine-tuning)
To begin with, it should be clarified that neural networks are almost never trained from scratch. In practice, a large set of suitable data (in our case, texts) is taken instead. Its size may vary from tens of thousands to millions of samples, depending on the task and on the quality, prediction accuracy, and other properties required of the neural network. In Figure 1 below, this dataset corresponds to block 5.
Fig. 1 Structural components of the learning model
The model trained on this data block is saved as "pretrained" weights (block 4 in Figure 1). When a person starts working on a specific problem for which little training data is available, they use these pre-prepared weights and continue training from them.
It should be noted that the text for the new task must be similar to the dataset on which the model was initially trained; otherwise the previous training will not be effective. To understand the learning process, consider an analogy: children first learn the alphabet and only then begin to read words. Pre-training selects the network's weights so that the network is already familiar with the kinds of texts that are common in the training data. When we then train on a small dataset, it is easier to obtain weights that are appropriate for the problem being solved [5].
Fine-tuning strategies:
1. A linear support vector machine on top of feature bottlenecks
With a small amount of data, it is impossible to train a large number of weights. The best strategy in this case is to train a support vector machine (SVM) on top of the output of the convolutional layers, just before the fully connected layers (this output is also called the bottleneck).
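As an illustration of this strategy, here is a minimal sketch (not the authors' code): a frozen pretrained encoder from the Hugging Face transformers library supplies the bottleneck features, and scikit-learn's LinearSVC is trained on top of them. The encoder here is a transformer rather than a convolutional network, and the tiny dataset is purely hypothetical.

```python
# A minimal sketch: frozen pretrained encoder -> bottleneck features -> linear SVM.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.svm import LinearSVC

texts = ["the house was warm and cozy", "the contest was lost again"]  # hypothetical data
labels = [1, 0]                                                        # hypothetical labels

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()                                   # the pretrained layers stay frozen

with torch.no_grad():                            # features only, no gradient updates
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    features = encoder(**batch).last_hidden_state[:, 0].numpy()  # [CLS] bottleneck vectors

svm = LinearSVC()                                # linear SVM on top of the bottleneck features
svm.fit(features, labels)
```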
2. Training only the last few layers
Depending on the amount of data available and the complexity of the problem being solved, one can freeze the first few layers and train only the last few. The initial layers of a neural network capture only the common features of the input data, while the deeper part of the network learns the specific shapes and parts of objects, and it is this deeper part that is trained with this method. In practice this means using zero or very low learning rates for the early layers and higher learning rates for the deeper layers.
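A minimal PyTorch sketch of this idea, using a small hypothetical network in place of a real pretrained one: the early layers receive a near-zero learning rate, the deeper layers a higher one (freezing the early layers outright is the limiting case).

```python
import torch
import torch.nn as nn

# Hypothetical network standing in for a pretrained model.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),   # "early" layers: generic features of the input
    nn.Linear(64, 32), nn.ReLU(),    # "deeper" layers: task-specific features
    nn.Linear(32, 2),                # task head
)

# Optionally freeze the early layers completely instead of giving them a tiny rate:
# for p in model[0:2].parameters():
#     p.requires_grad = False

optimizer = torch.optim.Adam([
    {"params": model[0:2].parameters(), "lr": 1e-5},  # very low rate for the first layers
    {"params": model[2:].parameters(),  "lr": 1e-3},  # higher rate for the deeper layers
])
```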
3. Freezing, pre-preparation and finetune (FPT)
This is one of the most effective techniques. It involves two steps:
a) Freezing and pre-preparation: first, the last layer is replaced with a small mini-network of two fully connected layers. Then all previously trained layers are frozen and the new network is trained. The weights of this network are stored as pretrained weights.
b) Finetune: the pretrained weights are loaded and the entire network is trained at a very low learning rate. This yields very high accuracy even with small datasets.
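The two FPT steps could look roughly as follows in PyTorch; the backbone, the two-layer mini-network and the file name are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
head = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 2))  # mini-net of 2 FC layers

# a) Freezing and pre-preparation: freeze the pretrained layers, train only the new head.
for p in backbone.parameters():
    p.requires_grad = False
model = nn.Sequential(backbone, head)
opt_head = torch.optim.Adam(head.parameters(), lr=1e-3)
# ... training loop for the head goes here ...
torch.save(model.state_dict(), "fpt_pretrained.pt")   # store as pretrained weights

# b) Finetune: load the stored weights and train the whole network at a very low rate.
model.load_state_dict(torch.load("fpt_pretrained.pt"))
for p in model.parameters():
    p.requires_grad = True
opt_full = torch.optim.Adam(model.parameters(), lr=1e-5)
```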
4. Training all layers
If there is enough data for training, one can start from the pretrained weights and use them to train the entire network.
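A minimal sketch of this last strategy, assuming a published checkpoint from the transformers library and a hypothetical two-class head: every parameter of the loaded network is handed to the optimizer, so all layers are trained.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Hypothetical checkpoint name; any suitable pretrained model could be used instead.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # all layers receive gradients
```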
Training with substitution and skipping of training data (masked training in Google BERT)
The uniqueness of this model is that it works in interaction with its own language model. For more details on language models and word embeddings, see our previous work [7].
To feed text to the input of a neural network, it must be represented as numbers. The simplest way to do this is letter by letter, feeding one letter to each input of the neural network. In this case each letter is encoded with a number from 0 to 32 (plus punctuation marks). This is the so-called character-level, or symbolic, embedding.
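A minimal sketch of such character-level encoding; here the alphabet is derived from the sample string itself, whereas in practice it would be the full alphabet plus punctuation.

```python
# Each character gets an integer index; the text becomes a sequence of these indices.
text = "I won the contest"
alphabet = sorted(set(text))                       # stand-in for the full alphabet + punctuation
char_to_id = {ch: i for i, ch in enumerate(alphabet)}
encoded = [char_to_id[ch] for ch in text]
print(encoded)                                     # one number per character fed to the network
```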
However, the results are much better when more meaningful elements are fed to the network input instead of single characters: individual syllables or whole words. This is called word-level representation, or word embeddings.
The simplest option is to build a dictionary of all the words present in the text and feed the network the number of each word in this dictionary. For example, if the word "house" occupies position 123 in the dictionary, then the network input for this word is 123.
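A minimal sketch of this dictionary-based word-level encoding, with a hypothetical sample sentence:

```python
# Build a vocabulary from the text and replace each word with its position in it.
text = "the house stood next to the brick house"
vocab = {word: i for i, word in enumerate(dict.fromkeys(text.split()))}
encoded = [vocab[word] for word in text.split()]
print(vocab["house"], encoded)                     # index of "house" and the encoded sequence
```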
However, in natural language a person associates the word "house" with many other words: "cozy", "native", "brick". This feature of language creates additional difficulties, but it also makes it possible to improve the quality of the model significantly. Google solved this as follows.
For the word "house" to carry such an associative array, the word numbers must be re-sorted so that words close in meaning stand next to each other. Suppose, for example, that "house" has the number 123, the word "brick" the number 122, and the word "bench" the number 900. As one can see, 122 and 123 are much closer to each other than either is to 900.
Fig. 2 BERT vectors
Each word is assigned not one number but several, which together form a vector of a fixed length. Figure 2 shows the structure of the BERT vectors. A detailed description of how such vectors work is given in our previous work [7].
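As a purely illustrative sketch (the vectors below are toy numbers, not real BERT embeddings), related words such as "house" and "brick" should end up with vectors closer to each other than to an unrelated word such as "bench":

```python
import numpy as np

vectors = {
    "house": np.array([0.9, 0.1, 0.3]),
    "brick": np.array([0.8, 0.2, 0.25]),
    "bench": np.array([0.1, 0.9, 0.7]),
}

def cosine(a, b):
    # Cosine similarity: close to 1 for similar directions, close to 0 for unrelated ones.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["house"], vectors["brick"]))  # high similarity
print(cosine(vectors["house"], vectors["bench"]))  # low similarity
```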
Now let us consider the training of the model itself.
The idea behind BERT is very simple: phrases in which 10 to 15% of the words are replaced by gaps ([MASK]) are fed to the input of the neural network, and the network is trained to predict these masked words.
For example, if the phrase "I won [MASK] and received [MASK]" is fed to the network, it should output the words "competition" and "prize". This is a simplified example from the official BERT page; on longer sentences the range of possible options becomes smaller and the network's answer more unambiguous.
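A minimal sketch of this masking step (toy code, not Google's implementation): roughly 15% of the tokens are replaced with [MASK], and the original words become the prediction targets.

```python
import random

tokens = "I won the contest and received the prize".split()
masked, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:        # mask roughly 15% of the words
        targets[i] = tok              # the network must recover this word
        masked.append("[MASK]")
    else:
        masked.append(tok)
print(masked, targets)
```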
For the neural network to learn to understand the relationships between sentences, it is additionally trained to predict whether the second phrase is a logical continuation of the first or a random phrase that has no relation to it.
So, for the two sentences "I won the contest." and "And got the prize.", the neural network should answer that the pair makes sense. If the second phrase is "tomato green pigeon", it should answer that this sentence has nothing to do with the first [6].
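A minimal sketch of how such next-sentence-prediction training pairs could be formed, reusing the article's own example sentences; the [CLS]/[SEP] markers follow BERT's input convention.

```python
# Each pair carries a label: 1 if the second phrase continues the first, 0 otherwise.
pairs = [
    ("I won the contest.", "And got the prize.", 1),    # logical continuation
    ("I won the contest.", "tomato green pigeon", 0),   # unrelated phrase
]
for first, second, is_next in pairs:
    sample = f"[CLS] {first} [SEP] {second} [SEP]"
    print(sample, "-> is_next =", is_next)
```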
Comparative analysis based on practical application of the models
Let us consider the practical application of these learning models by large corporations involved in the development of artificial intelligence. This analysis allows us to draw conclusions about the applicability and relevance of these models.
We considered, as examples, the two largest neural networks: GPT-3, created by OpenAI using fine-tuning technology, and Google BERT, already described above. The main indicators for comparing the neural networks in this case are two: the number of input parameters, which is responsible for the quality of learning, and the prediction accuracy (all data are taken from open sources).
Table 1
Model  | Number of parameters | Prediction accuracy
GPT-3  | 175 billion          | 60%
BERT   | 355 million          | 93%
Based on these results, we assumed that BERT, as the more accurate model, would cope better with its main task of generating text, i.e. that when tested it would produce so-called state-of-the-art text (text close to natural language).
GPT-3 is a model with a very large number of parameters. We therefore assumed that, in addition to its main task, it could also handle other language tasks, such as dialogue.
We ran a series of tests with these two models, which are analyzed in the table below. We estimated the accuracy of the results as the percentage of correspondence between the generated texts and the test examples from training.
Table 2
Model  | Text generation | Dialogue
GPT-3  | 55%             | 53%
BERT   | 85%             | 30%
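The article does not specify the exact correspondence metric used; as a purely illustrative proxy, the sketch below scores a generated text against a reference text with a simple sequence-similarity ratio expressed as a percentage.

```python
import difflib

reference = "I won the contest and received the prize"   # hypothetical test example
generated = "I won the contest and got a prize"          # hypothetical model output
score = difflib.SequenceMatcher(None, reference, generated).ratio() * 100
print(f"{score:.0f}% correspondence")
```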
Conclusions and results
As can be seen from the table, our comparative analysis of the models showed the absence of a clear dominance of one model.
BERT has much higher prediction accuracy (93% versus 60%), thanks to its learning model built around missing words. This allows it to predict text much more accurately and produce output similar to the texts it was trained on. However, its relatively small number of parameters (355 million is, of course, an enormous number for ordinary networks, but since we are talking about the giants of the global artificial intelligence industry, we will treat it as small), constrained by the capabilities of the learning model itself, limits the range of tasks it can handle.
At the same time GPT-3, currently the largest neural network, has 175 billion parameters. This allows it to process almost any text, in any language, without having to be retrained for each new task. But because of its lower prediction accuracy, the quality of the generated texts is not always satisfactory.
The results of the table show that our assumptions have been confirmed.
It is worth noting that each model should be applied according to its strengths, depending on the amount of training data and the goals required of the neural network. Unfortunately, when working with these models this point is often overlooked, which leads to suboptimal results. This once again shows the importance of choosing the optimal learning model.
References
1. Morkovkin A.G., Popov A.A. Application of the Universal Language Model Fine-tuning method for the task of classifying intentions. In: Science. Technology. Innovation: collection of scientific papers in 9 parts. Ed. A.V. Gadyukina. 2019. Pp. 168-170.
2. Godina J.J., Meurice Y., Niermann S., Oktay M.B. Symmetry breaking, duality and fine-tuning in hierarchical spin models. Nuclear Physics B - Proceedings Supplements. 2000. Vol. 83-84, No. 1-3. Pp. 703-705.
3. Bastero-Gil M., Kane G.L., King S.F. Fine-tuning constraints on supergravity models. Physics Letters B: Nuclear, Elementary Particle and High-Energy Physics. 2000. Vol. 474, No. 1-2. Pp. 103-112.
4. Petrin D.A., Belov Yu.S. Improving the quality of machine learning models in image classification problems based on the approaches of feature extraction and fine tuning of the model. Electronic Journal: Science, Technology and Education. 2020. No. 1 (28). Pp. 104-111.
5. Transfer Learning: how to quickly train a neural network using your data. https://habr.com/ru/company/binarydistrict/blog/428255/
6. BERT is a state-of-the-art language model for 104 languages. https://habr.com/ru/post/436878/
7. Grinin I.L. Development, testing and comparison of models for sentimental analysis of short texts. Innovation and Investment. 2020. No. 6. Pp. 186-190.