
Intellectual Systems and Technologies

Research article

DOI: https://doi.org/10.18721/JCSTCS.16201 UDC 004.85

FLEXIBLE DEEP FOREST CLASSIFIER WITH MULTI-HEAD ATTENTION

A.V. Konstantinov¹, L.V. Utkin¹ ✉, S.R. Kirpichenko¹

1 Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russian Federation

✉ lev.utkin@gmail.com

Abstract. A new modification of the deep forest (DF), called the attention-based deep forest (ABDF), for solving classification problems is proposed in the paper. The main idea behind the modification is to use the attention mechanism to aggregate predictions of the random forests at each level of the DF to enhance the classification performance of the DF. The attention mechanism is implemented by assigning the attention weights with trainable parameters to class probability vectors. The trainable parameters are determined by solving an optimization problem minimizing the loss function of predictions at each level of the DF. In order to reduce the number of random forests, the multi-head attention is incorporated into the DF. Numerical experiments with real data illustrate the ABDF and compare it with the original DF.

Keywords: machine learning, classification, random forest, decision tree, deep learning, attention mechanism

Acknowledgement: This work is supported by the Russian Science Foundation under grant 21-11-00116.

Citation: Konstantinov A.V., Utkin L.V., Kirpichenko S.R. Flexible deep forest classifier with multi-head attention. Computing, Telecommunications and Control, 2023, Vol. 16, No. 2, Pp. 7-16. DOI: 10.18721/JCSTCS.16201

© Konstantinov A.V., Utkin L.V., Kirpichenko S.R., 2023. Published by Peter the Great St. Petersburg Polytechnic University

Research article

DOI: https://doi.org/10.18721/JCSTCS.16201 UDC 004.85

A FLEXIBLE DEEP FOREST CLASSIFIER USING MULTI-HEAD ATTENTION

A.V. Konstantinov¹, L.V. Utkin¹ ✉, S.R. Kirpichenko¹

¹ Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russian Federation

✉ lev.utkin@gmail.com

Abstract. The paper proposes a new modification of the deep forest, called the attention-based deep forest, for solving classification problems with limited samples. The main idea behind the modification is to use the attention mechanism to aggregate the predictions of the random forests, represented as class probability vectors, at each level (layer) of the deep forest in order to enhance the classification performance of the whole model. The attention mechanism is implemented by assigning attention weights to the concatenated vectors of examples and class probability vectors so that the attention model has trainable parameters. The trainable parameters are determined by solving an optimization problem minimizing the loss function of prediction errors at each level of the deep forest during its level-wise training. In order to reduce the number of random forests, the so-called multi-head attention is incorporated into the deep forest. Numerical experiments with real data illustrate the proposed modification in terms of classification accuracy and compare it with the original deep forest.

Keywords: machine learning, classification, random forest, decision tree, deep learning, attention mechanism

Funding: This work is supported by the Russian Science Foundation under grant No. 21-11-00116.

For citation: Konstantinov A.V., Utkin L.V., Kirpichenko S.R. Flexible deep forest classifier with multi-head attention. Computing, Telecommunications and Control, 2023, Vol. 16, No. 2, Pp. 7-16. DOI: 10.18721/JCSTCS.16201

Introduction

Many ensemble-based machine learning methods have been proposed [1, 2] due to their efficiency. These methods use a combination of so-called base models to obtain more accurate predictions. Three types of ensemble-based methods can be pointed out: bagging [3], stacking [4], and boosting [5]. Each type of method has its pros and cons. One of the important bagging methods is the random forest (RF) [6], which combines the predictions of many randomly built decision trees. RFs are popular because they are simple to train and provide outstanding results for many datasets.

RFs can be regarded as powerful machine learning models. However, they cannot compete with deep neural networks. In order to partially overcome this disadvantage, Zhou and Feng [7] proposed the so-called Deep Forest (DF) or gcForest, which copies the structure of multi-layer neural networks and consists of several layers or forest cascades. Each layer of the DF consists of several RFs whose predictions are combined and used at the next layer. The DF does not require gradient-based algorithms for training. This peculiarity makes the DF simple. Moreover, it has fewer hyperparameters in comparison with neural networks. Due to the efficiency of the DF, many modifications have been proposed [8-16]. DFs have been used in various applications [17-21].


In order to improve RFs, the attention-based RFs were proposed in [22], where trainable attention weights are assigned to each tree and each example. The weights depend on how far an instance that falls into a leaf of a decision tree is from the instances that fall into the same leaf. The attention weights in the RF are used to compute the weighted average of the decision tree predictions.

It is important to note that the attention mechanism is successfully applied to neural networks to enhance their prediction abilities. It is based on the property of human perception to concentrate on an important part of the information and to ignore the rest [23]. The attention mechanism opened the door to many neural network architectures, including Transformers and natural language processing models, which are considered in detail in [23-26].

The attention-based RFs (ABRF) opened another door to attention models different from neural networks or their components. Motivated by this, we propose a new attention-based model incorporated into the DF to enhance its prediction accuracy. The main idea behind the attention in the DF is to assign attention weights to every RF at each layer in order to optimally combine the RF predictions and to produce new attended training feature vectors at each layer of the DF for training the trees and RFs of the next layer. The attention-based DF is abbreviated as the ABDF.

The paper is organized as follows. A short description of the DF proposed by Zhou and Feng [7] and of the attention mechanism is given in the next two sections. Then the general architecture of the attention-based DF is presented, followed by numerical experiments with real data that illustrate the attention-based DF and compare it with the original DF. Concluding remarks are provided in the last section.

A short introduction to the DF

Before considering the proposed model, we briefly introduce gcForest proposed by Zhou and Feng [7]. The main part of gcForest is a cascade forest structure, where each level of the cascade receives feature information processed by its preceding level and outputs its processing result to the next level [7].

The main part of the DF proposed in [7] is a cascade forest structure shown in Fig. 1. One can see from Fig. 1 that each layer (level) of the cascade consists of several RFs whose number is a tuning parameter. Every RF produces a class probability distribution vector. The class probability distributions are determined in the standard way by counting the percentage of instances of different classes at the leaf node into which the considered instance falls. The RF class probability vectors are computed by averaging the class distribution vectors across all trees in the RF. The vectors produced by all RFs at each level are concatenated with each other. Moreover, the obtained concatenated class probability vectors are concatenated with the input feature vector, producing the training or testing vector for the next level. The class probability vectors of the last level are combined into a single class probability vector by means of averaging. The final prediction corresponds to the largest probability in this vector. A greedy algorithm is used to train the DF, so that the next level of the forest cascade is trained on the feature vectors obtained from the previous level.

We suppose that there are Q levels (layers) of the DF, every level contains F forests, and every RF consists of T decision trees. It is assumed for simplicity that F and T are identical at all levels.

Suppose that there are $n$ training instances $S = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$, where $\mathbf{x}_j = (x_{j1}, \ldots, x_{jm}) \in \mathbb{R}^m$ is a feature vector of $m$ features and $y_j \in \{1, \ldots, C\}$ is the target output. The class probability vector $\mathbf{p}_j^{(t)} = (p_{j,1}^{(t)}, \ldots, p_{j,C}^{(t)})$ as the prediction of the $t$-th tree is defined as follows. Let the vector $\mathbf{x}_j$ fall into a leaf of the $t$-th tree. Then there holds

$$p_{j,c}^{(t)} = \frac{n_c}{\sum_{k=1}^{C} n_k},$$

where $c$ is the class index, $c \in \{1, \ldots, C\}$, and $n_c$ is the number of instances from class $c$ which fall into the same leaf as the vector $\mathbf{x}_j$ in the $t$-th tree.

Fig. 1. Architecture of cascade forest

In other words, $p_{j,c}^{(t)}$ is the percentage of instances from class $c$ which fall into the leaf where the instance $\mathbf{x}_j$ falls. The following condition is fulfilled for all trees:

$$\sum_{c=1}^{C} p_{j,c}^{(t)} = 1.$$

The class probability vector $\mathbf{v}_j(i) = (v_{j,1}(i), \ldots, v_{j,C}(i))$ as the prediction produced by the $i$-th RF for $\mathbf{x}_j$ is defined as

$$v_{j,c}(i) = \frac{1}{T} \sum_{t=1}^{T} p_{j,c}^{(t)}, \quad c = 1, \ldots, C,$$

where the sum is taken over the $T$ trees of the $i$-th RF.

According to [7], the concatenated vector $\mathbf{x}_j(q)$ after the $q$-th level of the DF cascade is

$$\mathbf{x}_j(q) = \left(\mathbf{x}_j, \mathbf{v}_j(1), \ldots, \mathbf{v}_j(F)\right).$$

It consists of the original vector $\mathbf{x}_j$ and the $F$ class probability vectors obtained from the $F$ RFs.
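To make the level mechanics concrete, the following minimal sketch (not the authors' code) builds the concatenated vectors $\mathbf{x}_j(q)$ for one cascade level using scikit-learn random forests; the out-of-bag or k-fold estimation of the class vectors used in the original gcForest is omitted for brevity, and all names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cascade_level(X_train, y_train, X, n_forests=2, n_trees=100, seed=0):
    """Train F forests and return the concatenation (x_j, v_j(1), ..., v_j(F))."""
    outputs = [X]
    for i in range(n_forests):
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed + i)
        rf.fit(X_train, y_train)
        outputs.append(rf.predict_proba(X))  # v_j(i): class probabilities per row of X
    return np.hstack(outputs)                # x_j(q), fed to the next level
```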

The attention mechanism and the attention-based RF

According to [24], the attention mechanism can be considered in terms of the Nadaraya-Watson kernel regression model [27, 28]. Given the training set $S$, the machine learning task is to find a function $f: \mathbb{R}^m \to \mathbb{R}$ predicting the target value $y$ of a new instance $\mathbf{x}$ based on the dataset $S$. Then the Nadaraya-Watson regression model can be written as follows:

$$\tilde{y} = \sum_{i=1}^{n} \alpha(\mathbf{x}, \mathbf{x}_i)\, y_i,$$

where $\alpha(\mathbf{x}, \mathbf{x}_i)$ are the attention weights depending on how close the vector $\mathbf{x}_i$ from the training set is to the input vector $\mathbf{x}$, i.e. the closer $\mathbf{x}_i$ to $\mathbf{x}$, the greater $\alpha(\mathbf{x}, \mathbf{x}_i)$. The weights are expressed through the kernel $K$ as:

$$\alpha(\mathbf{x}, \mathbf{x}_i) = \frac{K(\mathbf{x}, \mathbf{x}_i)}{\sum_{j=1}^{n} K(\mathbf{x}, \mathbf{x}_j)}.$$

The vector $\mathbf{x}$, the vectors $\mathbf{x}_i$ and the outputs $y_i$ are called the query, keys and values, respectively [29]. Generally, the weight $\alpha(\mathbf{x}, \mathbf{x}_i)$ depends on trainable parameters $\mathbf{w}$. If the Gaussian kernel is used to represent the attention weight, then we can write the following:

$$\alpha(\mathbf{x}, \mathbf{x}_i) = \operatorname{softmax}(\mathbf{x}, \mathbf{x}_i, \mathbf{w}) = \frac{\exp\left(-\left\|\mathbf{w}(\mathbf{x} - \mathbf{x}_i)\right\|^2\right)}{\sum_{j=1}^{n} \exp\left(-\left\|\mathbf{w}(\mathbf{x} - \mathbf{x}_j)\right\|^2\right)}.$$

Here $\mathbf{w}$ is the vector of trainable attention parameters, and $\alpha(\mathbf{x}, \mathbf{x}_i, \mathbf{w})$ is an attention scoring function that maps two vectors to a scalar. It should be noted that there are various forms of incorporating trainable parameters. As a result, different expressions for the attention weights or for the scoring function have been studied and proposed. One of the popular scoring functions is defined as

$$s(\mathbf{x}, \mathbf{x}_i) = \mathbf{w}_v^{\mathrm{T}} \tanh\left(\mathbf{W}_q \mathbf{x} + \mathbf{W}_k \mathbf{x}_i\right),$$

where $\mathbf{w}_v$, $\mathbf{W}_q$, and $\mathbf{W}_k$ are the vector and matrices of trainable parameters.

The corresponding attention is the well-known additive attention [29]. Another popular attention is the dot-product attention [30, 31]. The attention-based RF proposed in [22] is based on Huber's $\epsilon$-contamination model [32] with a specific trainable parameter, which is the contamination probability distribution.
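As an illustration, the following numpy sketch implements the Nadaraya-Watson attention with the Gaussian kernel defined above; treating the parameter vector w as an elementwise scaling of the difference x - x_i is an assumption made here for concreteness.

```python
import numpy as np

def gaussian_attention(x, keys, values, w):
    """Nadaraya-Watson attention: alpha_i proportional to exp(-||w*(x - x_i)||^2)."""
    scores = -np.sum((w * (x - keys)) ** 2, axis=1)  # one score per key x_i
    scores -= scores.max()                           # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # softmax attention weights
    return alpha @ values                            # weighted sum of the values
```

For example, with the keys equal to the training feature vectors and the values equal to the targets $y_i$, the function returns the Nadaraya-Watson estimate $\tilde{y}$.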

Generally, the attention function (pooling) can be represented as

$$\mathbf{e} = f\left(\mathbf{W}_q \mathbf{x}, \mathbf{W}_k \mathbf{x}_i, \mathbf{W}_v y_i\right),$$

where $\mathbf{e}$ is the output of the attention module (the embedding).

Another approach for improving and extending the attention mechanism is to use the multi-head attention, which is based on the joint use of different representations of queries, keys, and values in order to take into account multiple different aspects of the data. The multi-head attention is implemented by means of different trainable parameters (heads) $\mathbf{W}_q^{(h)}$, $\mathbf{W}_k^{(h)}$, and $\mathbf{W}_v^{(h)}$. In this case, each attention head $\mathbf{e}^{(h)}$ is written as

$$\mathbf{e}^{(h)} = f\left(\mathbf{W}_q^{(h)} \mathbf{x}, \mathbf{W}_k^{(h)} \mathbf{x}_i, \mathbf{W}_v^{(h)} y_i\right).$$

When the attention is implemented by neural networks, the heads are determined by different initialization of the neural network parameters. After computing vectors e(h), h = 1, ..., H, the heads are concatenated.
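A multi-head version can be sketched on top of the single-head function above by giving each head its own parameters and concatenating the head outputs; using per-head kernel parameters instead of the projection matrices $\mathbf{W}_q^{(h)}$, $\mathbf{W}_k^{(h)}$, $\mathbf{W}_v^{(h)}$ is a simplification made only for illustration.

```python
import numpy as np

def multi_head_attention(x, keys, values, head_params):
    """head_params has shape (H, m): one kernel parameter vector per head."""
    heads = [gaussian_attention(x, keys, values, w_h) for w_h in head_params]
    return np.concatenate([np.atleast_1d(e) for e in heads])  # (e^(1) || ... || e^(H))
```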

The attention-based DF

Let us return to the DF. Suppose that we have trained RFs consisting of $T$ decision trees at the first level of the forest cascade, and the instance $\mathbf{x}$ is fed to the $i$-th RF. Let us compute the reconstruction $\bar{\mathbf{x}}(i)$ of the input feature vector produced by the $i$-th RF as follows:

$$\bar{\mathbf{x}}(i) = \sum_{k=1}^{T} \alpha\left(\mathbf{x}, \bar{\mathbf{x}}^{(k)}(i)\right) \bar{\mathbf{x}}^{(k)}(i),$$

where the reconstruction produced by the $k$-th tree is

$$\bar{\mathbf{x}}^{(k)}(i) = \frac{1}{\#\mathcal{J}_i^{(k)}(\mathbf{x})} \sum_{\mathbf{x}_s \in \mathcal{J}_i^{(k)}(\mathbf{x})} \mathbf{x}_s.$$

Here $\mathcal{J}_i^{(k)}(\mathbf{x})$ is the set of instances from the training set $S$ which fall into the same leaf of the $k$-th tree in the $i$-th RF as the vector $\mathbf{x}$; $\#\mathcal{J}_i^{(k)}(\mathbf{x})$ is the number of elements in the set $\mathcal{J}_i^{(k)}(\mathbf{x})$. It can be seen from the above expression that $\bar{\mathbf{x}}(i)$ can be viewed as the weighted average over the vectors from $S$ which are close to $\mathbf{x}$. It is important to point out that the attention mechanism parameters for obtaining $\bar{\mathbf{x}}(i)$ can be optimized. However, such approaches complicate the training procedure, and we use the simplest averaging based on the Gaussian kernel.

Fig. 2. The modified architecture of a level incorporating the multi-head attention
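Under this simplest non-parametric variant, the reconstruction can be sketched with scikit-learn forests as follows; the kernel width gamma is a hypothetical parameter introduced only for this illustration.

```python
import numpy as np

def reconstruct(rf, X_train, x, gamma=1.0):
    """Compute x_bar(i) for a fitted forest rf and a single query vector x."""
    train_leaves = rf.apply(X_train)               # (n_train, T) leaf indices
    query_leaves = rf.apply(x.reshape(1, -1))[0]   # leaf index of x in each tree
    leaf_means = []
    for k in range(len(rf.estimators_)):
        in_same_leaf = train_leaves[:, k] == query_leaves[k]    # the set J_i^(k)(x)
        leaf_means.append(X_train[in_same_leaf].mean(axis=0))   # x_bar^(k)(i)
    leaf_means = np.vstack(leaf_means)
    scores = -gamma * np.sum((x - leaf_means) ** 2, axis=1)     # Gaussian-kernel scores
    scores -= scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()               # attention weights
    return alpha @ leaf_means                                   # x_bar(i)
```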

In order to indicate that the multi-head attention with $H$ heads is used, we denote the mean vector $\bar{\mathbf{x}}(i)$ and the vector of class probabilities $\mathbf{v}(i)$ as $\bar{\mathbf{x}}(i, h)$ and $\mathbf{v}(i, h)$, where $i$ is the number of the RF, $i = 1, \ldots, F$, and $h$ is the number of the head in the multi-head attention, $h = 1, \ldots, H$. So, the prediction of the $i$-th RF at the first cascade level, which is used in the $h$-th head of the attention, is the vector of probabilities $\mathbf{v}(i, h)$. We propose to concatenate the vectors $\bar{\mathbf{x}}(i, h)$ and $\mathbf{v}(i, h)$ in order to use the extended RF output $(\bar{\mathbf{x}}(i, h) \| \mathbf{v}(i, h))$. If there are $F$ RFs at the level, then their outputs $(\bar{\mathbf{x}}(i, h) \| \mathbf{v}(i, h))$, $i = 1, \ldots, F$, can be combined by applying the multi-head attention with $H$ heads. In this case, we obtain $H$ embedding vectors $\mathbf{e}^{(h)}$, which are concatenated for training the next level of the DF. The concatenated vector, denoted as $E$, is transformed into a vector $\mathbf{x}_{\mathrm{new}}$ of smaller size to use it at the next level of the DF cascade. This scheme is repeated for each level.

The proposed attention-based architecture of a DF level is shown in Fig. 2. One can see from Fig. 2 that the input vector $\mathbf{x}$ is fed to $F$ RFs (RF-$i$), which provide the mean vectors $\bar{\mathbf{x}}(i, h)$ and the class probabilities $\mathbf{v}(i, h)$. Then the concatenated vectors $\bar{\mathbf{x}}(i, h) \| \mathbf{v}(i, h)$ are attended with the vector $\mathbf{x}$ (blocks Attent-$h$), and we obtain $H$ vectors $\mathbf{e}^{(h)}$, which are concatenated with each other into the vector $E$. After that, the vector $\mathbf{x}_{\mathrm{new}}$ is calculated as $\mathbf{x}_{\mathrm{new}} = \mathbf{W} E$, where the matrix $\mathbf{W}$ is trained jointly with the attention modules. Predictions of each head in the multi-head attention depend on a subset of samples that corresponds to the head: only samples from the subset are used to reconstruct the input vector and to estimate the class probabilities. The subsets for the heads are generated using an $H$-fold division of the training set $S$. The attention parameters are trained by using the same folds.

The proposed architecture has several advantages. First of all, it is flexible. We can change the number of RFs and the number of heads in the multi-head attention. We can change the sizes of the embeddings $\mathbf{e}^{(h)}$ and the size of the vector $\mathbf{x}_{\mathrm{new}}$. All attention modules, as well as the procedure reducing the concatenated vector $(\mathbf{e}^{(1)} \| \ldots \| \mathbf{e}^{(H)})$ to the vector $\mathbf{x}_{\mathrm{new}}$, have trainable parameters, which allows us to obtain the best results. Secondly, we can reduce the number of RFs, which are computationally expensive to train, by increasing the number of heads in the multi-head attention. This is a very important feature of the attention-based architecture. Thirdly, by changing the parameters of each level, we can obtain a heterogeneous structure of the DF, which leads to improved predictions of the whole model.

The simplest implementation of the attention-based DF uses non-parametric attention mechanisms, and the output feature vector $\mathbf{x}_{\mathrm{new}}$ is obtained by averaging the vectors $\mathbf{e}^{(1)}, \ldots, \mathbf{e}^{(H)}$. In this case, only the RFs are trained; the other components simply compute their outputs for the given inputs.
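A sketch of this simplest variant of a level is given below. It combines the illustrative helpers reconstruct and gaussian_attention defined earlier; comparing the query x only with the reconstruction part of each key and collapsing the heads into a single non-parametric attention pass are simplifications made here, since without trainable parameters and per-head folds all heads coincide.

```python
import numpy as np

def abdf_level(forests, X_train, x):
    """Non-parametric ABDF level: returns x_new for a single input vector x."""
    keys = []
    for rf in forests:                                  # the F forests of the level
        x_bar = reconstruct(rf, X_train, x)             # x_bar(i)
        v = rf.predict_proba(x.reshape(1, -1))[0]       # v(i)
        keys.append(np.concatenate([x_bar, v]))         # (x_bar(i) || v(i))
    keys = np.vstack(keys)
    # attend over the F forest outputs with the query x (Gaussian kernel,
    # compared against the reconstruction part of each key)
    scores = -np.sum((x - keys[:, : x.shape[0]]) ** 2, axis=1)
    scores -= scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ keys                                 # x_new for the next level
```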

Numerical Experiments

In order to illustrate the attention-based DF, we investigate the model on datasets from the UCI Machine Learning Repository [33]. Table 1 gives a brief description of these datasets; more detailed information can be found in the data resources. Table 1 contains the number of features m, the number of instances n and the number of classes C for the corresponding dataset.

The ABDF implementation is based on the Bosk framework, which is available at https://github.com/NTAILab/bosk.

Each level of the cascade structure consists of two RFs, and each RF consists of 100 decision trees for almost all datasets, except for the datasets WDBC, TTTE and Biodeg, where the numbers of trees in the corresponding RFs are 1000, 500 and 500. The number of cascade levels is set to 3. The number of heads in the multi-head attention is 4.

The accuracy measure A used in the numerical experiments is the proportion of correctly classified cases in a data sample. To evaluate the average accuracy, we perform cross-validation with 100 repetitions, where in each run we randomly select n_tr = 3n/4 training instances and n_test = n/4 testing instances. Different values of the hyperparameters were tested, choosing those leading to the best results.
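The evaluation protocol can be sketched as follows; make_model stands for any classifier constructor (the ABDF or the original DF), and the function name is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def repeated_accuracy(make_model, X, y, n_repeats=100, seed=0):
    """Average accuracy over repeated random 3:1 train/test splits."""
    scores = []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed + r)
        model = make_model().fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    return np.mean(scores), np.std(scores)
```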

Numerical results of comparison of the original DF and the ABDF are shown in Table 2, where the first column contains abbreviations of the tested data sets, the second column contains the accuracy (the mean and standard deviation) of the ABDF, the third column contains accuracy values of the original DF. It can be seen from Table 2 that the proposed attention-based DF outperforms the original DF for most considered datasets.

Another interesting question is how the number of heads in the multi-head attention impacts the prediction accuracy. To study this question, the datasets WDBC and TTTE are used, and the accuracy measures are obtained for 2, 4, and 6 heads. The corresponding accuracy values for the dataset WDBC are 95.34, 96.64, and 97.20. The accuracy values for the dataset TTTE are 96.87, 97.08, and 97.36. It can be seen from the results that increasing the number of heads improves the classification accuracy. On the other hand, a large number of heads in the multi-head attention significantly increases the computation time for training the ABDF. An optimal number of heads can be selected only in the testing phase.


Table 1

Brief introduction to datasets

Data set | Abbreviation | m | n | C
Haberman's Breast Cancer Survival | Haberman | 3 | 306 | 2
Ionosphere | Ion | 34 | 351 | 2
Seeds | Seeds | 7 | 210 | 3
Teaching Assistant Evaluation | TAE | 5 | 151 | 3
Tic-Tac-Toe Endgame | TTTE | 9 | 958 | 2
QSAR Biodegradation | Biodeg | 41 | 1055 | 2
Parkinsons | Parkinsons | 22 | 195 | 2
Connectionist Bench | Sonar | 60 | 208 | 2
SPECT Heart | SPECT | 22 | 267 | 2
SPECTF Heart | SPECTF | 44 | 267 | 2
Breast Cancer Wisconsin | WDBC | 30 | 569 | 2

Table 2

Accuracy values (the mean and standard deviation) for comparison of the ABDF with the original DF

Dataset | ABDF | DF
Haberman | 71.69±3.38 | 67.4±4.25
Ion | 93.98±1.76 | 91.7±2.74
Parkinsons | 92.65±2.08 | 91.84±3.65
Seeds | 95.28±3.50 | 93.21±2.56
SPECTF | 80.15±4.63 | 81.04±4.43
SPECT | 82.94±4.33 | 82.18±6.46
WDBC | 96.41±2.14 | 95.31±1.90
Sonar | 85.77±5.52 | 83.08±4.11
TAE | 61.05±8.47 | 59.74±7.81
TTTE | 97.92±1.05 | 97.63±0.93
Biodeg | 86.63±1.38 | 87.27±1.65

Conclusion

The paper presented a new efficient modification of the DF. The main idea behind the proposed model is to incorporate the multi-head attention into each level of the DF. Numerical experiments showed that this idea leads to the model that outperforms the original DF.

The proposed model has several advantages. First, it allows us to reduce the number of RFs by increasing the number of heads in the multi-head attention mechanism at each level of the DF cascade. We can even use a single RF because the multi-head attention plays the role of the base models like RFs. Secondly, it provides better results due to the use of the attention mechanism. Thirdly, it is flexible due to the data representation at the levels of the DF. Indeed, the output vector $\mathbf{x}_{\mathrm{new}}$ can have a structure different from the input vector produced by the previous level. As a result, RFs at the next level do not depend on RFs from the previous level, and we can expect better results due to a kind of diversity of the base models. Fourthly, the ABDF opens the door to developing new modifications of the DF based on various forms of the attention mechanism. One of the direct modifications is to change the procedure for computing the average feature vector $\bar{\mathbf{x}}(i)$ produced by the $i$-th RF. We used the simplest procedure of weighted averaging of all vectors that fall into the same leaves as the vector $\mathbf{x}$. However, the self-attention can be applied to take into account the context of the data, as is done in Transformers. The self-attention can be incorporated into the multi-head attention. The above modifications, as well as many other ones, can be regarded as directions for further research.

REFERENCES

1. Rokach L. Ensemble-based classifiers. Artificial Intelligence Review, 2010, Vol. 33 (1-2), pp. 1-39.

2. Zhou Z.-H. Ensemble Methods: Foundations and Algorithms. CRC Press, Boca Raton, 2012.

3. Breiman L. Bagging predictors. Machine Learning, 1996, Vol. 24 (2), pp. 123-140.

4. Wolpert D.H. Stacked generalization. Neural Networks, 1992, Vol. 5 (2), pp. 241-259.

5. Freund Y., Schapire R.E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997, Vol. 55 (1), pp. 119-139.

6. Breiman L. Random forests. Machine learning, 2001, Vol. 45 (1), pp. 5-32.

7. Zhou Z.-H., Feng J. Deep forest: Towards an alternative to deep neural networks. Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI'17), Melbourne, Australia, AAAI Press, 2017, pp. 3553-3559.

8. Lyu S.-H., He Y.-X., Zhou Z.-H. Depth is more powerful than width with prediction concatenation in deep forest. Advances in Neural Information Processing Systems, 2022, no. 35, pp. 29719-29732.

9. Miller K., Hettinger C., Humpherys J., Jarvis T., Kartchner D. Forward thinking: Building deep random forests. arXiv:1705.07366, 20 May 2017.

10. Pang M., Ting K.M., Zhao P., Zhou Z.-H. Improving deep forest by confidence screening. Proceedings of the 18th IEEE International Conference on Data Mining (ICDM'18), Singapore, 2018, pp. 1-6.

11. Utkin L.V. An imprecise deep forest for classification. Expert Systems with Applications, 2020, Vol. 141 (112978), pp. 1-11.

12. Utkin L.V., Konstantinov A.V., Chukanov V.S., Meldo A.A. A new adaptive weighted deep forest and its modifications. International Journal of Information Technology & Decision Making, 2020, Vol. 19 (4), pp. 963-986.

13. Utkin L.V., Ryabinin M.A. A Siamese deep forest. Knowledge-Based Systems, 2018, no. 139, pp. 13-22.

14. Wen H., Zhang J., Lin Q., Yang K., Jin T., Lv F., Pan X., Huang P., Zha Z.-J. Multi-level deep cascade trees for conversion rate prediction. arXiv:1805.09484, May 2018.

15. Xia H., Tang J., Qiao J., Zhang J., Yu W. DF classification algorithm for constructing a small sample size of data-oriented DF regression model. Neural Computing and Applications, 2022, Vol. 34 (4), pp. 2785-2810.

16. Zhang X., Wang M. Weighted random forest algorithm based on Bayesian algorithm. Journal of Physics: Conference Series, 2021, Vol. 1924, p. 012006. IOP Publishing.

17. Molaei S., Havvaei A., Zare H., Jalili M. Collaborative deep forest learning for recommender systems. IEEE Access, 2021, no. 9, pp. 22053-22061.

18. Panda B., Swagatika S., Sahoo S., Singh D. A novel approach for breast cancer data classification using deep forest network. Intelligent and Cloud Computing: Proceedings of ICICC 2019, Springer, 2021, no. 2, pp. 309-316.

19. Sun L., Mo Z., Yan F., Xia L., Shan F., Ding Z., Song B., Gao W., Shao W., Shi F., et al. Adaptive feature selection guided deep forest for COVID-19 classification with chest CT. IEEE Journal of Biomedical and Health Informatics, 2020, Vol. 24 (10), pp. 2798-2805.

20. Su R., Liu X., Wei L., Zou Q. Deep-Resp-Forest: A deep forest model to predict anti-cancer drug response. Methods, 2019, no. 166, pp. 91-102.

21. Zhou T., Sun X., Xia X., Li B., Chen X. Improving defect prediction with deep forest. Information and Software Technology, 2019, no. 114, pp. 204-216.

22. Utkin L.V., Konstantinov A.V. Attention-based random forest and contamination model. Neural Networks, 2022, no. 154, pp. 346-359.

23. Niu Z., Zhong G., Yu H. A review on the attention mechanism of deep learning. Neurocomputing, 2021, no. 452, pp. 48-62.

24. Chaudhari S., Mithal V., Polatkan G., Ramanath R. An attentive survey of attention models. ACM Transactions on Intelligent Systems and Technology, 2021, Vol. 12 (5), pp. 1-32. Article 53.

25. Correia A.S., Colombini E.L. Attention, please! A survey of neural attention models in deep learning. Artificial Intelligence Review, 2022, Vol. 55 (8), pp. 6037-6124.

26. Lin T., Wang Y., Liu X., Qiu X. A survey of transformers. arXiv:2106.04554, Jul 2021.

27. Nadaraya E.A. On estimating regression. Theory of Probability & Its Applications, 1964, Vol. 9(1), pp. 141-142.

28. Watson G.S. Smooth regression analysis. Sankhya: The Indian Journal of Statistics, Series A, 1964, pp. 359-372.

29. Bahdanau D., Cho K., Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, Sep 2014.

30. Luong T., Pham H., Manning C.D. Effective approaches to attention-based neural machine translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, The Association for Computational Linguistics, 2015, pp. 1412-1421.

31. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser L., Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems, 2017, pp. 5998—6008.

32. Huber P.J. Robust Statistics. Wiley, New York, 1981.

33. Lichman M. UCI machine learning repository, 2013. https://archive.ics.uci.edu/ml/index.php

INFORMATION ABOUT AUTHORS

Andrei V. Konstantinov

E-mail: andrue.konst@gmail.com https://orcid.org/0000-0003-2275-1473

Lev V. Utkin


E-mail: lev.utkin@gmail.com https://orcid.org/0000-0002-5637-1420

Stanislav R. Kirpichenko

E-mail: kirpichenko.sr@gmail.com https://orcid.org/0000-0003-2275-1473

Submitted: 28.05.2023; Approved: 25.06.2023; Accepted: 06.07.2023.
