THE ONLINE CLASSIFICATION OF THE MOBILE APPLICATIONS TRAFFIC USING DATA MINING TECHNIQUES
DOI 10.24411/2072-8735-2018-10317
Oleg I. Sheluhin,
Moscow Technical University of Communications and Informatics, Moscow, Russia, [email protected]
Viacheslav V. Barkov,
Moscow Technical University of Communications and Informatics, Moscow, Russia, [email protected]
Sergey A. Sekretarev,
Moscow Technical University of Communications and Informatics, Moscow, Russia,
Keywords: machine learning, applications, accumulation mode, Adaptive Random Forest, Hoeffding Adaptive Tree, K nearest neighbors, Oza Bagging, classification, online, data streams.
The article describes the features of mobile application traffic classification in real time using the Adaptive Random Forest (ARF), Hoeffding Adaptive Tree, K nearest neighbors, and Oza Bagging algorithms. Two operating modes are compared: with "limited" and with "unlimited" memory. The traffic of six popular mobile applications, obtained experimentally, was analyzed; about 5,000 TCP connections were collected for each application, with various distributions of traffic intensity. The work considers the case of an even and continuous traffic flow as well as the case when the analyzed flows arrive unevenly. It is shown that the best quality metrics are achieved by the Adaptive Random Forest algorithm for both even and uneven arrival of the classified applications' traffic, and that the ARF algorithm considerably surpasses the Hoeffding Adaptive Tree, K nearest neighbors, and Oza Bagging algorithms in speed. It was found that, in the case of uneven traffic flow, the best quality metric estimates are demonstrated by the accumulation mode with a fixed window size.
Information about authors:
Oleg I. Sheluhin, Professor, d.t.s., head of Information Security department, Moscow Technical University of Communications and Informatics, Moscow, Russia
Viacheslav V. Barkov, Senior Lecturer of Information Security department, Moscow Technical University of Communications and Informatics, Moscow, Russia
Sergey A. Sekretarev, Master student of Information Security department, Moscow Technical University of Communications and Informatics, Moscow, Russia
For citation:
Sheluhin O.I., Barkov V.V., Sekretarev S.A. (2019). The online classification of the mobile applications traffic using data mining techniques. T-Comm, vol. 13, no. 10, pp. 60-67.
Problem Formulation
Classification [1,2,3] is the process of predicting an unknown attribute that characterizes the class of an observed sequence element, based on a model trained with a training data set. Unlike traditional classification algorithms, online classification algorithms cannot operate on all the data at once, split into training and test sets. As a result, the model has to be trained and tested on the fly [12]. A bottleneck in the classification of streaming data is the need to analyze it in a single pass. In the general case, single-pass analysis does not recognize changes that have occurred in the model since the flow processing began. The classification process may require the model to be built and tested at the same time in an ever-changing environment. As a result, the testing process is carried out in constant competition with the training process.
Computational approaches to streaming data analysis should rely on statistics and computational theory [9]. The high dimensionality and speed of streaming data impose additional demands on the computing resources of the system [10]. A number of methods have been developed to process streaming data with high efficiency. Such methods can be divided into two categories: data-based methods and task-based methods [11].
Many algorithms for classifying data in real time have been developed in recent years [7-9,13-17,22]. Each of these algorithms has its own capabilities and features for solving the online data classification problem [5,6].
The aim of the article is to compare the Adaptive Random Forest, Hoeffding Adaptive Tree, K nearest neighbors, and Oza Bagging algorithms for the online classification of mobile application traffic in the cases of "limited" and "unlimited" memory.
The Characteristics and Structure of Online Classification Algorithms
The Adaptive Random Forest, Hoeffding Adaptive Tree, K nearest neighbors, and Oza Bagging algorithms have become widespread in the field of online classification [1]. The most common among them is the Adaptive Random Forest (ARF) algorithm [18], which is an adaptation of the Random Forest (RF) algorithm [4,19] to the case of streaming data.
As is known, the basic RF algorithm makes several passes over the input data during training, which is not acceptable in streaming mode, because the data can arrive at high speed and in large volumes. In addition, the aim of streaming data classification is usually to process the analyzed sequence as quickly as possible. To address these shortcomings, the Adaptive Random Forest (ARF) algorithm implements two approaches:
1. Online bagging;
2. Splitting tree leaves using only a random subset of the original features.
The first approach is adopted from the Oza Bagging algorithm [20]. In the original bagging algorithm [21], each of the n base models is trained on a sample of Z elements created from the training set by randomly selecting items with replacement. Each sample contains an element of the original set K times, where K follows the binomial distribution. For large Z, this binomial distribution converges to a Poisson distribution with λ = 1.
Based on this, [20] proposes an algorithm that approximates random sampling [21] by assigning weights to the training set objects in accordance with the Poisson distribution with λ = 1.
The ARF algorithm uses the same principle, but with λ = 6 [18], which allows higher weights to be assigned to the training set objects.
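As an illustration of this weighting step, the following minimal Python sketch (not the authors' implementation; base_learners is a hypothetical list of any incremental classifiers exposing a partial_fit method) assigns each new object a Poisson-distributed weight per base model:

import numpy as np

rng = np.random.default_rng(0)

def online_bagging_step(base_learners, x, y, lam=6.0):
    # Oza-style online bagging: each base learner sees the new object
    # (x, y) k times, where k ~ Poisson(lam).  Classical Oza Bagging
    # uses lam = 1 [20]; ARF uses lam = 6 [18], so objects receive
    # higher weights on average.
    for learner in base_learners:
        k = rng.poisson(lam)
        for _ in range(k):
            learner.partial_fit([x], [y])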
The second approach is implemented by modifying the tree-building algorithm so that each split uses only a random subset of m features, where m < M and M is the total number of input data features. The pseudo-code describing the functioning of ARF is shown in Algorithm 1.
Algorithm 1 - Adaptive Random Forest
Designations:
m is the size of the feature subset used when splitting tree nodes;
n is the number of trees being trained;
δw is the threshold factor for "drift prediction" (warning);
δd is the threshold factor for determining the actual onset of drift;
T is the set of trees used for predictions;
W is the set of tree weights;
B is the set of "spare" trees;
S is the data stream;
x is a training object;
y is the class of a training object;
C() is the drift detection function;
P() is the function for determining the training effectiveness;
t is a tree used for predictions;
b is a spare tree;
ŷ is a prediction.
function AdaptiveRandomForest(m, n, δw, δd)
    T = CreateTrees(n)
    W = InitWeights(n)
    B = ∅
    while HasNext(S) do
        (x, y) = next(S)
        for all t ∈ T do
            ŷ = predict(t, x)
            W(t) = P(W(t), ŷ, y)
            RFTreeTrain(m, t, x, y)
            if C(δw, t, x, y) then
                b = CreateTree()
                B(t) = b
            end if
            if C(δd, t, x, y) then
                t = B(t)
            end if
        end for
        for all b ∈ B do
            RFTreeTrain(m, b, x, y)
        end for
    end while
end function
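For reference, an interleaved test-then-train loop with an adaptive random forest can be sketched with the scikit-multiflow library (a hedged sketch: class and parameter names follow scikit-multiflow 0.5, and "flows.csv" is a hypothetical file of per-connection features with the class label in the last column):

from skmultiflow.data import FileStream
from skmultiflow.meta import AdaptiveRandomForestClassifier

stream = FileStream("flows.csv")                 # hypothetical feature file
# n_estimators is the number of trees n; max_features controls the random
# subset of m features used at each split
arf = AdaptiveRandomForestClassifier(n_estimators=10, max_features="sqrt")

correct, seen = 0, 0
while stream.has_more_samples():
    X, y = stream.next_sample()
    if seen > 0:
        correct += int(arf.predict(X)[0] == y[0])    # test first ...
    arf.partial_fit(X, y, classes=stream.target_values)
    seen += 1                                        # ... then train
print("prequential accuracy:", correct / max(seen - 1, 1))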
When dealing with stationary data streams that do not exhibit concept drift, the part of the algorithm that works with "spare trees" is not used. Typically, drift detectors paired with an ensemble algorithm are used to work with non-stationary data flows [23, 24].
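For example, the ADWIN detector [26] can be paired with any online learner by feeding it the per-object error signal; a minimal sketch (assuming the scikit-multiflow ADWIN implementation; "model" stands for any incremental classifier) looks as follows:

from skmultiflow.drift_detection import ADWIN

adwin = ADWIN(delta=0.002)              # detector confidence parameter

def process_object(model, x, y):
    error = int(model.predict([x])[0] != y)
    adwin.add_element(error)            # feed the 0/1 error stream
    if adwin.detected_change():
        model.reset()                   # e.g. replace the drifted model with a spare one
    model.partial_fit([x], [y])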
Precision shows the fraction of the objects labeled positive by the classifier that are actually positive:
precision = TP / (TP + FP).
Recall (sensitivity) shows the fraction of correctly labeled positive instances among all instances of the positive class:
recall = TP / (TP + FN).
The F-score (Fβ) combines the two metrics above as their weighted harmonic mean:
Fβ = (1 + β²) · precision · recall / (β² · precision + recall).
The F-score reaches its maximum when precision and recall both equal one, and it is close to zero if either of them is close to zero.
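These formulas can be computed directly from the confusion-matrix counters; the short Python sketch below merely illustrates them (tp, fp, fn are hypothetical counters):

def precision_recall_f(tp, fp, fn, beta=1.0):
    # precision = TP / (TP + FP), recall = TP / (TP + FN),
    # F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f_beta

# example: tp=80, fp=20, fn=10  ->  precision = 0.8, recall ≈ 0.889, F1 ≈ 0.842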
The effectiveness of the ARF algorithm was assessed in two modes of data stream processing. The principle of dividing the stream into training and testing areas in both modes is illustrated in Figure 2.
From the experimental sequence of streaming data, 100 intervals of duration T = Ttrain + Ttest were formed, where Ttrain is the duration of the classifier training area and Ttest is the duration of the classifier testing area. The quality metrics were formed by processing multiple T intervals within a window W = n · T.
In the first case, the duration of the processing window continuously increases as the data become available. This is achieved by fixing the left boundary of the window and moving the right boundary of the window W, as shown in Figure 2a. This mode corresponds to "unlimited" memory processing, where all previously observed results are stored and taken into account in processing. Obviously, this mode is more suitable for an even and continuous flow of data (Figure 1a).
In the second case (Figure 2b), the duration of the processing window is fixed, but the window itself shifts continuously along the time axis in accordance with the dynamics of the data flow. This mode is specific to non-stationary traffic (Figure 1b) and involves discarding "obsolete" data in the "limited" memory mode.
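The difference between the two memory modes can be summarised in a small sketch (illustrative only, not the authors' code; window sizes are expressed in numbers of intervals T):

from collections import deque

# "Unlimited" memory (Figure 2a): the left edge is fixed,
# the window W = n*T grows with every new interval.
growing_window = []

# "Limited" memory (Figure 2b): a sliding window of fixed length,
# e.g. W = 20*T; the oldest interval is discarded automatically.
sliding_window = deque(maxlen=20)

def add_interval(interval):
    growing_window.append(interval)     # nothing is ever dropped
    sliding_window.append(interval)     # "obsolete" data fall out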
The Results of the Online Classification of the Mobile Application Traffic Using the ARF Algorithm
Let us consider an even distribution of data over time (stationary mode), illustrated in Figure 1a. Let us introduce the parameter K = Ttrain / T, which characterizes the relationship between the duration of the training area and the duration of the "training-testing" pair; for example, with K = 0.5 the training and testing zones of each interval have equal duration. Figure 3 shows the values of the Precision metric that characterizes the quality of the ARF classifier under even traffic flow in the data accumulation mode.
From the results presented, it is clear that for most applications the best results are obtained with a training-to-testing zone duration ratio of K = 0.5.
In the non-stationary mode of data arrival (shown in Figure 1b), as in the case of even arrival, streaming traffic consisting of 100 T intervals was analyzed. For the case of uneven traffic flow, Figure 4 shows the dependencies of the Precision metric values in the accumulation mode. From the presented dependencies, it is clear that the ARF algorithm copes well with classification when the data arrive unevenly. For all applications (except "Skype"), the best results are obtained with a training-to-testing zone duration ratio of K = 0.83.
Figure 3. Precision metric averages for even data: a) «Instagram»; b) «Sberbank Online»; c) «Mail.Ru»; d) «Skype» (curves correspond to K = 0.01, 0.1, 0.25, 0.5)
From the presented dependencies, it is clear that the ARF algorithm classifies incoming data almost 3 times faster than the RF algorithm. RF, Hoeffding Adaptive Tree, K nearest neighbors, and Oza Bagging either show worse quality performance or require more classification time.
Conclusions
Studies have shown that the ARF algorithm copes well with classification for both even and uneven arrival of the classified applications' traffic. For even traffic flow, it is recommended to use K = 0.5. In the case of uneven traffic flow, the best results are given by K = 0.83.
It has been found that, in the case of uneven traffic flow, the best results are demonstrated by the accumulation mode with a fixed window size of W = 20T.
Using the ARF algorithm allows classification to be carried out much faster (up to 3 times) than RF, Hoeffding Adaptive Tree, K nearest neighbors, and Oza Bagging, which makes it the most preferable for classification in real time.
References
1. Sheluhin O.I., Erohin S.D., Vanyushina A.V. (2018). IP traffic classification by machine learning methods. Moscow: Hotline - Telecom. 284 p.
2. Erohin S.D., Vanyushina A.V. (2018). The choice of attributes for classifying IP traffic using machine learning methods. T-Comm, vol. 12, no. 9, pp. 25-29.
3. Sheluhin O.I., Barkov V.V., Polkovnikov M.V. (2019). Comparative analysis of the algorithms for estimating the number and structure of attributes in the classification problems of mobile applications. High-tech in Earth space research, vol. 11, no. 2, pp. 90-100. DOI: 10.24411/2409-5419-2018-10263.
4. Sheluhin O.I., Barkov V.V. (2018). Influence of background traffic on the effectiveness of mobile applications traffic classification using data mining techniques. T-Comm, vol. 12, no. 10, pp. 52-57.
5. Aggarwal C. (2017). Data Streams: Models and Algorithms. Boston: Springer. Vol. 1. DOI: 10.1007/978-0-387-47534-9.
6. Bifet A., Kirkby R. (2017). Data stream mining. A practical approach. Waikato: The University of Waikato. Vol. 1.
7. Bifet A., Kirkby R. (2017). Massive online analysis manual. Waikato: The University of Waikato. Vol. 1.
8. Rajeev T., Santosh K. (2016). A Quick Review of Data Stream Mining Algorithms. Imperial Journal of Interdisciplinary Research, vol. 2, no. 7, pp. 870-873.
9. Mohammed H., Soliman A. (2010). Data stream mining. In: Data Mining and Knowledge Discovery Handbook. Ed. M. Oded, R. Lior. New York: Springer. Vol. 1, pp. 231-235.
10. Mining data streams (2017). In: Mining of Massive Datasets. Ed. J. Leskivec, A. Ullman, D. Jeffrey. Cambridge: Cambridge University Press. Vol. 2, pp. 131-162. DOI: 10.1017/CBO9781139924801.
11. Krempl G. (2014). Open Challenges for Data Stream Mining Research. SIGKDD Explorations, vol. 16, no. 1, pp. 1-10. DOI: 10.1145/2674026.2674028.
12. Rohit B., Agarwal S. (2016). Stream Data Mining: Platforms, Algorithms, Performance Evaluators and Research Trends. International Journal of Database Theory and Application, vol. 9, no. 9, pp. 201-218. DOI: 10.14257/ijdta.2016.9.9.19.
13. Fong S., Wong R., Vasilakos A. (2015). Accelerated PSO Swarm Search Feature Selection for Data Stream Mining Big Data. IEEE Transactions on Services Computing, pp. 1-1. DOI: 10.1109/TSC.2015.2439695.
14. Ueno K. et al. (2006). Anytime Classification Using the Nearest Neighbor Algorithm with Applications to Stream Mining. Sixth International Conference on Data Mining (ICDM'06). DOI: 10.1109/ICDM.2006.21.
15. Domingos P., Hulten G. (2000). Mining high-speed data streams. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '00. DOI: 10.1145/347090.347107.
16. Yusuf B., Reddy P. (2012). Mining Data Streams using Option Trees. International Journal of Computer Network and Information Security, vol. 4, no. 8, pp. 49-54. DOI: 10.5815/ijcnis.2012.08.06.
17. Pfahringer B., Holmes G., Kirkby R. (2007). New Options for Hoeffding Trees. AI 2007: Advances in Artificial Intelligence, pp. 90-99. DOI: 10.1007/978-3-540-76928-6_11.
18. Gomes H.M. et al. (2017). Adaptive random forests for evolving data stream classification. Machine Learning, vol. 106, no. 9-10, pp. 1469-1495. DOI: 10.1007/s10994-017-5642-8.
19. Breiman L. (2001). Random forests. Machine Learning, vol. 45, no. 1, pp. 5-32. DOI: 10.1023/A:1010933404324.
20. Oza N.C. (2005). Online bagging and boosting. 2005 IEEE International Conference on Systems, Man and Cybernetics. DOI: 10.1109/ICSMC.2005.1571498.
21. Breiman L. (1996). Bagging predictors. Machine Learning, vol. 24, no. 2, pp. 123-140. DOI: 10.1023/A:1018054314350.
22. Domingos P., Hulten G. (2000). Mining high-speed data streams. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp. 71-80. DOI: 10.1145/347090.347107.
23. Bifet A. et al. (2010). Fast perceptron decision tree learning from evolving data streams. Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin, Heidelberg, pp. 299-310. DOI: 10.1007/978-3-642-13672-6_30.
24. Bifet A. et al. (2009). New ensemble methods for evolving data streams. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp. 139-148. DOI: 10.1145/1557019.1557041.
25. Page E.S. (1954). Continuous inspection schemes. Biometrika, vol. 41, no. 1/2, pp. 100-115. DOI: 10.1093/biomet/41.1-2.100.
26. Bifet A., Gavalda R. (2007). Learning from time-changing data with adaptive windowing. Proceedings of the 2007 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, pp. 443-448. DOI: 10.1137/1.9781611972771.42.