INFLUENCE OF BACKGROUND TRAFFIC ON THE EFFECTIVENESS OF MOBILE APPLICATIONS TRAFFIC CLASSIFICATION USING DATA MINING TECHNIQUES
DOI 10.24411/2072-8735-2018-10157
Oleg I. Sheluhin,
Moscow Technical University of Communications and Informatics, Moscow, Russia, [email protected]
Vyacheslav V. Barkov,
Moscow Technical University of Communications and Informatics, Moscow, Russia, [email protected]
Keywords: classification, reliability, data mining, attributes, Random Forest, SVM, Naive Bayes, C4.5, Adaptive Boost, metrics, protocol, flow, packet, application.
The article is devoted to the classification of mobile traffic by application using machine-learning algorithms. It considers the influence of background traffic (BT) on classification quality, identifies the best and the worst classification algorithms, and determines which application types are most affected by background traffic.
We selected a set of the twelve most popular applications for classification and an additional set of six applications as background traffic. We used widespread machine-learning algorithms: Naive Bayes, C4.5, Adaptive Boost, Random Forests (RF), and Support Vector Machine (SVM).
To assess the effectiveness of the classification algorithms we used the following metrics: Precision, Recall, F-Measure, ROC curves (Receiver Operating Characteristic curves), and AUC (Area Under Curve). After processing a large amount of experimentally obtained data, we found that the classification quality of all considered algorithms decreases in the presence of background traffic.
To improve classification effectiveness we propose adding an additional class, called "Unknown application", and show its effect on the quality of mobile application classification.
Information about authors:
Oleg I. Sheluhin, Professor, Doctor of Technical Sciences, Head of the Department of Information Security, Moscow Technical University of Communications and Informatics, Moscow, Russia
Vyacheslav V. Barkov, Senior Lecturer, Department of Information Security, Moscow Technical University of Communications and Informatics, Moscow, Russia
For citation:
Sheluhin O.I., Barkov V.V. (2018). Influence of background traffic on the effectiveness of mobile applications traffic classification using data mining techniques. T-Comm, vol. 12, no. 10, pp. 52-57.
Problem formulation
According to statistics [1] collected in December 2017, about 66% of all network traffic is generated by mobile devices (smartphones and tablets). Controlling access to Internet resources from mobile applications is therefore a relevant and important problem. Such control may involve blocking access to illegal (extremist, antisocial and other) information, or preventing the use of Internet resources for unintended purposes, in particular restricting and controlling access to entertainment and other resources for personal use. Preventing the leakage of confidential information via the Internet is also important.
Identification by the telecom operator of the applications used by a subscriber is important for statistical analysis. Such statistics help not only to monitor the network status and detect failures but also, if necessary, to restrict access to network resources that can harm the user from the point of view of information security.
Data mining methods have been widely used to solve such problems [4, 5, 6]. They allow the developed system to adapt easily to the constantly changing nature of Internet resources and to take into account the specifics of network traffic analysis.
At the same time, most known works on traffic classification [5-10] do not take into account the fundamental requirement of identifying unknown types of traffic. In some cases, when supervised classifiers are designed, unknown traffic is completely excluded from consideration and it is assumed that only known classes of applications are present. In other cases, unknown traffic was absent from most experiments: classifiers were trained on data from a limited number of application classes and tested on other data from the same known classes.
The aim of the article is to study the classification of mobile application traffic by machine learning methods in the presence of background traffic, to find the best classification algorithm, and to analyze which types of application traffic and which classification algorithms are most sensitive to background traffic.
Implementing the proposed approaches will make it possible to classify, analyze and filter the network traffic of malicious and unwanted applications more efficiently, as measured by the proposed indicators.
Capture and analysis of mobile application network traffic
To create a database of mobile application traffic, we designed and developed the "Traffic Analysis System" software package, which includes a database server, an application server, a Web application and client software for mobile devices running the Android operating system (the mobile client).
The process of traffic collection using the "Traffic Analysis System" software package, as well as the interaction of its components with each other and with external mobile applications, is shown in Fig. 1.
The mobile client of the "Traffic Analysis System" software package is installed on a smartphone or tablet running the Android operating system. This client intercepts the network traffic packets of the specified applications that are also installed on the device.
Fig. 1. Scheme of mobile traffic collection
Intercepted network traffic packets are sent to the application server of the "Traffic Analysis System" software package, which is installed on a server computer running the Windows Server 2016 operating system.
The application server of the "Traffic Analysis System" software package groups network traffic packets into flows and, using the database server, saves the data to the database.
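The grouping of packets into flows can be sketched as follows. This is a minimal illustration and not the "Traffic Analysis System" implementation: the article does not specify the flow key, so the sketch assumes the commonly used bidirectional 5-tuple (protocol plus the two endpoints), and the `Packet` record with its field names is hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical packet record; field names are assumptions for illustration.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    size: int


def flow_key(p: Packet) -> tuple:
    """Direction-independent key: both directions of a connection share it."""
    a = (p.src_ip, p.src_port)
    b = (p.dst_ip, p.dst_port)
    return (p.protocol, min(a, b), max(a, b))


def group_into_flows(packets):
    """Group packets into flows keyed by the bidirectional 5-tuple."""
    flows = defaultdict(list)
    for p in packets:
        flows[flow_key(p)].append(p)
    return dict(flows)


packets = [
    Packet("10.0.0.2", 40000, "93.184.216.34", 443, "TCP", 120),
    Packet("93.184.216.34", 443, "10.0.0.2", 40000, "TCP", 1400),
    Packet("10.0.0.2", 40001, "8.8.8.8", 53, "UDP", 60),
]
flows = group_into_flows(packets)
print(len(flows))  # 2: one TCP flow with two packets, one UDP flow
```

Sorting the two endpoints before building the key makes the request and the response of one connection fall into the same flow.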
Table 1
Summary table of the collected database of mobile traffic
No. | Application name | Package name | Type of traffic | Flow count | Packet count
1 | Mail.ru | ru.mail.mailapp | Encrypted | 5,078 | 246,184
2 | Sberbank online | ru.sberbankmobile | Encrypted | 5,110 | 241,235
3 | Skype | com.skype.raider | Encrypted | 5,244 | 232,510
4 | Pikabu | ru.pikabu.android | Encrypted | 5,329 | 265,071
5 | Instagram | com.instagram.android | Encrypted | 4,979 | 1,916,363
6 | Hearthstone | com.blizzard.wtcg.hearthstone | Encrypted | 5,028 | 227,688
7 | Wolfram | com.wolfram.android.alpha | Without encryption | 5,190 | 61,140
8 | Moskovsky komsomolets | com.mobilein.mk | Without encryption | 5,335 | 107,202
9 | Fishki.net | com.klrik88.fireader | Without encryption | 5,422 | 576,581
10 | NTV | ru.ntv.client | Without encryption | 5,908 | 233,982
11 | Pizza Sushi Vok | ru.itsifver.pizzaempire | Without encryption | 5,097 | 64,460
12 | Godville | ru.godville.android | Without encryption | 5,016 | 61,343
13 | Google Chrome | com.android.chrome | Partially encrypted | 3,865 | 620,277
14 | Kommersant | com.itsadv.kommersant | Partially encrypted | 5,325 | 338,327
15 | Booking | com.booking | Partially encrypted | 5,326 | 552,606
16 | 4PDA | ru.fourpda.client | Partially encrypted | 4,974 | 524,215
17 | Yandex browser with Alisa | com.yandex.browser | Partially encrypted | 5,132 | 139,595
18 | Badoo | com.badoo.mobile | Partially encrypted | 4,976 | 581,212
Data exchange between the components of the "Traffic Analysis System" software package is carried out over the Internet via the HTTP protocol, with messages in JSON format.
The application server includes a Web service that provides a REST API for accessing the collection of network traffic packets, managing datasets, creating and training classifiers, performing classification, and other functions.
Using the software package, traffic of mobile applications of three categories was collected: "With traffic encryption", "Without traffic encryption", and "With partial encryption of traffic".
In total, 92,334 flows and 6,989,991 network traffic packets of 18 mobile applications were collected.
The characteristics of the created database are given in Table 1.
Classification algorithms and metrics
The following machine learning algorithms were used to classify applications: Naive Bayes; C4.5 [8]; Random Forests [9]; Support Vector Machine (SVM) [10]; Adaptive Boost. To evaluate the effectiveness of the classification algorithms, the following information retrieval metrics [4, 5, 6, 7] were used: Precision; Recall; F-Measure; ROC curves (Receiver Operating Characteristic curves); AUC (Area Under Curve).
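For reference, per-class Precision, Recall and F-Measure follow the standard definitions in terms of true positives (TP), false positives (FP) and false negatives (FN): Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F-Measure is their harmonic mean. The sketch below computes them from lists of true and predicted labels; the function name and the toy labels are illustrative and are not taken from the article.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f_measure) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


# Toy labels: one Skype flow is missed, one Pikabu flow is taken for Skype.
y_true = ["Skype", "Skype", "Skype", "Pikabu", "Pikabu", "NTV"]
y_pred = ["Skype", "Skype", "Pikabu", "Pikabu", "Skype", "NTV"]
precision, recall, f_measure = per_class_metrics(y_true, y_pred, "Skype")
print(precision, recall, f_measure)
```

For the class "Skype" in this toy example TP = 2, FP = 1 and FN = 1, so all three metrics equal 2/3.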
Results of classification of mobile applications in the presence of background traffic
Two data sets were used in the experiment. The first included the 12 selected applications without background traffic. The second included, in addition to the 12 selected applications, background traffic consisting of the remaining 6 of the 18 applications listed in Table 1. The classification results are shown in Tables 2 and 3.
Table 2 summarizes the Recall parameter and Table 3 the Precision parameter of each algorithm for each class, for both data sets.
Table 2
Summary table of the Recall parameter
(for each classification algorithm: No = without background traffic, Yes = with background traffic)
Application | Random Forest (No) | Random Forest (Yes) | Naive Bayes (No) | Naive Bayes (Yes) | SVM (No) | SVM (Yes) | C4.5 (No) | C4.5 (Yes)
Instagram | 0.994 | 0.99 | 0.073 | 0.075 | 0.328 | 0.336 | 0.992 | 0.983
Mail.ru | 0.979 | 0.984 | 0.642 | 0.714 | 0.736 | 0.767 | 0.975 | 0.984
Skype | 0.989 | 0.989 | 0.773 | 0.769 | 0.742 | 0.731 | 0.996 | 0.989
Sberbank online | 0.989 | 0.98 | 0.094 | 0.097 | 0.314 | 0.325 | 0.972 | 0.97
Fishki.net | 0.976 | 0.985 | 0.593 | 0.613 | 0.522 | 0.552 | 0.971 | 0.976
Hearthstone | 0.984 | 0.989 | 0.296 | 0.28 | 0.305 | 0.295 | 0.982 | 0.979
Pikabu | 0.977 | 0.97 | 0.542 | 0.537 | 0.773 | 0.774 | 0.964 | 0.968
Wolfram | 1 | 1 | 0.885 | 0.896 | 0.951 | 0.949 | 1 | 1
Pizza Sushi Vok | 0.982 | 0.99 | 0.845 | 0.854 | 0.824 | 0.84 | 0.984 | 0.99
Moskovsky komsomolets | 1 | 0.998 | 0.483 | 0.459 | 0.54 | 0.532 | 0.998 | 0.998
Godville | 0.993 | 0.993 | 0.94 | 0.94 | 0.927 | 0.992 | 0.994 | 0.992
NTV | 0.99 | 0.994 | 0.513 | 0.555 | 0.63 | 0.654 | 0.989 | 0.984
Table 3
Summary table of the Precision parameter
(for each classification algorithm: No = without background traffic, Yes = with background traffic)
Application | Random Forest (No) | Random Forest (Yes) | Naive Bayes (No) | Naive Bayes (Yes) | SVM (No) | SVM (Yes) | C4.5 (No) | C4.5 (Yes)
Instagram | 0.989 | 0.319 | 0.487 | 0.133 | 0.387 | 0.122 | 0.99 | 0.261
Mail.ru | 0.988 | 0.184 | 0.446 | 0.19 | 0.525 | 0.243 | 0.983 | 0.192
Skype | 0.987 | 0.345 | 0.416 | 0.124 | 0.539 | 0.156 | 0.992 | 0.441
Sberbank online | 0.973 | 0.692 | 0.557 | 0.123 | 0.647 | 0.171 | 0.972 | 0.337
Fishki.net | 0.965 | 0.272 | 0.529 | 0.182 | 0.405 | 0.149 | 0.966 | 0.287
Hearthstone | 0.987 | 0.364 | 0.343 | 0.129 | 0.652 | 0.254 | 0.973 | 0.27
Pikabu | 0.978 | 0.383 | 0.485 | 0.147 | 0.56 | 0.169 | 0.963 | 0.413
Wolfram | 0.999 | 0.961 | 0.901 | 0.702 | 0.16 | 0.692 | 1 | 0.994
Pizza Sushi Vok | 0.997 | 0.96 | 0.629 | 0.457 | 0.847 | 0.61 | 0.995 | 0.939
Moskovsky komsomolets | 1 | 0.492 | 0.492 | 0.183 | 0.69 | 0.305 | 0.998 | 0.652
Godville | 0.995 | 0.741 | 0.903 | 0.628 | 0.917 | 0.68 | 0.991 | 0.916
NTV | 0.998 | 0.434 | 0.591 | 0.35 | 0.692 | 0.438 | 0.993 | 0.753
As can be seen from Tables 2 and 3, the presence of BT leads to significant errors for virtually all of the mobile applications and classification algorithms considered. The errors show up mainly in Precision (Table 3), while Recall (Table 2) changes little. Only for a few applications (Sberbank online, Wolfram, Godville) were the losses less significant.
The best algorithms in terms of "Recall" and "Precision" are Random Forest and C4.5. The worst are Naive Bayes and SVM. Table 4 summarizes the "Accuracy" parameter of the considered algorithms for both data sets.
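Table 4
Summary of the Accuracy parameter
(for each classification algorithm: No = without background traffic, Yes = with background traffic)
Parameter | Random Forest (No) | Random Forest (Yes) | Naive Bayes (No) | Naive Bayes (Yes) | SVM (No) | SVM (Yes) | C4.5 (No) | C4.5 (Yes)
Accuracy | 0.988 | 0.405 | 0.558 | 0.232 | 0.634 | 0.262 | 0.985 | 0.403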
As can be seen, the appearance of BT leads to a significant decrease in the reliability of mobile application classification. Random Forest and C4.5 show the greatest resistance to background traffic in terms of the "Accuracy" parameter; Naive Bayes and SVM show the least.
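The size of the decrease can be read directly from Table 4. The short calculation below uses only the accuracy values reported there and prints the absolute and relative decrease for each algorithm; the variable names are illustrative.

```python
# Accuracy without and with background traffic, as reported in Table 4.
accuracy = {
    "Random Forest": (0.988, 0.405),
    "Naive Bayes": (0.558, 0.232),
    "SVM": (0.634, 0.262),
    "C4.5": (0.985, 0.403),
}

drops = {}
for name, (without_bt, with_bt) in accuracy.items():
    absolute = without_bt - with_bt      # decrease in accuracy points
    relative = absolute / without_bt     # share of the original accuracy lost
    drops[name] = (absolute, relative)
    print(f"{name}: absolute drop {absolute:.3f}, relative drop {relative:.1%}")
```

With the values from Table 4, the relative decrease is close to 58-59% for all four algorithms, while Random Forest and C4.5 retain the highest absolute accuracy (about 0.40).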
The behavior of the metrics under BT conditions is intuitively clear. Traffic classification with supervised machine learning assumes that the set of applications being classified coincides with the set of classes known to the classifier. The appearance of BT violates this condition: the number of applications becomes larger than the a priori assumed number of classes, and the algorithms mistakenly sort the extra traffic into the existing classes, which sharply reduces classification accuracy. There are several ways to solve the problem. One is to treat traffic classification as an unsupervised machine learning problem, which involves clustering the observed traffic.
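The effect described above can be reproduced on synthetic data. The sketch below is not the authors' experiment: it uses a deliberately simple nearest-centroid classifier on one made-up feature, trains it on two known classes, and then adds flows of an unseen application to the test set. Because a closed-set classifier must assign every flow to a known class, the background flows become false positives, so Recall of the known class is unchanged while Precision falls.

```python
def train_centroids(samples):
    """samples: list of (feature, label). Returns label -> mean feature."""
    sums, counts = {}, {}
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Closed-set decision: always returns the nearest known class."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def precision_recall(test, centroids, label):
    """Precision and Recall of one class over a labeled test set."""
    tp = fp = fn = 0
    for x, true_label in test:
        predicted = predict(centroids, x)
        if predicted == label and true_label == label:
            tp += 1
        elif predicted == label:
            fp += 1
        elif true_label == label:
            fn += 1
    return tp / (tp + fp), tp / (tp + fn)


# Synthetic one-dimensional feature (e.g. a mean packet size); values are made up.
train = [(100, "A"), (110, "A"), (120, "A"), (500, "B"), (510, "B"), (520, "B")]
test_known = [(105, "A"), (115, "A"), (505, "B"), (515, "B")]
background = [(130, "other"), (140, "other"), (150, "other"), (480, "other")]

centroids = train_centroids(train)
p_clean, r_clean = precision_recall(test_known, centroids, "A")
p_bt, r_bt = precision_recall(test_known + background, centroids, "A")
print(p_clean, r_clean)  # 1.0 1.0
print(p_bt, r_bt)        # 0.4 1.0
```

Three of the four background flows lie closer to the centroid of class "A", so they are counted as false positives for that class: Precision drops from 1.0 to 2/5 while Recall stays at 1.0.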
The second option is simpler and easier to implement. In addition to the a priori classes of interest, we introduce an additional class called "Unknown application". This class is trained and tested in the same way as above. The introduction of such a "virtual" class is expected to solve the problem.
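One way to realize this option is to relabel the training flows of all applications outside the set of interest with a single label before training, so that any standard multi-class algorithm learns the extra class. The sketch below shows only this relabeling step. The function name and the sample data are illustrative assumptions; the article does not describe how the authors prepared their training set.

```python
UNKNOWN = "Unknown application"


def add_unknown_class(samples, known_apps):
    """Map every label outside known_apps to the single UNKNOWN class.

    samples: iterable of (features, app_name) pairs.
    """
    return [
        (features, app if app in known_apps else UNKNOWN)
        for features, app in samples
    ]


# Toy samples; the feature vectors are made up for illustration.
samples = [
    ([120.0, 14], "Skype"),
    ([980.5, 52], "Instagram"),
    ([310.2, 8], "Booking"),
    ([455.0, 21], "4PDA"),
]
known_apps = {"Skype", "Instagram"}
relabeled = add_unknown_class(samples, known_apps)
labels = [label for _, label in relabeled]
print(labels)
# ['Skype', 'Instagram', 'Unknown application', 'Unknown application']
```

The relabeled samples can then be passed to the same training procedure as before, with one more class than the number of applications of interest.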
Classification results in the presence of background traffic and the class "Unknown application"
Let us consider the classification of five applications listed in Table 1, which we treat as informative; all remaining applications are treated as BT. We consider the results of classifying these five informative mobile applications in the presence of BT and the class "Unknown application".
Table 5 summarizes the "Recall" parameter, and Table 6 summarizes the "Precision" parameter for the analyzed algorithms for each class and given data sets.
Conclusions
It is shown that background traffic leads to a significant decrease in the quality of traffic classification of mobile applications. Only for individual applications (Sberbank online, Wolfram, Godville) were the losses less significant. The decrease in accuracy for the considered applications and classification algorithms can reach 40% on average.
Random Forest and C4.5 demonstrate the greatest resistance to the presence of BT in terms of the "Accuracy" parameter. Naive Bayes and SVM demonstrate the least resistance.
The introduction of the additional class "Unknown application" has a positive impact on the quality of classification with BT. The increase in accuracy for the considered applications and classification algorithms due to the introduction of such a class can reach 20% on average, at the cost of a certain increase in the number of false positive decisions.
References
1. "Runet Web sites" - Site statistics. Web: http://www.liveinternet.ru/stat/ru/oses.html?slice=rus;id=2;id=15;id=12;id=4;id=11;id=checked;period=month.
2. Sheluhin O.I., Erohin S.D., Vanyushina A.V. (2018). Klassifikaciya IP-trafika metodami mashinnogo obucheniya. Moscow: Goryachaya liniya - Telekom. 276 p.
3. Sheluhin O.I., Smychek M.A., Simonyan A.G. (2018). Fil'tratsiya nezhelatel'nykh prilozheniy internet resursov v tselyakh informatsionnoy bezopasnosti. H&ES Research. Vol. 10. No. 2, pp. 87-98.
4. Sheluhin O.I., Smychek M.A., Simonyan A.G. (2018). Fil'tratsiya nezhelatel'nykh prilozheniy trafika podvizhnoy radiosvyazi dlya obnaruzheniya ugroz informatsionnoy bezopasnosti. Radiotekhnicheskie i telekommunikatsionnye sistemy. No. 1, pp. 87-98.
5. Sheluhin O.I., Simonyan A.G., Vanyushina A.V. (2016). Effektivnost' algoritmov vydeleniya atributov v zadachakh klassifikatsii prilozheniy pri intellektual'nom analize trafika. Elektrosvyaz'. No. 11, pp. 79-85.
6. Sheluhin O.I., Simonyan A.G., Vanyushina A.V. (2017). Vliyanie struktury obuchayushchey vyborki na effektivnost' klassifikatsii prilozheniy trafika metodami mashinnogo obucheniya. T-Comm. Vol. 11. No. 2, pp. 25-31.
7. Sheluhin O.I., Simonyan A.G., Vanyushina A.V. (2017). Benchmark data formation and software analysis for classification of traffic applications using machine learning methods. T-Comm, vol. 11, no. 1, pp. 67-72.
8. Quinlan J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
9. Ho, Tin Kam. (1995). Random Decision Forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14-16 August 1995.
10. Cortes C., Vapnik V. (1995). Support-vector networks. Machine Learning.
11. Witten I.H., Frank E. (2005). Data mining: practical machine learning tools and techniques. 2nd ed. Morgan Kaufmann series in data management systems. San Francisco. 525 p. ISBN: 0-12-088407-0
12. En-Najjary T., Urvoy-Keller G., Pietrzyk M., Costeux J.-L. (2010). Application-based feature selection for internet traffic classification. In Teletraffic Congress (ITC), 2010, 22nd International, pp. 1-8.
13. Pietrzyk M., En-Najjary T., Urvoy-Keller G., Costeux J.-L. (2010). Hybrid traffic identification. Technical Report EURECOM+3075, Institut Eurecom, France, 04 2010.