Research article: 'Benchmark data formation and software analysis for classification of traffic applications using machine learning methods'


CC BY
Keywords
IP TRAFFIC CLASSIFICATION / SNIFFER / UTILITIES / TRACEDUMP / MACHINE LEARNING METHODS / PROTOCOLS FTP / HTTP / SSH / SKYPE / TRAFFIC FLOW FILTERING / CLASSIFICATION ATTRIBUTES

Abstract of the research article in computer and information sciences. Authors: Oleg Ivanovich Sheluhin, Ayrapet Genrikovich Simonyan, Anna Vyacheslavovna Vanyushina

The features of forming an experimental database from traffic captured with sniffers are considered. The capabilities and characteristics of several sniffers, such as tcpdump together with the libpcap library, Wireshark and libtrace, are compared. The features of the Gt, nDPI and tracedump utilities are reviewed. The advantages of tracedump as a packet sniffer that can intercept the data of a single application are shown. The architecture and implementation of software for forming benchmark data for traffic application classification, built on the tracedump console utility running under Linux, are described. The results of applying the proposed software are presented. For this purpose, a network was used that consisted of a local workstation running Windows 10 and a virtual workstation required to run tracedump. WEB (http, https), mail (smtp, imap), FTP (ftp-data, ftp-commands), SSH, Skype and P2P traffic was captured for classification. Based on a comparative analysis of the existing sniffers Wireshark and Libtrace and the utilities Gt and nDPI, the advantages of tracedump as a packet sniffer that captures data efficiently are shown. An architecture is proposed and software is implemented for forming benchmark data for classifying different applications. Using the implemented hardware and software complex, an experimental database was created for classifying WEB (http, https), mail (smtp, imap), FTP (ftp-data, ftp-commands), SSH, Skype and P2P application traffic. For machine learning to work correctly, the number of attributes was optimized with the InfoGain algorithm. As a result, the total number of attributes was reduced from 21 to 14.



BENCHMARK DATA FORMATION AND SOFTWARE ANALYSIS FOR CLASSIFICATION OF TRAFFIC APPLICATIONS USING MACHINE LEARNING METHODS

Oleg I. Sheluhin,

Head of department of Information security, professor, D.Sc, MTUCI, Moscow, Russia, sheluhin@mail.ru

Ayrapet G. Simonyan,

Assistant professor, Candidate of Technical Sciences, MTUCI, Moscow, Russia, blackman-05@mail.ru

Anna V. Vanyushina,

Senior lecturer MTUCI, Moscow, Russia, vanuanna@rambler.ru

Keywords: classification of IP traffic, sniffer, utilities, tracedump, machine learning methods, FTP, HTTP, SSH and Skype protocols, traffic flow filtering, classification attributes.

The features of forming an experimental database from traffic captured with sniffers have been examined. A comparative analysis of the features and characteristics of different sniffers, such as tcpdump together with the libpcap library, Wireshark and libtrace, has been carried out. The features of the Gt, nDPI and tracedump software utilities have been reviewed. The advantages of tracedump as a packet sniffer that can intercept data from a single application have been shown. The architecture and implementation of software for benchmark data formation for traffic application classification, based on the tracedump console utility under Linux, have been reviewed.

The results of applying this software have been provided. For this purpose, a network consisting of a local workstation running Windows 10 and a virtual workstation needed to run tracedump was used. For classification purposes, WEB (http, https), mail (smtp, imap), FTP (ftp-data, ftp-commands), SSH, Skype and P2P traffic was intercepted. Based on a comparative analysis of the existing sniffers Wireshark and Libtrace and the utilities Gt and nDPI, the advantages of tracedump as a packet sniffer that captures data efficiently have been shown. An architecture has been proposed and software has been implemented for benchmark data formation for the classification of different applications.

Based on the implemented hardware and software, an experimental database was built for classifying WEB (http, https), mail (smtp, imap), FTP (ftp-data, ftp-commands), SSH, Skype and P2P traffic. For the machine learning process to work properly, the number of attributes was optimized using the InfoGain algorithm. As a result, the total number of attributes was reduced from 21 to 14.

For citation (Russian original):

Шелухин О.И., Симонян А.Г., Ванюшина А.В. Формирование исходных данных и анализ программного обеспечения для классификации приложений трафика методом машинного обучения // T-Comm: Телекоммуникации и транспорт. 2017. Том 11. №1. С. 67-72.

For citation:

Sheluhin O.I., Simonyan A.G., Vanyushina A.V. (2017). Benchmark data formation and software analysis for classification of traffic applications using machine learning methods. T-Comm, vol. 11, no. 1, pp. 67-72.

Problem statement

Traffic classification is used in a great number of operations. It is usually performed according to a set of predetermined characteristics. The traffic is divided either into broad classes (Chat, Interactive, VoIP, etc.) or into fine-grained classes by protocol (FTP, HTTP, SSH, etc.). Once classification is finished, all traffic should be assigned to the appropriate classes. The most commonly used classification methods, which rely on well-known port numbers and on inspection of packet payloads, have limited applicability. To overcome this limitation, statistical methods of network traffic pattern recognition are used.

The statistical approach relies on the statistical characteristics of traffic to identify the application [1]. These methods assume that network traffic has statistical characteristics that are unique to certain classes of applications and make it possible to distinguish different applications [2].

One of the main problems in developing traffic classification systems based on machine learning algorithms is forming the benchmark data and choosing the software used for this. Benchmark data are traffic samples labeled with the applications that generated them. At the moment there is no benchmark dataset that could be called a standard in traffic classification, and there is no universal approach to capturing one. At the same time, the accuracy of machine learning algorithms depends directly on the size, quality and representativeness of the benchmark dataset used for training. That is why obtaining a fitting benchmark dataset is an important goal.

In connection with the above, the aim of the article is to compare and analyze existing software and to choose the structure of the software and hardware for capturing and classifying traffic, based on optimizing the number of classification attributes.

Traffic capture methods

Let us examine how benchmark data are obtained from captured traffic using a sniffer, that is, special software or a hardware and software system used for traffic interception. Besides the basic functionality, many sniffers offer features such as traffic flow filtering, TCP session recovery, etc. A sniffer is one of the most significant utilities for diagnosing and maintaining a network. The information obtained at the sniffer's output gives an idea of how the network functions. For example, to diagnose routing problems, IP packets on the router's input and output network interfaces have to be analyzed. If network traffic is detected on the input interface but not on the output interface, routing is misconfigured.

Traffic can be intercepted using the following methods:

• Network interface listening. This method is effective when hubs are used instead of switches; otherwise the interception is only partial and captures only particular frames.

• Redirecting the traffic to the sniffer using software or hardware (network tap [3]).

• Placing the sniffer directly into the channel.

• Capturing and analyzing side electromagnetic radiation generated by the network equipment, with subsequent recovery of the initial traffic.

• Executing a MAC-spoofing or IP-spoofing attack that redirects the traffic of the victim and/or of the whole segment to the sniffer, which then returns it to its original path.

There are many approaches to obtaining benchmark data, which can be conditionally divided as follows:

• Using a sniffer (Wireshark, for example) [5]. The idea behind this approach is to run only one application at a time so that the intercepted packets belong to the running application. This method has proven extremely unreliable and slow, because operating systems generally run background processes that use the network. The traffic generated by these processes can be mistakenly associated with the running application, which subsequently affects the traffic classification process.

• Using deep packet inspection (DPI) software to get fitting benchmark data [6]. The following utilities and libraries can be used for this: L7-filter, PACE, nDPI, OpenDPI, etc. This method is far more convenient than the previous one because it allows the data to be inspected anywhere in the network. However, the existing DPI tools cannot classify the traffic of some applications precisely (Skype, for example). The other disadvantage of this method is that it is slow and uses a lot of resources.

• Using special utilities (tracedump [7], gt [8]) that capture the traffic generated by one application only. This method is a modification of the first method.

Analysis of software used to form the benchmark data

Tcpdump, together with the libpcap [9] library, is one of the most popular packet sniffers. It runs on most UNIX-like operating systems (for example, Linux, BSD and Solaris) and is also available for Windows. Having been used in numerous studies involving computer networks, tcpdump is practically a gold standard utility for capturing traffic. The PCAP format, which is the most commonly used file format for capturing and storing packets, was established owing to tcpdump [10]. Tcpdump provides a filtering mechanism that captures only packets that fit certain criteria, for example TCP packets with destination port 80. Unfortunately, this mechanism cannot capture the packets generated by one single application, especially if the running process is a peer-to-peer application that dynamically reassigns ports every few seconds.

Another popular sniffer is Wireshark [5], which has a user-friendly graphical interface and offers many advanced settings and features. The Libtrace [11] sniffer is designed to eliminate the disadvantages of the libpcap library. It also supports numerous input and output methods and formats and ensures high performance. Nevertheless, none of the programs listed above can capture the traffic generated by one single application.

Gt utility. Two software utilities are used specifically for traffic classification. The Gt [8] utility is a system that captures traffic and automatically associates application flows with the applications that generated them. The program works as follows. At the first stage, every host inside a particular network sends the list of its own connections, together with the names of the applications that initiated them, to a border router. At the second stage, the selected router captures all incoming and outgoing traffic of the network and classifies the traffic flows using the list obtained at the first stage, which yields the benchmark data. A similar approach has also been proposed for machines running the Windows operating system. However, this utility does not solve the problem of forming proper benchmark data for machine learning either, because its output requires additional processing and the traffic of one single application cannot be captured in real time. Besides, such software tends to drop a certain number of packets at the start of connection establishment, because it takes time for the list of open network connections to form and update.
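The two-stage idea described above can be illustrated with a minimal sketch. It is not Gt's actual code: the data structures (`host_reports` as a dictionary keyed by 5-tuple, the `"unlabeled"` marker) and the function names are assumptions made for illustration only.

```python
def reverse(flow):
    """Swap the endpoints of a 5-tuple (src_ip, src_port, dst_ip, dst_port, protocol)."""
    src_ip, src_port, dst_ip, dst_port, protocol = flow
    return (dst_ip, dst_port, src_ip, src_port, protocol)


def label_flows(captured_flows, host_reports):
    """Attach an application name to every flow captured at the router.

    host_reports is a hypothetical structure: it maps a 5-tuple reported
    by a host to the name of the application that opened the connection.
    """
    labeled = {}
    for flow in captured_flows:
        # A flow may be observed in either direction, so try both keys.
        app = host_reports.get(flow) or host_reports.get(reverse(flow))
        # Flows seen before the host report arrives cannot be labeled,
        # which mirrors the loss at connection start noted in the text.
        labeled[flow] = app if app is not None else "unlabeled"
    return labeled
```

A flow that appears at the router before the host has reported it stays unlabeled, which is the weakness the paragraph above points out.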

nDPI utility. This utility is a deep packet inspection classifier derived from OpenDPI. Since the OpenDPI project is officially closed, nDPI is its one and only alternative [6]. Like its predecessor, nDPI uses a two-level structure. The first level controls packet processing and decodes layers 3-4 to extract basic packet information such as IP addresses and active ports. The second level holds the processing modules. At the moment there are more than 170 of them, which allows a corresponding number of applications to be detected [13]. Unlike most deep packet inspection classifiers, nDPI uses several methods: apart from signature analysis, behavioral and statistical methods are used. Deep packet inspection cannot be applied to the analysis of encrypted traffic, because the only information transmitted in the clear is sent during connection establishment and key exchange. nDPI uses an SSL decoder to identify the server host name, which can indicate the type of traffic being transmitted. For example, twitter.com will get the Web(Twitter) class. The disadvantage of nDPI is low performance. Although the processing modules are written in C, every single packet is processed by every single module, regardless of whether a match has been found [14].
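The host-name step can be sketched as a suffix lookup. This is only an illustration of the idea, not nDPI's implementation: the table `HOST_CLASSES`, the function `classify_host` and the `"Unknown"` fallback are hypothetical, and only the twitter.com example comes from the text.

```python
# Hypothetical lookup table; only the twitter.com entry is taken from the text.
HOST_CLASSES = {"twitter.com": "Web(Twitter)"}


def classify_host(server_name, table=HOST_CLASSES):
    """Return the class of the longest known domain suffix of server_name."""
    name = server_name.lower().rstrip(".")
    labels = name.split(".")
    # Try "api.twitter.com", then "twitter.com", then "com".
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in table:
            return table[suffix]
    return "Unknown"
```

Matching whole labels rather than raw string suffixes keeps a name such as nottwitter.com from being assigned to the Twitter class.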

Tracedump utility. When a sniffer runs on a particular machine, it is hard to single out only the packets generated by a particular program, i.e. sent and received by one process. This is because the most popular sniffers were designed for use on an Internet router, which does not generate its own traffic. A typical packet sniffer can intercept packets on a particular network interface, but this is not enough to read the packets of a particular process running on the local machine. The reason is that such a feature is not implemented in the operating system's kernel. For instance, the Linux kernel lacks the ability to intercept packets generated by a particular process. Tracedump is a sniffer designed to solve this problem: it intercepts the packets generated by one particular application. To achieve this, it uses several mechanisms that compensate for what the Linux kernel lacks, such as the ptrace(2) system call [15] and the BPF (Berkeley Packet Filter) socket filter [16].

Thus, tracedump is a packet sniffer that intercepts the data of one single application. The results are additionally filtered to get rid of background system packets. Thanks to its architecture and the way it operates, the program yields accurately labeled data that can be used to train learning algorithms and to analyze the results with machine learning methods.

The architecture and software implementation for the benchmark data formation

Tracedump is a console utility that runs on 32-bit Linux-based operating systems. Since tracedump is open-source software [17], there are several forks of the initial program that offer features not included in the original application. These modifications are designed and supported by independent developers. We now examine the fork called tracedump64 [18], which runs on 64-bit systems and is more reliable than the original program.

To examine the architecture of this program, the order in which a TCP connection is opened and used for data transfer in Linux should be analyzed. The operating system provides an API for Internet connections through the system calls socket, connect and send.

First, the application uses a socket function to initiate a unique connection. Then, a connect function with a remote host's address is called and after that, a system call send can be used for the data transmission.
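The call sequence just described can be tried out with Python's standard `socket` module, which wraps the same system calls. The sketch below is self-contained and uses a throwaway loopback server; the helper names (`serve_once`, `send_message`, `demo`) are illustrative assumptions, not part of tracedump.

```python
import socket
import threading


def serve_once(server_sock, received):
    """Accept one connection and collect everything the client sends."""
    conn, _addr = server_sock.accept()
    chunks = []
    with conn:
        while True:
            chunk = conn.recv(1024)
            if not chunk:  # empty read: the client closed the connection
                break
            chunks.append(chunk)
    received.append(b"".join(chunks))


def send_message(host, port, payload):
    """Client side: socket(), then connect(), then send()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # socket()
    try:
        sock.connect((host, port))          # connect(): TCP handshake happens here
        local_port = sock.getsockname()[1]  # local port chosen by the kernel
        sock.sendall(payload)               # send(): the kernel adds TCP/IP headers
    finally:
        sock.close()
    return local_port


def demo():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(5)
    server.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    server.listen(1)
    received = []
    worker = threading.Thread(target=serve_once, args=(server, received))
    worker.start()
    local_port = send_message("127.0.0.1", server.getsockname()[1], b"hello")
    worker.join()
    server.close()
    return received[0], local_port
```

Note that the application passes only the payload to `sendall`; the headers and the local port are supplied by the kernel, which is why a per-application sniffer cannot rely on system call arguments alone.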

Two things should be kept in mind when designing a packet sniffer for one application.

Firstly, (A) — the application is not involved in headers forming process on network and transport layers - it should be done by the operating system. Therefore, intercepting the data, transmitted as arguments in system call, is not enough.

Secondly, (B) - system call connect, designed for establishing the connection, will generate a couple of packets before the connection is actually established. As a result, sniffer must begin the packets capturing process before the call is processed in the operating system's kernel.

Tracedump is divided into three functional modules implemented as threads: ptrace, pcap and GC (garbage collector). The ptrace module attaches to all threads of the chosen process and, using the ptrace function, builds a list of all local TCP and UDP ports opened by the application. The pcap module operates as a packet sniffer that intercepts all packets from all network interfaces at the kernel level, which solves problem (A) mentioned above. Every time the port list changes, a BPF filter is applied to the pcap capturing module so that all packets that do not belong to the application under examination are ignored. The BPF filter is updated before the operating system's kernel has processed the system call, which solves problem (B). The garbage collector's task is to detect ports that are no longer in use. Every minute it builds a list of all open system connections and updates the list maintained by the ptrace module. The program's architecture is presented in picture 1.

Picture 1. Tracedump architecture

The port list is built with the help of the operating system's kernel and is used to capture the packets. The garbage collector periodically cleans up the list.
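The interplay between the port list, the filter and the garbage collector can be sketched as follows. This is a simplified model under stated assumptions: the class `PortTracker` and its methods are hypothetical, and the filter is emitted as a pcap-style text expression, whereas the source does not specify how tracedump represents its BPF filter.

```python
class PortTracker:
    """Keeps the local ports of the traced application and derives a filter."""

    def __init__(self):
        self.tcp_ports = set()
        self.udp_ports = set()

    def add(self, protocol, port):
        """Called when the traced process opens a port; returns the new filter."""
        ports = self.tcp_ports if protocol == "tcp" else self.udp_ports
        ports.add(port)
        return self.filter_expression()

    def collect_garbage(self, open_tcp, open_udp):
        """Drop ports that no longer appear among the open connections."""
        self.tcp_ports &= set(open_tcp)
        self.udp_ports &= set(open_udp)
        return self.filter_expression()

    def filter_expression(self):
        """Build a pcap-style expression; None means nothing should be captured."""
        parts = [f"(tcp port {p})" for p in sorted(self.tcp_ports)]
        parts += [f"(udp port {p})" for p in sorted(self.udp_ports)]
        return " or ".join(parts) if parts else None
```

Regenerating the whole expression on every change keeps the sketch simple; returning `None` for an empty list avoids an empty filter string, which in pcap syntax would match every packet.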

The ptrace module traces only three system calls: bind, connect and sendto. The results of the module's operation are presented in picture 2. Analysis of the Linux kernel source code showed that the suggested architecture is sufficient to guarantee that not a single packet is lost. For UDP and TCP servers, the application should use bind to set the local port number. Client-side programs call connect and sendto. In that case the local port may not yet be assigned, and the kernel assigns it automatically. Because of limitation (B), this situation is undesirable, so tracedump interrupts the system call and forces bind to be called first. This is implemented by injecting machine code into the process.

vlads@ubuntu:~/Desktop$ sudo tracedump ctorrent ubuntu-15.10-server-i386.iso.torrent
pcap_init(): writing packets to dump.pcap
ptrace_attach_child(): attached to PID 4599 (ctorrent)
META INFO
Announce: http://torrent.ubuntu.com:6969/announce
Alternates:
1. http://ipv6.torrent.ubuntu.com:6969/announce
Created On: Thu Oct 22 02:36:57 2015
Piece length: 524288
Comment: Ubuntu CD releases.ubuntu.com
FILES INFO
<1> ubuntu-15.10-server-i386.iso [658505728]
Total: 628 MB
Creating file "ubuntu-15.10-server-i386.iso"
Listening on 0.0.0.0:2706

Picture 2. Tracedump output while a ctorrent client downloads an Ubuntu image over a P2P network, shown as an example. All the captured packets belong to the ctorrent client
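The effect of forcing bind before connect, discussed above, can be observed from user space. The sketch below does not inject code into another process as tracedump does; it only shows, with hypothetical helper names and a loopback listener, that an explicit bind makes the local port known before the handshake starts.

```python
import socket


def connect_with_known_port(remote_addr):
    """Bind explicitly so the local port is known before connect() is issued."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))            # the kernel assigns a free local port now
    port_before = sock.getsockname()[1]    # a filter could be updated at this point
    sock.connect(remote_addr)              # handshake packets already use that port
    port_after = sock.getsockname()[1]
    return sock, port_before, port_after


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    try:
        sock, before, after = connect_with_known_port(listener.getsockname())
        sock.close()
    finally:
        listener.close()
    return before, after
```

Because the port is fixed by bind, a capture filter can include it before the first handshake packet leaves the host, which is the requirement stated as (B).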

The results of software implementation

To get the required benchmark data, the network presented in picture 3 was used. It consists of one computer running Windows 10 (the local workstation) that hosts a virtual machine with the guest operating system Ubuntu 16.04 (the virtual workstation), which is needed to run tracedump. A PostgreSQL database server was additionally deployed on the local workstation so that the traffic classification system can later operate both in real time and offline, i.e. when classifying traffic that has already been captured and saved in .PCAP files. The connection to the Internet was established through a router.

Picture 3. Network topology in which the data preparation and analysis were done

The following traffic was intercepted:

• An FTP server was used to capture the traffic of the FTP-data and FTP-command protocols. During downloading, passive mode was used, with port 21 for commands and port 20 for data transmission. The traffic of these protocols was combined into the FTP class;

• A Web server was used to capture HTTP and HTTPS traffic, which usually uses ports 80 and 443. The dump was assigned the WEB class;

• A mail server was used to capture SMTP and IMAP traffic. For SMTP and IMAP, different mail services and mail clients were used, because implementations of the protocol interaction may differ within the standard [IMAP, SMTP]. Assigned class: MAIL;

• A management server allows control of an array or a cluster of physical or virtual servers. The interaction with the server was carried out over the SSH protocol. It was intercepted by capturing the traffic of a terminal application on the virtual workstation. Assigned class: SSH;

• A remote workstation was used to capture Skype traffic. Skype uses both TCP (for establishing a secure connection) and UDP (for audio and video calls) as transport protocols. If the other side is not trusted, the latest versions of Skype allow all connections to be forced over TCP. The traffic was captured in all modes during both audio and video calls. Assigned class: Skype.

Traffic dumps created after the measurements are listed in table 1.

Table 1

Number of traffic streams sorted by application classes

Application class               Number of streams
WEB (HTTP, HTTPS)               5054
MAIL (SMTP, IMAP)               3360
FTP (FTP-DATA, FTP-COMMANDS)    3470
SSH                             5824
SKYPE                           3737
Peer-to-Peer (P2P)              6992
Total number of streams         28437
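As a quick arithmetic check of table 1, the per-class counts can be summed and turned into class shares, which is useful when judging how balanced the dataset is. The dictionary keys are shortened class names chosen for this sketch.

```python
# Stream counts copied from table 1.
STREAMS = {
    "WEB": 5054,
    "MAIL": 3360,
    "FTP": 3470,
    "SSH": 5824,
    "SKYPE": 3737,
    "P2P": 6992,
}


def class_shares(counts):
    """Return the fraction of all streams that belongs to each class."""
    total = sum(counts.values())
    return {name: qty / total for name, qty in counts.items()}


shares = class_shares(STREAMS)
largest = max(shares, key=shares.get)
smallest = min(shares, key=shares.get)
```

The counts add up to the stated total of 28437 streams; P2P is the largest class (roughly 24.6% of all streams) and MAIL the smallest (roughly 11.8%).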

Types of classification attributes.

The benchmark data for the research are sequences of IP packets gathered at the observation points. The unit of observation is a stream. A stream can be a bidirectional or unidirectional sequence of packets between two IP addresses, a full TCP session, or a unidirectional sequence of IP packets defined by five header fields <src_ip, src_port, dst_ip, dst_port, protocol> and by rules that mark the end of the stream (usually a timeout or an "END" flag in the packet header). Here, src_ip is the source IP address; src_port is the source port; dst_ip is the destination IP address; dst_port is the destination port; protocol is the transport protocol. TCP and UDP are usually used as transport layer protocols. A set of attributes was fixed, based on statistical characteristics, such as packet size or inter-packet intervals, and on characteristics extracted from packet headers, such as TCP segment size or the number of retransmissions. Each stream was assigned a set of attribute values that were used in the classification process.
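Grouping packets into streams by the five header fields can be sketched as below. The packet representation (a dictionary with a `length` field) and the choice to treat the direction of the first packet as "forward" are assumptions of this sketch; the attribute names follow the ones used in this article.

```python
def flow_key(pkt):
    """Direction-independent key: both directions map to the same stream."""
    a = (pkt["src_ip"], pkt["src_port"])
    b = (pkt["dst_ip"], pkt["dst_port"])
    first, second = sorted([a, b])
    return (first, second, pkt["protocol"])


def aggregate(packets):
    """Group packets into bidirectional streams and count packets and bytes."""
    streams = {}
    for pkt in packets:
        key = flow_key(pkt)
        src = (pkt["src_ip"], pkt["src_port"])
        if key not in streams:
            # Assumption: the sender of the first packet defines "forward".
            streams[key] = {"initiator": src,
                            "fw_pkts_qty": 0, "fw_pkts_bytes": 0,
                            "rev_pkts_qty": 0, "rev_pkts_bytes": 0}
        s = streams[key]
        direction = "fw" if src == s["initiator"] else "rev"
        s[direction + "_pkts_qty"] += 1
        s[direction + "_pkts_bytes"] += pkt["length"]
    for s in streams.values():
        s["tot_pkts_qty"] = s["fw_pkts_qty"] + s["rev_pkts_qty"]
        s["tot_pkts_bytes"] = s["fw_pkts_bytes"] + s["rev_pkts_bytes"]
        s["is_reversable"] = s["rev_pkts_qty"] > 0
    return streams
```

Stream termination by timeout or flag is left out to keep the sketch short; a fuller version would close a stream and start a new one under the same key when the ending rule fires.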

For the proper implementation of machine learning, a couple of problems should be solved, which are: a problem of choosing optimal attributes and shortening the number of them, as well as examining the ways of implementing the clustering methods for distinguishing individual groups of protocols. The parameters of the network protocols of transport and network layers were chosen to be the benchmark attributes and are listed in table 2.

During the initial choice of attributes, it was decided to use the maximum number of attributes whose values depend in one way or another on the type of application that generated them. However, research has shown that it is unnecessary to use all of them in the classification model, because the benefit from some of them is minor. To select attributes, a ranking method based on InfoGain was used [19].
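For reference, the information gain of a discrete attribute is the class entropy minus the class entropy that remains after splitting the data by the attribute's values. The sketch below computes it in plain Python and ranks attributes by it; it is a textbook formulation, not the authors' tooling, and it assumes that continuous attributes have been discretized beforehand.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def info_gain(values, labels):
    """Class entropy minus the entropy left after splitting by attribute value."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_attributes(table, labels):
    """Return (attribute, gain) pairs sorted from the best to the worst."""
    gains = {name: info_gain(column, labels) for name, column in table.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)
```

On a toy sample where dst_port separates the classes perfectly and another attribute is constant, the first gets a gain equal to the class entropy and the second a gain of zero, which matches the intuition behind dropping low-ranked attributes.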

Table 2

Initial classification attributes

Attribute             Description
classname             Name of the protocol class (WEB, MAIL, FTP, etc.) of the classified traffic; used when the classifier model is created.
tot_pkts_qty          Total number of packets in the current stream in both directions.
tot_pkts_bytes        Total size in bytes of all packets in the current stream in both directions.
rev_pkts_qty          Number of stream packets in the reverse direction if the stream is bilateral.
rev_pkts_bytes        Size in bytes of all stream packets in the reverse direction.
fw_pkts_qty           Number of stream packets in the forward direction.
fw_pkts_bytes         Size in bytes of all stream packets in the forward direction.
is_reversable         Bool variable indicating whether the current stream is bilateral.
transport_protocol    Transport layer protocol (TCP or UDP).
src_port              Transport layer source port (for TCP and UDP).
dst_port              Destination port.
wirelen               Initial length of all stream packets in the physical channel, divided by the total number of packets.
header_count          Number of all headers of all packets, divided by the number of packets.
tcp_syn               Percentage of packets with the TCP SYN flag. If UDP is used, the value is GAP: the distance in bytes between the header and the packet's payload, divided by the number of packets.
tcp_ack               Percentage of packets with the TCP ACK flag. For UDP, GAP OFFSET is used: the distance from the packet's start to the end of the headers, divided by the number of packets.
flags                 Average number of TCP flags. For UDP, the average number of headers is used.
payload_length        Average size of the transport layer protocol payload in the stream.
tcp_flow_dir          Bool variable defining the stream's direction: 1 if the stream is outgoing, -1 if it is incoming.
is_fragment           Percentage of fragmented streams.
h_len                 Number of IP protocol headers.
payload_offset        Average distance from the packet's start to the payload.

Picture 4. The results of InfoGain algorithm implementation

The results of applying the InfoGain algorithm with 10-fold cross-validation are presented in picture 4. Attributes are sorted from the worst to the best.
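A 10-fold split of the kind mentioned above can be generated as follows. The sketch only shows how sample indices are divided into folds; how the per-fold rankings were combined in the study is not stated in the text, so that step is left out. The function name and the round-robin assignment are assumptions.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    # Round-robin assignment: sample i goes to fold i % k.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

In practice the samples would be shuffled (and often stratified by class) before being assigned to folds, so that each fold contains streams of every application class.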

As seen from the results, the source and destination port attributes provide a large information gain and should be used for traffic classification with statistical methods.

After the 14th rank, the information gain drops. Analysis of the values and their variation range showed that the attributes tcp_syn, tcp_flow_dir and is_reversable provide only a minor information gain and should not be used further.

Conclusions

Based on the comparative analysis of the existing sniffers Wireshark and Libtrace and the utilities Gt and nDPI, the advantages of the tracedump utility as a packet sniffer that intercepts data effectively have been demonstrated. An architecture has been proposed and software has been implemented for forming initial data for the classification of different applications.

An experimental database for application traffic classification (WEB (http, https); mail (smtp, imap); FTP (ftp-data, ftp-commands); SSH; Skype; P2P) has been created using the software and hardware complex. The number of classification attributes has been optimized with the InfoGain algorithm, which reduced the total number of attributes from 21 to 14.

References

1. Karagiannis, T., Papagiannaki, K. and Faloutsos, M. (2005), BLINC: Multilevel traffic classification in the dark. In Proceedings of ACM SIGCOMM, Philadelphia, PA.

2. Sherbakova, N.G. (2012), "IP-traffic analysis using Data Mining methods". Problemy informatiki, no. 4, pp. 30-46.

3. Netresec - Intercepting Network Traffic Using Network Tap, available at: http://www.netresec.com/?page=Blog&month=2011-03&post=Sniffing-Tutorial-part-1---Intercepting-Network-Traffic.

4. ARP-spoofing data link layer attack and how to protect a Cisco switch (in Russian), available at: https://habrahabr.ru/post/192022/.

5. Wireshark, available at: https://www.wireshark.org/.

6. nDPI: Open-Source High-Speed Deep Packet Inspection, available at: http://luca.ntop.org/nDPI.pdf.

7. Foremski, P. (2012), "Tracedump: A Novel Single Application IP Packet Sniffer", Theoretical and Applied Informatics, vol. 24, no. 1, pp. 23-31.

8. Gringoli, F., Salgarelli, L., Dusi, M., Cascarano, N., Risso, F. and Claffy, K. (2009), "Gt: Picking up the truth from the ground for internet traffic", ACM SIGCOMM Computer Communication Review, vol. 39, no. 5, pp. 13-18.

9. Tcpdump, available at: http://www.tcpdump.org.

10. Degioanni, L., Risso, F. and Varenni, G. (2004), PCAP Next Generation Dump File Format. IETF, Internet-Draft PCAP-DumpFileFormat. http://www.ietf.org/ietf/1id-abstracts.txt.

11. Alcock, S., Lorier, P. and Nelson, R. (2012), Libtrace: A Packet Capture and Analysis Library. ACM SIGCOMM Computer Communication Review, 42(2), pp. 42-48.

12. Szabo, G., Orincsay, D., Malomsoky, S. and Szabo, I. (2008), On the validation of traffic classification algorithms. Proceedings of PAM'08, Springer-Verlag.

13. De Sensi, D., Danelutto, M. and Deri, L. (2012), Dpi over commodity hardware: implementation of a scalable framework using fast-flow. Master's thesis, Universita di Pisa, Italy.

14. Bujlow, T. (2014), "Independent Comparison of Popular DPI Tools for Traffic Classification", available at: http://tomasz.bujlow.com/publications/2014journal_elsevier_comnet_independent_comparison.pdf.

15. ptrace(2), available at: http://www.kernel.org/doc/man-pages/online/pages/man2/ptrace.2.html.

16. McCanne, S. and Jacobson, V. (1993), "The BSD packet filter: a new architecture for user-level packet capture", USENIX Winter 1993 Conference Proceedings (USENIX'93). January 25-29, 1993, San Diego, CA.

17. GitHub tracedump, available at: https://github.com/iitis/tracedump.

18. GitHub tracedump64, available at: https://github.com/crunchiness/Tracedump64.

19. Andrew W. Moore, Information Gain tutorial, available at: http://www.autonlab.org/tutorials/infogain11.pdf.

ИНФОРМАТИКА

ФОРМИРОВАНИЕ ИСХОДНЫХ ДАННЫХ И АНАЛИЗ ПРОГРАММНОГО ОБЕСПЕЧЕНИЯ ДЛЯ КЛАССИФИКАЦИИ ПРИЛОЖЕНИЙ ТРАФИКА МЕТОДОМ МАШИННОГО ОБУЧЕНИЯ

Oleg I. Sheluhin, Moscow Technical University of Communications and Informatics, Head of the Department of Information Security, Professor, Doctor of Technical Sciences, Moscow, Russia, sheluhin@mail.ru

Airapet G. Simonyan, Moscow Technical University of Communications and Informatics, Associate Professor of the Department of Information Security, Doctor of Technical Sciences, Moscow, Russia, blackman-05@mail.ru

Anna V. Vanyushina, Moscow Technical University of Communications and Informatics, Senior Lecturer of the Department of Information Security, Moscow, Russia, vanuanna@rambler.ru

Abstract. The paper considers how an experimental database is formed from traffic captured with sniffers. A comparative analysis is made of the capabilities and characteristics of sniffers such as tcpdump (used together with the libpcap library), Wireshark and libtrace. The features of the Gt, nDPI and tracedump utilities are examined. The advantages of tracedump as a packet sniffer that captures the data of a single application are shown. The architecture and implementation of software for forming source data in traffic application classification tasks are described; the software is built on the tracedump console utility, which runs under Linux. Results of applying the proposed software are presented. For this purpose, a network was used that consisted of a local workstation running Windows 10 and a virtual workstation required to run tracedump. For classification purposes, the following traffic was captured: WEB (http, https), mail (smtp, imap), FTP (Ftp-data, Ftp-commands), SSH, Skype and P2P. The comparison with the existing sniffers Wireshark and Libtrace and the utilities Gt and nDPI shows that tracedump is an effective packet sniffer for this task. An architecture is proposed and software is implemented for forming source data in application classification tasks. On the basis of the implemented hardware and software complex, an experimental database was created for classifying the traffic types listed above. To implement machine learning correctly, the number of attributes was optimized using the InfoGain algorithm. As a result, the number of attributes was reduced from 21 to 14.

Keywords: IP traffic classification, sniffer, utilities, tracedump, machine learning methods, FTP, HTTP, SSH and Skype protocols, traffic flow filtering, classification attributes.

References

1. Karagiannis T., Papagiannaki K., Faloutsos M. Blinc: Multilevel traffic classification in the dark // Proceedings of ACM SIGCOMM, Philadelphia, PA, August 2005.

2. Shcherbakova N.G. Analysis of IP traffic by Data Mining methods // Problemy Informatiki, 2012, no. 4, pp. 30-46.

3. Netresec - Intercepting Network Traffic Using Network Tap // Web: http://www.netresec.com/?page=Blog&month=2011-03&post=Sniffing-Tutorial-part-1---Intercepting-Network-Traffic.

4. The ARP spoofing data link layer attack and how to protect a Cisco switch // Web: https://habrahabr.ru/post/192022/.

5. Wireshark // Web: https://www.wireshark.org.

6. nDPI: Open-Source High-Speed Deep Packet Inspection // Web: http://luca.ntop.org/nDPI.pdf.

7. Foremski P. Tracedump: A Novel Single Application IP Packet Sniffer // Theoretical and Applied Informatics, 2012, vol. 24, no. 1, pp. 23-31.

8. Gringoli F., Salgarelli L., Dusi M., Cascarano N., Risso F., Claffy K. Gt: Picking up the truth from the ground for internet traffic // ACM SIGCOMM Computer Communication Review, 2009, vol. 39, no. 5, pp. 13-18.

9. Tcpdump // Web: http://www.tcpdump.org/.

10. Degioanni L., Risso F., Varenni G. PCAP Next Generation Dump File Format, IETF, Internet-Draft PCAP-DumpFileFormat, 2004.

11. Alcock S., Lorier P., Nelson R. Libtrace: A Packet Capture and Analysis Library // ACM SIGCOMM Computer Communication Review, 2012, vol. 42, no. 2, pp. 42-48.

12. Szabo G., Orincsay D., Malomsoky S., Szabo I. On the validation of traffic classification algorithms, Proceedings of PAM'08, Springer-Verlag, 2008.

13. De Sensi D., Danelutto M., Deri L. Dpi over commodity hardware: implementation of a scalable framework using fastflow, Master's thesis, Università di Pisa, Italy, 2012.

14. Bujlow T. Independent Comparison of Popular DPI Tools for Traffic Classification // Web: http://tomasz.bujlow.com/publications/2014_journal_elsevier_comnet_independent_comparison.pdf.

15. ptrace(2) // Web: http://www.kernel.org/doc/man-pages/online/pages/man2/ptrace.2.html.

16. McCanne S., Jacobson V. The BSD packet filter: a new architecture for user-level packet capture, USENIX Winter 1993 Conference Proceedings (USENIX'93), 1993.

17. GitHub tracedump // Web: https://github.com/iitis/tracedump.

18. Github tracedump64 // Web: https://github.com/crunchiness/Tracedump64.

19. Moore A.W. Information Gain tutorial // Web: http://www.autonlab.org/tutorials/infogain11.pdf.
