Scientific article on the topic 'THE USE OF ARTIFICIAL INTELLIGENCE IN SYSTEMS FOR IN-DEPTH ANALYSIS OF NETWORK TRAFFIC'

THE USE OF ARTIFICIAL INTELLIGENCE IN SYSTEMS FOR IN-DEPTH ANALYSIS OF NETWORK TRAFFIC. Text of a scientific article in the specialty "Computer and Information Sciences"

CC BY
Keywords
deep traffic analysis / computer network / traffic encryption / VPN / neural traffic analysis / random trees committee

Abstract of the scientific article in computer and information sciences; author of the scientific work: Kononov V.V.

The relevance of this research stems from the need to improve network traffic analysis systems, including deep analysis systems, using machine learning methods and algorithms while taking into account the existing threats and vulnerabilities of the network equipment and software of computer networks.



UDC 004.418

Kononov V.V.

Lipetsk State Pedagogical University (Lipetsk, Russia)

THE USE OF ARTIFICIAL INTELLIGENCE IN SYSTEMS FOR IN-DEPTH ANALYSIS OF NETWORK TRAFFIC

Abstract: The relevance of this research stems from the need to improve network traffic analysis systems, including deep analysis systems, using machine learning methods and algorithms while taking into account the existing threats and vulnerabilities of the network equipment and software of computer networks.

Keywords: deep traffic analysis, computer network, traffic encryption, VPN, neural traffic analysis, random trees committee.

The term "deep packet inspection" (DPI) [1] refers to the analysis of a network packet at the upper layers (the application and presentation layers) of the Open Systems Interconnection (OSI) model [2].

In addition to matching network packets [3] against standard patterns of parameters that unambiguously tie a packet to a specific application (for example, the header format or port numbers), a DPI system performs behavioral traffic analysis. This makes it possible to recognize applications that do not use known headers and data structures for data exchange.

For identification, a sequence of packets with the same characteristics is analyzed. The analyzed characteristics include the Source_IP:port - Destination_IP:port pair, the packet size, the frequency with which new sessions are opened per unit of time, etc. The analysis relies on behavioral (heuristic) models that correspond to such applications.
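The grouping described above can be sketched roughly as follows. This is only an illustration, not the author's implementation: the `Packet` record, the function names, and the choice of statistics (packet count, mean and maximum size, rate of new sessions) are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record; the field names are assumptions."""
    ts: float       # capture timestamp, seconds
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    size: int       # packet size, bytes


def flow_key(p):
    """Direction-independent key so both directions share one flow."""
    a = (p.src_ip, p.src_port)
    b = (p.dst_ip, p.dst_port)
    return (a, b) if a <= b else (b, a)


def flow_features(packets):
    """Group packets by endpoint pair and compute simple size statistics."""
    flows = defaultdict(list)
    for p in packets:
        flows[flow_key(p)].append(p)
    features = {}
    for key, pkts in flows.items():
        sizes = [p.size for p in pkts]
        features[key] = {
            "packets": len(pkts),
            "mean_size": sum(sizes) / len(sizes),
            "max_size": max(sizes),
        }
    return features


def new_sessions_per_second(packets):
    """Number of distinct flows divided by the capture span in seconds."""
    if not packets:
        return 0.0
    first_seen = {}
    for p in packets:
        first_seen.setdefault(flow_key(p), p.ts)
    span = max(p.ts for p in packets) - min(p.ts for p in packets)
    return len(first_seen) / span if span > 0 else float(len(first_seen))
```

A behavioral model would then compare such per-flow statistics with the profile expected for a given application.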

The main component of a DPI solution [4] is the classification module, which is responsible for classifying network flows. Classification can be performed at varying levels of detail depending on the purpose of the DPI application:

type of protocol or application (e.g. Web, P2P, VoIP),

specific application-layer protocol (HTTP, BitTorrent, SIP),

applications using the protocol (Google Chrome, uTorrent, Skype).

For encrypted data streams (for example, those protected by the TLS/SSL protocols), traffic analysis with traditional tools is impossible without recovering the encryption key. Finding the key requires substantial resources, so breaking the encryption is realistic only at the government or military level.

Therefore, developing algorithms that can classify the traffic of secure connections by protocol with the required level of detail is a relevant task.

Traffic classification makes it possible to identify the applications and protocols whose traffic is carried over the network. Classification also supports traffic management, optimization, and prioritization. After classification, every packet is marked as belonging to a specific protocol or application. This allows network devices to apply a quality of service (QoS) policy based on these labels and flags.
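As an illustration of label-based QoS marking, the sketch below maps a traffic class to a DSCP code point and derives the value of the IP DS/ToS byte. The policy table is purely hypothetical (the article does not specify one); the code points themselves are the standard values for Expedited Forwarding (46), best effort (0), and CS1 (8).

```python
# Hypothetical policy: which DSCP code point each traffic class receives.
DSCP_BY_CLASS = {
    "VoIP": 46,  # EF, Expedited Forwarding
    "Web": 0,    # default, best effort
    "P2P": 8,    # CS1, commonly used for low-priority traffic
}


def dscp_for(label):
    """Return the DSCP value for a class label, best effort if unknown."""
    return DSCP_BY_CLASS.get(label, 0)


def tos_byte(dscp):
    """DSCP occupies the upper six bits of the IP DS/ToS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2
```

For example, `tos_byte(dscp_for("VoIP"))` yields 184 (0xB8), the byte value that corresponds to Expedited Forwarding.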

There are two main methods for classifying traffic:

1. Payload-based classification. It relies on analyzing the fields of the data packets. This method is the most common, but it does not work with encrypted or tunneled traffic.

2. Classification based on statistical analysis (time between packets, session time, etc.).

A universal approach to traffic classification relies on the information contained in the IP packet header. This is usually the IP address (layer 3), the MAC address (layer 2), and the protocol used. This approach has its limitations.

Deep packet inspection (DPI) makes a more advanced packet classification possible. The main mechanism for identifying applications in DPI is signature analysis [3]. Each application has its own unique characteristics, which are entered into a signature database. Comparing a sample from the database with the traffic being analyzed makes it possible to determine the application or protocol. However, new applications appear periodically, so the signature database must also be updated to maintain high identification accuracy.
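A minimal sketch of signature matching, assuming the database is a simple list of payload prefixes. The table and function name are illustrative only. The prefixes reflect well-known protocol openings: the BitTorrent handshake starts with the byte 19 followed by "BitTorrent protocol", an SSH banner starts with "SSH-", and a TLS handshake record starts with the bytes 0x16 0x03.

```python
# Illustrative signature database: (protocol name, payload prefix).
SIGNATURES = [
    ("BitTorrent", b"\x13BitTorrent protocol"),
    ("SSH", b"SSH-"),
    ("HTTP", b"GET "),
    ("HTTP", b"POST "),
    ("TLS", b"\x16\x03"),
]


def identify(payload):
    """Return the first protocol whose signature prefixes the payload."""
    for name, prefix in SIGNATURES:
        if payload.startswith(prefix):
            return name
    return None  # no signature matched, e.g. a new or unknown application
```

A payload that matches nothing returns `None`, which corresponds to the case where the database has not yet been updated for a new application.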

There are several methods of signature analysis:

1. Pattern analysis. Applications leave characteristic sequences (patterns) in the data block of a packet. These can be used for identification and classification. Not every packet contains such a pattern, so the method does not always work.

2. Numerical analysis. Numerical analysis uses quantitative characteristics of a sequence of packets, such as the size of the data block, the response time, and the interval between packets. Analyzing several packets at once takes a long time, which reduces the effectiveness of this method.

3. Behavioral analysis, heuristic analysis. The method is based on the analysis of the traffic dynamics of the running application. While the application is running, it creates traffic that can also be identified and tagged [4].

4. Protocol/state analysis. The protocols of some applications consist of a sequence of specific actions. Analyzing such sequences makes it possible to identify the application accurately.

When working with encrypted traffic, behavioral and heuristic analysis is used. For more accurate identification, cluster analysis is used, which combines heuristic and behavioral analysis methods.

It is therefore relevant to develop an analysis algorithm that classifies the network traffic of secure connections of selected users according to a predefined set of categories.

Let's consider two scenarios for analyzing network traffic:

analysis of encrypted traffic,

analysis of encrypted traffic passing through a virtual private network (VPN).

If the organization's local network includes the module for analyzing the network traffic of encrypted connections, the traffic arrives from the edge router. The traffic is captured and preprocessed. The main characteristics of the data stream are extracted from the received files. A vector of primary features is formed, together with sessions lasting 15, 30, 60, and 120 seconds. Features are then generated and selected for training the neural network classifier. The prepared feature vector is fed into the module for neural network analysis of user sessions. The training and operating settings are set by the administrator.
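One possible way to cut a captured stream into sessions of a fixed duration is sketched below. It assumes that a session is a consecutive time window counted from the first packet and that empty windows are skipped. The article does not define the windowing rule, so both points are assumptions, as is the function name.

```python
SESSION_DURATIONS = (15, 30, 60, 120)  # seconds, as listed in the text


def split_into_sessions(timestamps, duration):
    """Group packet timestamps into consecutive windows of `duration` seconds.

    Windows are counted from the earliest timestamp; empty windows are skipped.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    start = ordered[0]
    sessions = {}
    for ts in ordered:
        index = int((ts - start) // duration)
        sessions.setdefault(index, []).append(ts)
    return [sessions[i] for i in sorted(sessions)]
```

Running the same capture through each value in `SESSION_DURATIONS` gives one set of sessions per duration, which can then be compared for classification accuracy.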

Next, the following information is sent to the decision block: the base block's decision on the traffic type, the probability that the traffic belongs to one of the main types, and the traffic types recognized by the neural network block that analyzes user sessions. The administrator can adjust the current decision in the traffic-type decision block.
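The article does not state how the decision block combines its inputs. The sketch below shows one plausible rule, given only as an assumption: an administrator correction takes precedence, a sufficiently confident neural network result comes next, and the base block's decision is the fallback. The function name, the `threshold` parameter, and its default value are hypothetical.

```python
def decide(base_label, nn_probs, admin_override=None, threshold=0.6):
    """Combine the inputs of the decision block (hypothetical rule).

    base_label     -- traffic type reported by the base block
    nn_probs       -- mapping of traffic type to probability from the neural block
    admin_override -- optional label set manually by the administrator
    Returns a (label, source) pair.
    """
    if admin_override is not None:
        return admin_override, "administrator"
    if nn_probs:
        label, prob = max(nn_probs.items(), key=lambda item: item[1])
        if prob >= threshold:
            return label, "neural"
    return base_label, "base"
```

Returning the source together with the label keeps a record of which block produced the final decision.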

Then the current traffic from the decision block and the tagged traffic from the basic traffic analysis module (sender's IP, recipient's IP, sender's port, recipient's port) are sent to the user session storage.

Next, the data on the reconnected user sessions is sent to the module that analyzes the traffic type and the user type. An information security specialist receives information about the types of users and their rights. The administrator interacts with the storage to view and extend the database and also sets the traffic capture parameters.

At the first stage, a fragment of the intercepted traffic is loaded, and then the classifier scenario is selected. Based on the features specified in the scenario, a training sample is formed to build an initial knowledge base. After the specified features are analyzed on the test sample, the accuracy of the classifier is determined. If the accuracy meets the requirements, the classifier state is saved; otherwise, the process returns to selecting the scenario type.
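The loop described above can be sketched as follows. The `train` and `evaluate` callables stand in for the real training and testing steps and are assumptions, as is the representation of a scenario as a name with a list of features.

```python
def build_classifier(scenarios, train, evaluate, required_accuracy):
    """Try scenarios in order until one meets the accuracy requirement.

    scenarios -- iterable of (name, features) pairs
    train     -- callable(features) -> model          (hypothetical)
    evaluate  -- callable(model, features) -> float   (hypothetical)
    Returns (name, model, accuracy), or None if no scenario qualifies.
    """
    for name, features in scenarios:
        model = train(features)              # build the initial knowledge base
        accuracy = evaluate(model, features)  # check on the test sample
        if accuracy >= required_accuracy:
            return name, model, accuracy     # the state would be saved here
    return None
```

A result of `None` corresponds to the case where every available scenario has been tried without reaching the required accuracy.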

Traffic classification is based on analyzing the time characteristics of the flow of intercepted network packets in order to form features of encrypted and VPN traffic (time-based features). Using the time characteristics of the stream reduces the computational cost of building the feature set extracted from encrypted network traffic, because fewer parameters have to be recorded.
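As an example of time-based features, the sketch below computes the flow duration and statistics of the packet inter-arrival times from timestamps alone. The article does not list the exact features used, so this particular set and the names in it are assumptions.

```python
from statistics import mean, pstdev


def time_features(timestamps):
    """Compute duration and inter-arrival time statistics for one flow."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        # A single packet has no inter-arrival times.
        return {"duration": 0.0, "iat_mean": 0.0, "iat_min": 0.0,
                "iat_max": 0.0, "iat_std": 0.0}
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return {
        "duration": ordered[-1] - ordered[0],
        "iat_mean": mean(gaps),
        "iat_min": min(gaps),
        "iat_max": max(gaps),
        "iat_std": pstdev(gaps),  # population standard deviation
    }
```

Such features need only the packet timestamps, so they can be computed without inspecting the encrypted payload.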

The experiment uses a network traffic dump labeled with 14 traffic types generated by various applications (7 for regular encrypted traffic and 7 for VPN traffic).

The quality criterion for traffic classification is the accuracy with which samples are classified. The accuracy can be assessed by cross-validation. The sample is divided into a training set and a test set: the training set is two-thirds of the data, and the test set is one-third.
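The two-thirds/one-third split and the accuracy criterion can be sketched as follows. Shuffling before the split and the fixed `seed` are assumptions added so that the example is reproducible; the article does not say how the samples are ordered.

```python
import random


def split_two_thirds(samples, seed=0):
    """Shuffle the samples and split them 2/3 for training, 1/3 for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


def accuracy(true_labels, predicted_labels):
    """Share of samples whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

With 30 samples this gives 20 for training and 10 for testing.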

The following algorithms are considered for solving the classification problem: random forest (RFT), k-nearest neighbors (KNN), and multilayer perceptron (MLP).
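Of the three algorithms, k-nearest neighbors is the simplest to show in a few lines. The sketch below is a plain-Python illustration of the method, not the classifier used in the experiment; the Euclidean distance, the default `k=3`, and the function name are assumptions.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, x, k=3):
    """Predict the label of `x` by majority vote among its k nearest neighbors.

    train_x -- list of feature vectors (sequences of numbers)
    train_y -- list of labels, one per training vector
    """
    if len(train_x) != len(train_y):
        raise ValueError("train_x and train_y must have the same length")
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of samples")
    # Indices of the training vectors, ordered by distance to x.
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

An odd `k` avoids tied votes when only two classes are present.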

The source data is real traffic generated by applications and services such as Skype, Facebook, etc.

For each type of traffic (VoIP, P2P, etc.), both regular sessions and sessions inside the created VPN tunnel are used, so there are 14 traffic categories in total: VoIP, VPN-VoIP, P2P, VPN-P2P, etc.

The traffic was captured using the Wireshark sniffer. An external VPN service was used for VPN traffic, and the connection was made using OpenVPN. To generate SFTP and FTPS traffic, an external service provider was used, with FileZilla as the client.

The influence of the session duration of the captured data stream on the classification accuracy has been established. The developed classifier demonstrates a recognition accuracy of up to 80% on the test sample. The MLP, RFT, and KNN algorithms showed almost identical results in all experiments.

It has also been found that the proposed classifiers work better when forming network traffic flows using short timeout values.

This method differs in the way features are generated and selected, which makes it possible to classify the traffic of secure connections of selected users according to a predefined set of categories. The developed algorithms make it possible to increase the security of the data transmission network by improving the algorithms for analyzing network traffic within a data leakage prevention system.

REFERENCES:

1. Overview of the DPI - Deep Packet Inspection technology [Electronic resource]. - Access mode: https://habr.com/post/111054/ (accessed 07.11.2023);

2. Olifer V. G. Computer networks. Principles, technologies, protocols / V.G. Olifer, N.A. Olifer. - St. Petersburg: Peter, 2011. - 944 p.;

3. Network packet analyzers [Electronic resource]. - Access mode: https://compress.ru/article.aspx?id=16244 (accessed 05.11.2023);

4. Russian DPI manufacturers and their platforms [Electronic resource]. - Access mode: https://vasexperts.ru/blog/rossijskie-proizvoditeli-dpi-i-ih-platfo/ (accessed 04.11.2023).
