
INTELLIGENT METHODS FOR AUTOMATIC INTRUSION DETECTION MODEL CONSTRUCTION IN INFORMATION SYSTEMS

Abduaziz ABDURAXMANOV, Alfraganus University
Tolibjon BO'RIBOYEV, Alfraganus University

Abstract

The article presents a set of tools that can be applied to various sources of audit data to create intrusion detection models. The central element of this approach is the application of Data Mining methods to widely collected audit data to automatically construct intrusion detection models that accurately capture the actual behavior (i.e., patterns) of intrusions and normal activities. This approach greatly reduces the need for manual coding and analysis of intrusion models, as well as for intuition in choosing statistical measures for profiles of normal use. Today's Internet is made up of nearly half a million different networks. In any network connection, identifying attacks by their types is a difficult task, as different attacks may involve various connections whose number may vary from a few to hundreds. To solve this problem, a novel hybrid network IDS called NID-Shield is proposed that classifies the dataset according to different attack types. Furthermore, the attack names found within attack types are classified individually, which helps considerably in predicting the vulnerability of individual attacks in various networks. The hybrid NID-Shield NIDS applies an efficient feature subset selection technique called CAPPER together with distinct machine learning methods. The UNSW-NB15 and NSL-KDD datasets are utilized for the evaluation of metrics. Machine learning algorithms are trained on the accurate, high-merit reduced feature subsets obtained from CAPPER and then assessed by cross-validation on the reduced attributes. Various performance metrics show that the hybrid NID-Shield NIDS applied with the CAPPER approach achieves a good accuracy rate and a low false positive rate (FPR) on the UNSW-NB15 and NSL-KDD datasets and performs well when compared with approaches found in the existing literature. Research in network security is a rapidly emerging topic in the domain of computer networking due to the ever-increasing density of advanced cyberattacks. Intrusion detection systems (IDSs) are designed to avert intrusions and to protect programs, data, and computer systems from illegitimate access. IDSs can classify intrinsic and extrinsic intrusions in an organization's computer networks and raise an alarm if a security infringement occurs in the organization's network.

Keywords: Intrusion Detection Systems, Data Mining, audit data, patterns, association rules


Most intrusion detection approaches rely on the analysis of system and network audit data. Network traffic can be captured with packet capture utilities, and operating system activity can be logged at the system call level. The fundamental premise is that, when audit mechanisms are enabled, evidence of both legitimate activities and intrusions will manifest in the audit data. Therefore, instead of statically analyzing all software sources, a more practical approach to intrusion detection is to analyze the audit records produced during network activity, system activity, and user actions.

INTRODUCTION

At an abstract level, an Intrusion Detection System (IDS) identifies features, i.e., individual pieces of evidence, at the level of system events or network packet data in audit records, and uses various modeling and analysis algorithms to reason about the available evidence [1-4]. Traditionally, IDSs have been developed using knowledge engineering techniques: expert knowledge of networks, operating systems, and attack methods is used to select features and hand-craft detection rules. Given the complexity of modern network environments and the sophistication of attackers, such expert knowledge often proves limited and unreliable.

On the other hand, Data Mining (DM) methods can be used to extract features and compute detection models from vast amounts of audit data [5-15]. Features computed from data can be more "objective" than those selected by experts, and inductively learned detection models can be more "generalizable" than manually coded rules (i.e., they may perform better at detecting new variants of known normal behavior or intrusions). The DM approach can therefore play a significant role in IDS development. It should be noted that DM methods complement, rather than replace, expert knowledge. The goal is to provide tools, grounded in sound statistics and machine learning principles, for quickly and easily developing better detection models. For example, experts can review and edit the patterns and rules produced by DM methods and translate them into efficient detection modules.

In general, Data Mining (DM) is the process of extracting useful models from large data repositories. The continued development of DM technology has made available a variety of algorithms drawn from statistics, pattern recognition, machine learning, and databases. Several types of algorithms are particularly useful for mining audit data:

Classification: maps a data element into one of several predefined categories. These algorithms normally output "classifiers", for example in the form of decision trees or rules. An ideal application to intrusion detection is to gather a sufficient amount of "normal" and "abnormal" audit data for a user or program, and then apply a classification algorithm to learn a classifier that can label, or predict, new unseen audit data as belonging to the normal or abnormal class.

Association analysis: determines relationships between fields in database records. Correlations among features in audit data, such as the correlation between a command and its arguments in a user's command history, can serve as a basis for constructing profiles of normal users (see the sketch following this list).

Sequence analysis: models sequential patterns. These algorithms can detect frequent sequences of audit events that occur together. Such frequent event patterns provide guidelines for incorporating temporal statistical measures into intrusion detection models. For example, patterns from audit data containing denial-of-service (DoS) attacks suggest that several per-host and per-service measures should be included.

Examples of research implementations of DM methods in intrusion detection include ADAM (Audit Data Analysis and Mining) at George Mason University [16-18], MADAM ID (Mining Audit Data for Automated Models for Intrusion Detection) at Columbia University [9, 10], MINDS at the University of Minnesota [19, 20], the Data Mining for Network Intrusion Detection project at MITRE Corporation [21], and several other systems.

In particular, the MADAM ID framework is built around the application of DM methods to construct intrusion detection models. Its basic components include programs for learning classifiers and meta-classifiers, association rules for link analysis, and frequent episodes for sequence analysis. It also provides a support environment that allows system builders to interactively drive the process of building and evaluating detection models. The end products are concise, intuitive rules that can detect intrusions and can easily be reviewed and edited by security experts as needed.

In general, the process of applying a DM-based IDS model can be summarized as shown in Figure 1. The raw binary audit data is first converted into ASCII (American Standard Code for Information Interchange) network packet or host event data, which is then aggregated into connection records or host session records containing key features such as service, duration, and so on. DM programs are applied to the connection records to compute frequent patterns (e.g., association rules and frequent episodes), which are in turn analyzed to construct additional features for the connection records. Classification programs, such as RIPPER [4, 22, 23], are then used to inductively learn detection models. The process is naturally iterative, since poor performance of a classification model often indicates the need for better patterns and features.

This process represents a crucial part of intrusion detection system development using Data Mining methods, allowing for the automated analysis and detection of anomalies in audit data to enhance the security of network infrastructure.

[Figure: raw audit data is transformed into ASCII packet/event data and then into connection records, patterns, and detection models; the original image is not recoverable from the source text.]

Figure 1 - Intrusion Detection Model Formation Process

In this approach, learned rules replace manually coded patterns and profiles, and feature sets and measures are chosen by examining statistical regularities computed from audit data. Meta-learning is used to learn how to correlate evidence across multiple detection models and to produce a combined detection model.

Such a model does not eliminate the need for preprocessing and analysis of raw audit data, such as tcpdump or BSM audit data. In fact, to build intrusion detection models for network systems, the proposed DM programs use preprocessed audit data in which each record corresponds to a high-level event (e.g., a network connection or host session). Each record typically includes an extensive set of features describing the event, such as connection duration, bytes transferred, and so on. While analyzing and summarizing raw audit data is an important task for an IDS, such common utilities should arguably be developed by network and operating system experts and made available to all IDSs as low-level building blocks. Bro and NFR can be considered examples of such reliable utilities, since both perform IP packet filtering and reassembly and allow event handlers to output summarized connection records. The DM model described here assumes that such building blocks are available for IDS construction.

An example of feature construction using the discussed DM model is provided in Figure 2.

However, it should be noted that this DM model produces misuse detection models for network and host systems, as well as anomaly detection models for users. Efforts should therefore be made to extend it to build anomaly detection models for networks and hosts as well.

The first step in applying or developing DM approaches for an application is to form a basic understanding of the domain. Let's briefly consider the main characteristics of audit data.

[Figure: patterns mined from records of anomalies and misuse are compared to derive new features; the original image is not recoverable from the source text.]

Figure 2 - Feature Construction Using DM Model

Firstly, audit data is "raw": it is in binary format, unstructured, and time-dependent. For DM, the audit data must be preprocessed into a suitable form, i.e., ASCII tabular data with attributes (or features). For example, the output of [26] contains binary records describing network packets, sorted by timestamp (packet arrival time). To analyze network connections, all packets belonging to the same connection must be "summarized" (a minimal sketch of this summarization step appears after the third characteristic below). Connection data in ASCII form may include, for each connection, the source and destination host addresses, the service (e.g., telnet, ftp), bytes transferred, and other features describing the connection's activity. The main goal of preprocessing audit data is to extract and construct relevant features that enable effective detection models; the DM task is to develop methods that automate this chain of preprocessing and feature extraction steps.

Secondly, audit data carries rich network and system semantics. For instance, network connections originating from the same host are likely to belong to the same user or program, and repeated requests for the same service may be "repeats" by a specific user or running program. Such semantic or contextual information is highly valuable for intrusion detection. The DM task is to tune generic mining algorithms so that only relevant patterns are extracted from the audit data.

Thirdly, audit data is high-speed, high-volume streaming data. Audit mechanisms are designed to record all network and system activity in great detail so that evidence of intrusions is not missed. The resulting speed and volume of data require detection models to execute efficiently; otherwise, a prolonged delay in data analysis simply opens a time window for a successful attack. The DM task is to develop methods for computing detection models that are not only accurate but also time-efficient.

The proposed algorithm for applying intelligent methods to automatically build intrusion detection models consists of three stages: pattern mining, feature construction, and efficient model construction.

The first stage computes association rules and frequent episodes from audit data, capturing intra- and inter-audit-record patterns. These frequent patterns can be viewed as statistical summaries of the network and system events captured in the audit data, since they measure correlations among feature sets and the temporal co-occurrence of events.

Basic association rule and frequent episode algorithms do not incorporate any domain-specific knowledge. In other words, if I is the interestingness score of a pattern p, then I(p) = f(support(p), confidence(p)), where f is a ranking function. As a result, the basic algorithms can generate many rules that are "uninteresting" (i.e., not useful). When adapting these algorithms to audit data, schema-level knowledge is incorporated into the interestingness score. If IA is an interestingness score of a pattern p that measures whether p contains the designated important (i.e., "interesting") attributes, the extended score is Ie(p) = fe(IA(p), f(support(p), confidence(p))) = fe(IA(p), I(p)), where fe is a ranking function that considers first the attributes in the pattern and then its support and confidence values.

Let's consider two important types of schema-level knowledge about audit data:

Firstly, there is a partial "order of importance" among the attributes of audit records. Some attributes are essential for describing the data, while others only provide auxiliary information. For example, a network connection can be uniquely identified by the combination of its start time, source host, source port, destination host, and destination port; these are the essential attributes for describing network data. It can be argued that "relevant" association rules should describe patterns involving essential attributes. An essential attribute (or attributes) serves as an axis attribute when used as a constraint in the association rule algorithm: during candidate generation, an itemset must contain the value(s) of the axis attribute(s). Correlations among non-axis attributes are assumed to be uninteresting; in other words, if pattern p contains the axis attribute(s), then IA(p) = 1, otherwise IA(p) = 0. To avoid a flood of "uninformative" episode rules, the basic frequent episode algorithm is extended to compute frequent sequential patterns in two stages: first compute frequent associations using the axis attribute(s), then generate frequent sequential patterns from these associations.

Another interesting type of schema-level knowledge is that some attributes can be references to other attributes. A group of events is considered related if they share certain reference attribute values; for instance, connections to the same destination host can be related to each other. When mining patterns of such related events, the reference attribute is used as a constraint: when forming an episode, an additional condition is that, at a minimum, the records covered by its constituent itemsets share the same value(s) of the reference attribute(s). In other words, if the itemsets of r refer to the same reference attribute value(s), then IA(r) = 1, otherwise IA(r) = 0.

Patterns, i.e., frequent episodes, computed by the extended algorithm from intrusion data can be compared with patterns mined from normal data to identify those that appear only in the intrusion data. These patterns are then used to construct features [27]. The idea is to first encode patterns as numbers so that "similar" patterns map to "close" numbers, and then to identify the intrusion-only patterns by comparing the numbers and ranking the results.

The proposed encoding procedure converts each pattern into a number in which the order of digit significance corresponds to the order of importance of the features. During encoding, each unique feature value is mapped to a digit value. The "distance" between two patterns is then a number in which each digit is the absolute difference between the corresponding digits of the two encodings. The comparison procedure computes an "intrusion score" for each pattern from the intrusion data set, namely its lower-bound distance to all patterns from the normal data, and outputs the top-ranked patterns with the highest intrusion scores as "intrusion-only" patterns.

As an example, consider a SYN flood attack, in which the attacker uses many spoofed source addresses to send a flood of SYN connections (only the first SYN packet, i.e., the connection request, is sent) to a port of the victim host (e.g., http) in a very short time; the victim's connection buffer fills up, leading to denial of service. The table shows one of the top intrusion-only patterns, obtained using service as the axis attribute and dst_host as the reference attribute.

Table - Example of Intrusion Pattern

Frequent episode: (flag = S0, service = http, dst_host = victim), (flag = S0, service = http, dst_host = victim) → (flag = S0, service = http, dst_host = victim) [0.93, 0.03, 2]

Meaning: 93% of the time, after two http connections with the S0 flag are made to the victim host, a third similar connection is made within 2 seconds of the first of the two; this pattern occurs in 3% of the data.

Each intrusion pattern serves as a guide for adding additional features to connection records in order to build more effective classification models. The following automated procedure is used to parse a frequent episode and construct features from it:

Let F0 (e.g., dst_host) be the reference feature, and let w seconds be the duration of the episode. The following features are added, each computed over only the connections of the past w seconds that share the same value of F0 as the current connection:

A feature that counts these connections.

Let F1 be service, src_dst, or dst_host, other than F0 (i.e., F1 is one of the essential features). If the same F1 value (e.g., "http") appears in all itemsets of the episode, add a feature computing "the percentage of connections sharing the same F1 value as the current connection"; otherwise, add a feature computing "the percentage of distinct F1 values".

Let V2 be a value (e.g., "S0") of a feature F2 (e.g., flag), where F2 is neither F0 nor F1 (i.e., V2 is a value of a non-essential feature). If V2 appears in all itemsets of the episode, add a feature computing "the percentage of connections having the same V2"; otherwise, if F2 is a numerical feature, add a feature computing "the average value of F2".

This procedure parses a frequent episode and uses three operators, count, percent, and average, to construct statistical features. These features are temporal, since they consider only the connections within the time window w that share the same reference feature value.

The feature construction algorithm follows from a straightforward interpretation of a frequent episode. For instance, if the same feature value appears in all itemsets of an episode, then a high percentage of records share that value. Essential and non-essential features are treated differently. Essential features describe the anatomy of an intrusion, e.g., "the same service (i.e., port) is targeted". Their actual values, e.g., "http", are often not significant, because the same attack method can be applied to different targets, e.g., "ftp". By contrast, the actual value of a non-essential feature, e.g., flag = S0, often indicates an invariant of the intrusion, since such values summarize connection behavior under the network protocols. For the "SYN flood" pattern in the table, the procedure adds the following features: a count of connections to the same dst_host in the past 2 seconds and, among those connections, the percentage with the same service and the percentage with the "S0" flag.

A detection model is considered effective if its detection (analysis) and computation costs are low enough that the model does not fall behind the data stream at run time (i.e., it can detect and respond to an intrusion before significant damage is done). The computational cost of a model consists mainly of the cost of computing the required features. Feature cost includes not only the time needed to compute the value but also the delay before the value becomes available (i.e., when it can first be computed).

The features can be categorized into three relative cost levels. Level 1 features, such as service, are computed from the first three packets (or events) of a connection (or host session) and typically require only simple record keeping. Level 2 features are computed in the middle of or at the end of a connection, using only information about the current connection; they usually require straightforward bookkeeping. Level 3 features are computed using information from all connections within a time window of the current connection, and are often computed as aggregates of level 1 and level 2 features. Quantitative values were assigned to these cost levels based on run-time measurements of a prototype system built on Network Flight Recorder (NFR) [28]: level 1 features cost 5, level 2 features cost 10, and level 3 features cost 100. It is important to note that level 1 and level 2 features must be computed individually. However, since all level 3 features require iterating through the entire set of connections in the current window, they can all be computed together in a single pass. This saves computational cost when multiple level 3 features are needed for the analysis of a given connection.

To reduce the computational cost of an intrusion detection model, low-cost features should be used wherever possible while maintaining the desired level of accuracy. The approach is to build multiple models, each using features of different cost levels. Low-cost models are always evaluated first by the IDS, and high-cost models are used only when the low-cost models cannot predict with sufficient precision. A cascading approach based on sets of RIPPER rules is proposed.

Before discussing the details of the proposed approach, it is worth outlining the advantages and disadvantages of the two forms of rule sets that RIPPER can generate: ordered and unordered.

Ordered rule sets: An ordered rule set has the form "if r1 then i1, else if r2 then i2, ..., else default". Before learning rules from a dataset, RIPPER heuristically orders the classes in one of the following ways: +frequency, by increasing frequency; -frequency, by decreasing frequency; a user-defined ordering; or mdl, a minimum description length heuristic ordering. After ordering the classes, RIPPER finds rules that separate class1 from classes class2, ..., classn, then rules that separate class2 from classes class3, ..., classn, and so on. The final class, classn, becomes the default. The end result is that the rules for a single class are always grouped together, and the rules for classi may be simplified, because they can assume that an example belongs to one of classi, ..., classn. If an example is covered by rules of two or more classes, the conflict is resolved in favor of the class that comes first in the ordering.

An ordered rule set is typically concise and efficient. Evaluating an ordered rule set does not require testing each rule individually: evaluation proceeds from the top of the rule set to the bottom until some rule fires, and the features used by each rule can be computed one by one as evaluation progresses. The computational cost of evaluating an ordered rule set for a given connection is the total cost of the unique features used up to the point where a prediction is made. In any reasonable network environment, most connections are normal, so a -frequency rule set is likely to have low computational cost and high precision in identifying normal connections with its top rules. In contrast, a +frequency rule set is likely to have higher computational cost but better precision in classifying intrusions, since its rules describe intrusions while normal connections fall through to the default rule at the bottom. The performance of the user-defined and mdl orderings lies between -frequency and +frequency, depending on the class ordering they produce.

Unordered rule sets: An unordered rule set has at least one rule for each class, and there are usually many rules for the frequently occurring classes. There is also a default class, used for prediction when none of the rules fire. Unlike with an ordered rule set, all rules are evaluated during prediction, and conflicts are resolved in favor of the most precise rule. In general, an unordered rule set contains more rules and is less efficient to execute than a -frequency or +frequency ordered rule set, but it usually contains several high-precision rules for the most frequent class, normal.

Considering the advantages and disadvantages of ordered and unordered rule sets, the following multi-model (cascading) rule set approach is proposed:

First, multiple training sets T1, ..., T4 are generated using different subsets of features: T1 uses only the cheapest (level 1) features, T2 adds the features of the next cost level, and so on, up to T4, which uses all available features.

Rule sets R1, ..., R4 are then learned from their respective training sets. R4 is learned as either a +frequency or -frequency rule set for efficiency, since it may contain the most expensive features. R1, ..., R3 are learned as either -frequency or unordered rule sets, since they should contain precise rules for classifying normal connections, and normal connections should be filtered out as early as possible to reduce computational cost.

A precision measure pr is computed for each rule r, except for the rules in R4.

A threshold value Ti is determined for each class i; it specifies the minimum precision required for a classification by any rule set other than R4 to be accepted.

In real-time execution, feature computation and rule evaluation proceed as follows:

R1 is evaluated and produces a prediction i.

If pr > Ti, the prediction i is emitted. In this case, no more features are computed, and the system moves on to the next connection. Otherwise, the additional features required by R2 are computed and R2 is evaluated.

Evaluation continues with R3 and then R4 until a prediction is emitted. Evaluating R4 does not involve any threshold check and always produces a prediction.

The computational cost for a single connection is the total cost of all unique features computed before the prediction is made. If any level 3 features (of cost 100) are used, the cost of 100 is counted only once, since all level 3 features are computed in a single pass.

The precision and threshold values can be obtained during model training, either from the training set or from a separate hold-out validation set. The thresholds are set to the precision of R4 for each class on that dataset, since the cascade should achieve the same precision as a single R4. The precision of a rule is easily computed from its positive count p and negative count n as p/(p + n). Thresholds set this way ensure that, on average, predictions emitted by the first three rule sets are no less precise than those of a single R4.

CONCLUSION

A set of tools has been presented that can be applied to various audit data sources to create intrusion detection models. The approach is data-centric and treats intrusion detection as a data analysis process. Anomaly detection is achieved by searching for patterns of normal use in audit data, while misuse detection is achieved by encoding and matching intrusion patterns against audit data. The central element of the approach is the application of intelligent methods to widely collected audit data to automatically construct intrusion detection models that accurately capture the actual behavior (i.e., patterns) of intrusions and normal activities. This significantly reduces the need for manual analysis and coding of intrusion models, as well as the reliance on intuition in selecting statistical measures for normal usage profiles.

"The obtained models can be more effective as they are computed and validated using a large volume of audit data."

LITERATURE

Said H.M., et al. Intelligence Techniques for e-government applications // International Journal of Emerging Trends & Technology in Computer Science (IJETTCS). - March-April 2015. - Vol. 4, Issue 2. - P. 6-20.

El-Bakry H.M., Mastorakis N. A Real-Time Intrusion Detection Algorithm for Network Security // WSEAS Transactions on Communications. - December 2008. - Vol. 7, Issue 12. - P. 1222-1234.

Hanaa M.S., et al. Neural networks approach for monitoring and securing the E-Government informational systems // European Journal of Computer Science and Information Technology. - December 2014. - Vol. 2, №4. - P. 29-39.

Lee W., Fan W. Mining System Audit Data: Opportunities and Challenges // SIGMOD Record. - 2001. - Vol. 30, №4. - P. 35-44.

Abhaya K.K., et al. Data Mining Techniques for Intrusion Detection: A Review // International Journal of Advanced Research in Computer and Communication Engineering. - June 2014. - Vol. 3, Issue 6. - P. 6938-6942.

Coppolino L., et al. Applying Data Mining Techniques to Intrusion Detection in Wireless Sensor Networks. /In: 2013 Eighth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 28-30 Oct. 2013. - P. 247-254.

Hanaa M.S., et al. A Study on Data Mining Frameworks. /In: Cyber Security. - WSEAS/NAUN International Conferences, Dubrovnik, Croatia. - June 2013. - P. 204-209.

Lee W., et al. Real time data mining-based intrusion detection. /In: Proceedings of the DARPA Information Survivability Conference & Exposition II. - 2001. - Vol. 1. - P. 89-100.

Lee W., Stolfo S.J. Data Mining Approaches for Intrusion Detection. /In: Proceedings of the USENIX Security Symposium, San Antonio, TX, January 1998.

Lee W., et al. Mining audit data to build intrusion detection models. /In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. - New York, 1998.

Maimon O., Rokach L. (Eds.) Data Mining and Knowledge Discovery Handbook. - 2nd ed. - Springer Science+Business Media, LLC, 2010.

Nadiammai G.V., Hemalatha M. Effective approach toward Intrusion Detection System using data mining techniques // Egyptian Informatics Journal. - March 2014. - Vol. 15, Issue 1. - P. 37-50.


Wankhade K., et al. An efficient approach for Intrusion Detection using data mining methods. /In: 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 22-25 Aug. 2013. - P. 1615-1618.

Youssef A., Emam A. Network intrusion detection system using data mining and network behavior analysis // International Journal of Computer Science & Information Technology (IJCSIT). - December 2011. - Vol. 3, №6. - P. 87-98.

Chandola V., et al. Data Mining for Cyber Security. /In: Data Warehousing and Data Mining Techniques for Computer Security. - Springer, 2006. - P. 83-107.

Barbara D., et al. ADAM: Detecting Intrusions by Data Mining. /In: Proceedings of the 2nd Annual IEEE Information Assurance Workshop. - West Point, NY, June 2001.

Barbara D., et al. Detecting Novel Network Intrusions Using Bayes Estimators. /In: Proceedings of the First SIAM Conference on Data Mining, Chicago, April 2001.

Singhal A. Data Warehousing and Data Mining Techniques for Cyber Security. - Springer, 2007.

Ertoz L., et al. Detection of Novel Attacks using Data Mining. /In: Proceedings of the IEEE Workshop on Data Mining and Computer Security, November 2003.

Lazarevic A., et al. A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection. /In: Proceedings of the Third SIAM International Conference on Data Mining. - San Francisco, May 2003.

Maloof M.A. Machine Learning and Data Mining for Computer Security: Methods and Applications. - Springer-Verlag London Limited, 2006.

Cohen W.W. Fast effective rule induction. /In: Proceedings of the 12th International Conference on Machine Learning. - Tahoe City: Morgan Kaufmann, 1995. - P. 115-123.
