CC BY



UDC 519.68; 681.513.7; 612.8.001.57; 007.51/.52

V.M. TOMASHEVSKII, Y.O. OLIYNIK, V.V. YASKOV, V.M. ROMANCHUK

NTUU "Igor Sikorsky Kyiv Polytechnic Institute"

REALTIME TEXT STREAM ANOMALIES ANALYSIS SYSTEM

Our project is a system for detecting anomalies in data streams in real time. An anomaly is a deviation from the norm or from ordinary behavior. Information of many kinds can be treated as a data stream: server logs, records of customer visits to a website, clicks on online advertisements, and so on. Tweets, the messages published by people all over the world on Twitter.com, are also a data stream. A tweet is a post published on a user's page of this social network; by Twitter's rules a message must not exceed 140 characters.

The data stream is analyzed with the MLlib library, which is part of Apache Spark. In general, we can take any data stream and build test samples that are acceptable to us and stay within the norm, as well as test samples that contain anomalies. In our case the anomalous sample consists of tweets written by people connected with terrorist acts. The program alerts the user when it detects social network messages with anomalies, in other words suspicious messages that may precede terrorist attacks. Ideally, the program can help to prevent a terrorist act and save lives.

Methods for problem solving. To recognize anomalies in a data stream in real time, our system uses the Isolation Forest algorithm. The algorithm partitions the data by building isolation trees (random decision trees); the anomaly estimate is then calculated from the length of the path needed to isolate an observation. The trees are built from keywords extracted from tweets with the RAKE algorithm. Before they can be used in the decision trees, the keywords are converted to feature vectors with pretrained word2vec models from MLlib.

Software architecture. The program integrates several powerful technologies. We take the data stream from twitter.com through the Twitter API, which passes posts and all their attributes to Apache Kafka. Kafka is a system that works on the publish-subscribe principle. The data stream goes to the broker, and the server processes it in RDD batches. Under the control of the ZooKeeper server, the consumer subscribes to the incoming data stream, which is then passed from the consumer to Apache Spark.

Keywords: data mining, classification of textual information, content analysis, machine learning, classification algorithms

SYSTEM FOR REAL-TIME ANALYSIS OF ANOMALIES IN TEXT DATA STREAMS

The project is devoted to the development of a system for monitoring anomalies in text data streams in real time. The aim of the development is to provide real-time analysis of a text data stream using machine learning methods and algorithms.

Keywords: data mining, classification of textual information, content analysis, data streams, classification algorithms

Problem Statement

There are many systems for text analysis, such as CUE-CNN [8] and election analysis [4], but these systems cannot work in real time. It is therefore necessary to develop a scalable system that can process a data stream in real time.

The main goal of this article is to create a text stream anomaly analysis system that can work in real time.

Today the importance of text mining is rapidly increasing because of the large amount of text information available through the Internet. Since millions of characters of content are produced every day, a person is physically unable to process all of this information.

Existing text mining tools (for example, the "Statistica" system) do not give acceptable results for the text classification task; therefore, new algorithms need to be developed.

Almost all companies now seek to store data reliably and to maintain the corresponding reporting and documentation, regardless of their end product or the sphere in which they operate.

Purpose of the Study

A few years ago this was done by storing data in relational databases, and all reporting data was calculated at night by automated processes. But the development of Internet technology and its broad usage have forced everyone to move further. The next step was big data processing, followed by processing in real time: every event immediately arrives at the server and is stored, and the server immediately reacts to the new event and reflects it in its reports.

Our project is a system for detecting anomalies in data streams in real time. An anomaly is a deviation from the norm or from ordinary behavior. Information of many kinds can be treated as a data stream: server logs, records of customer visits to a website, clicks on online advertisements, and so on. Tweets, the messages published by people all over the world on Twitter.com, are also a data stream. A tweet is a post published on a user's page of this social network; by Twitter's rules a message must not exceed 140 characters.

Analysis of Recent Research and Publications

The data stream is analyzed with the MLlib library, which is part of Apache Spark [2]. In general, we can take any data stream and build test samples that are acceptable to us and stay within the norm, as well as test samples that contain anomalies. In our case the anomalous sample consists of tweets written by people connected with terrorist acts. Thus, it is possible to predict, with some probability, such events or the involvement of tweet authors in terrorist acts. The program alerts the user when it detects social network messages with anomalies, in other words suspicious messages that may precede terrorist attacks. Ideally, the program can help to prevent a terrorist act and save lives.

What is an anomaly in text data?

Point anomalies appear when an individual data sample can be considered anomalous with respect to the rest of the data. This type of anomaly is rather rare.

Contextual anomalies [3] are observed when a data sample is abnormal only in a certain context. To identify anomalies of this type, the key step is to determine the contextual and behavioral attributes. A contextual attribute may be a position in space or a more complex combination of data properties. Behavioral attributes are the characteristics of the data that are not contextual. Thus, a data sample can be a contextual anomaly in one context and entirely normal in another.

Collective anomalies [3] arise when a sequence of related data samples (for example, a segment of a series) is abnormal with respect to the whole dataset. An individual data sample in such a sequence may not be an anomaly, but the joint occurrence of such instances is a collective anomaly. In addition, while point and contextual anomalies can be observed in any dataset, collective anomalies are observed only where the instances are related.

Description of the Main Research Material

Methods for problem solving.

Like any machine learning algorithm, ours requires initial data. The complexity of the task lies in choosing a data structure that represents the documents well enough, and in choosing a cost function.

There are several classification approaches for finding anomalies. Depending on the algorithm applied, the result of anomaly identification may be either a label marking a data instance as abnormal or an estimate of the probability that the instance is abnormal.

Supervised recognition requires a training sample that fully describes the system and contains data instances of both the normal and the abnormal class. The algorithm works in two stages: learning and recognition. In the first stage, a model is constructed against which unlabeled data instances are then compared. The main difficulty for algorithms based on supervised recognition is forming the training data. The abnormal class is often represented by far fewer instances than the normal class, which may lead to inaccuracies in the resulting model. In such cases it is recommended to generate anomalies artificially.

Partially supervised recognition mode. The initial data contains only the normal class. After learning from the normal class, the system can determine whether new data belongs to it or, conversely, does not. Algorithms that operate in partially supervised recognition mode do not require information about the abnormal class of instances. As a result, such a system can be broadly used; however, the approach is not very effective when a particular class of anomalies must be detected.

Unsupervised recognition mode. It is applied when there is no a priori information about the data. Unsupervised recognition algorithms are based on the assumption that abnormal instances occur much less frequently than normal ones. The data is processed and the most remote instances are defined as abnormal.

The problem definition.

The problem of text classification can be formulated as the task of approximating an unknown function

$$\Phi: D \times C \to \{0, 1\} \tag{1}$$

(which describes how text documents should be classified) by a function

$$K: D \times C \to \{0, 1\}, \tag{2}$$

which is the classifier, where $C = \{c_1, c_2, \dots, c_{|C|}\}$ is the set of possible categories and $D = \{d_1, d_2, \dots, d_{|D|}\}$ is the set of documents:

$$\Phi(d_j, c_i) = \begin{cases} 1, & \text{if } d_j \in c_i, \\ 0, & \text{if } d_j \notin c_i. \end{cases} \tag{3}$$
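
As a minimal illustration of definitions (1)-(3), the sketch below represents the unknown target function as a lookup over labeled document-category pairs and measures how often a candidate classifier K agrees with it. The documents, categories, and the keyword rule are hypothetical placeholders chosen for the example; they are not taken from the system described here.

```python
from itertools import product

# Hypothetical toy data: a set of documents D and a set of categories C.
documents = {
    "d1": "server error log entry",
    "d2": "great football match today",
    "d3": "disk failure on server",
}
categories = ["tech", "sport"]

# Target function: 1 if document d_j belongs to category c_i, else 0.
true_labels = {("d1", "tech"), ("d2", "sport"), ("d3", "tech")}

def phi(doc_id, category):
    return 1 if (doc_id, category) in true_labels else 0

# Candidate classifier K: a naive keyword rule (illustrative only).
KEYWORDS = {"tech": {"server", "disk"}, "sport": {"football", "match"}}

def k(doc_id, category):
    words = set(documents[doc_id].split())
    return 1 if words & KEYWORDS[category] else 0

def agreement(target, classifier):
    """Fraction of (document, category) pairs on which both functions agree."""
    pairs = list(product(documents, categories))
    hits = sum(target(d, c) == classifier(d, c) for d, c in pairs)
    return hits / len(pairs)

print(agreement(phi, k))
```

The closer the agreement is to 1, the better K approximates the target function on the labeled pairs.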

To recognize anomalies in a data stream in real time, our system uses the Isolation Forest [5] algorithm. The algorithm partitions the data by building isolation trees (random decision trees); the anomaly estimate is then calculated from the length of the path needed to isolate an observation. The trees are built from keywords extracted from tweets with the RAKE algorithm. Before they can be used in the decision trees, the keywords are converted to feature vectors with pretrained word2vec models from MLlib.
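
The path-length idea can be sketched in a few lines of plain Python. This is a simplified, single-machine illustration of the Isolation Forest scoring rule from [5], not the Spark-based implementation used in the system; the function names, the two-dimensional toy data, and the parameter values (100 trees, subsample size 64) are assumptions made for the example.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(point, node, depth=0):
    """Depth at which the point reaches a leaf, adjusted for the leaf size."""
    if "split" not in node:
        return depth + c_factor(node["size"])
    child = node["left"] if point[node["dim"]] < node["split"] else node["right"]
    return path_length(point, child, depth + 1)

def fit_forest(data, n_trees=100, sample_size=64, seed=0):
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))  # assumes at least 2 points
    max_depth = math.ceil(math.log2(sample_size))
    forest = [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
              for _ in range(n_trees)]
    return forest, sample_size

def anomaly_score(point, forest, sample_size):
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    mean_path = sum(path_length(point, tree) for tree in forest) / len(forest)
    return 2.0 ** (-mean_path / c_factor(sample_size))

# Toy data: 200 points around the origin stand in for "normal" feature vectors.
data_rng = random.Random(42)
normal = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
forest, psi = fit_forest(normal)
inlier_score = anomaly_score((0.0, 0.0), forest, psi)
outlier_score = anomaly_score((8.0, 8.0), forest, psi)
print(inlier_score < outlier_score)
```

A point far from the bulk of the data is separated after only a few random splits, so its average path is short and its score is higher than that of a point in the dense region.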

Software architecture.

The program integrates several powerful technologies. We take the data stream from twitter.com through the Twitter API, which passes posts and all their attributes to Apache Kafka [1]. Kafka is a system that works on the publish-subscribe principle. The data stream goes to the broker, and the server processes it in RDD batches. Under the control of the ZooKeeper server, the consumer subscribes to the incoming data stream, which is then passed from the consumer to Apache Spark.

There, the data is analyzed with the MLlib module through the embedded Spark Streaming module, and a machine learning model is built and trained. Posts are processed using the Twitter API. Posts and notifications about detected anomalies are displayed in the interface. Spark integrates well with the HBase database, where all publications and instance attributes are stored. All this can be seen in the deployment diagram.
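
The publish-subscribe principle mentioned above can be illustrated with a toy in-memory broker. This sketch does not use the Kafka client API; the class `InMemoryBroker`, the topic name, and the consumer names are hypothetical and only show how independent consumers read the same topic at their own pace.

```python
from collections import defaultdict

class InMemoryBroker:
    """Toy stand-in for a message broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)
        self._offsets = defaultdict(int)  # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        """A producer appends a message to the end of the topic log."""
        self._logs[topic].append(message)

    def poll(self, consumer, topic, max_messages=10):
        """A consumer reads the next batch and advances only its own offset."""
        start = self._offsets[(consumer, topic)]
        batch = self._logs[topic][start:start + max_messages]
        self._offsets[(consumer, topic)] = start + len(batch)
        return batch

broker = InMemoryBroker()
for text in ["first tweet", "second tweet", "third tweet"]:
    broker.publish("tweets", {"text": text})

first_batch = broker.poll("analyser", "tweets", max_messages=2)
second_batch = broker.poll("analyser", "tweets", max_messages=2)
dashboard_batch = broker.poll("dashboard", "tweets")
print(len(first_batch), len(second_batch), len(dashboard_batch))  # 2 1 3
```

Because each consumer keeps its own offset, the producer does not need to know who reads the messages or how fast, which is the property that lets the analysis and dashboard components scale independently.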

Processing pipeline.

Tweets are consumed from the API by tweetsProducer. It extracts keywords from the tweet content with the RAKE algorithm and pushes them, together with some additional data extracted from the tweet, to a Kafka topic.

Figure 2. Processing pipeline.
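
The keyword extraction step can be sketched as a simplified RAKE: the text is split into candidate phrases at stopwords and punctuation, each word is scored by the ratio of its degree to its frequency, and a phrase score is the sum of its word scores. The stopword list, the function names, and the sample sentence are illustrative assumptions; the actual tweetsProducer code is not shown in the paper.

```python
import re
from collections import defaultdict

# A tiny illustrative stopword list; real RAKE setups use a much larger one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to",
             "for", "with", "that", "this", "it", "as", "by", "from", "be"}

def candidate_phrases(text):
    """Split text into runs of non-stopwords, breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[^\w\s#@'-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_keywords(text, top_n=3):
    """Return the top_n (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, including the word itself
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

sample = ("Real-time anomaly detection in text streams is a hard problem "
          "for stream processing systems")
print(rake_keywords(sample))
```

Longer phrases made of words that rarely appear alone receive higher scores, which is why multi-word expressions tend to rank above single common words.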

Messages from the Kafka topic created by the producer are consumed by tweetAnalyser. TweetAnalyser is a Spark application that uses a trained word2vec [10] model to transform keywords into numeric vectors; these vectors are passed to Isolation Forest models to detect outliers. Outliers detected in the previous step are then saved to a MongoDB collection. Afterwards, data about the processed tweets is sent to another Kafka topic, which is processed by dashboardConsumer. Multiple streams are thus used during processing.
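
The paper does not state how the vectors of several keywords are combined into one feature vector per tweet. One common choice, assumed here purely for illustration, is to average the word vectors. The three-dimensional `EMBEDDINGS` table below is a hypothetical stand-in for a trained word2vec model.

```python
# Hypothetical 3-dimensional embeddings standing in for a trained word2vec model.
EMBEDDINGS = {
    "server": [0.9, 0.1, 0.0],
    "outage": [0.8, 0.2, 0.1],
    "football": [0.0, 0.9, 0.3],
}
DIM = 3

def keywords_to_vector(keywords):
    """Average the vectors of all known words; unknown words are skipped."""
    vectors = [EMBEDDINGS[word]
               for keyword in keywords
               for word in keyword.split()
               if word in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM  # no known words: fall back to the zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]

tweet_vector = keywords_to_vector(["server outage", "unknownword"])
print(tweet_vector)
```

The resulting fixed-length vector is the kind of input an Isolation Forest model expects, regardless of how many keywords a tweet produced.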

Experiments

The proposed architecture was tested on a stream of tweets about Donald Trump. According to [9], the Twitter streaming API delivers about 1% of tweets to end users. During testing the input rate was 100 tweets per second.

The results of the experiments are shown in Figures 4 and 5.

Figure 4 presents a pie chart with the totals of inliers (normal tweets) and outliers (abnormal tweets). It is updated in real time as data arrives.

Figure 5 presents the saved outlier tweets with their content and links.

Conclusion

This paper reviews the creation of a real-time text stream anomaly analysis system. A scalable software architecture is proposed. The Isolation Forest algorithm is used to detect text anomalies.

Figure 3. Software architecture.

[Pie chart: inliers vs. outliers]

Figure 4. Anomalies detection.

| Tweet id | Created at | Lang | Text | Tweet url |
|---|---|---|---|---|
| 942152249786191872 | 2017-12-16T21:59:35 | en | JUST IN: Peter Strzok claims that his anti-Trump text messages didn't affect his investigation and were simply role-play for his mistress. In other words, he is lying. | http://twitter.com/4645598232/status/942152249786191872 |
| 942152248133636097 | 2017-12-16T21:59:34 | en | Wow. This is massive. How many more rules does Mueller get to break before being shut down? SOURCES/ If you still need evidence of collusion, here it is. But as | http://twitter.com/33905880/status/942152248133636097 |
| 942152248540499968 | 2017-12-16T21:59:35 | en | you read it, remember: (1) this is a *fraction* of the evidence Mueller has: (2) this only needs to make you 1% fearful Trump is guilty - as that's enough to 100% support an investigation. Anyone who thinks that Trump should fired Mueller, needs to read | http://twitter.com/851871276/status/942152248540499968 |
| 942152248121085952 | 2017-12-16T21:59:34 | en | this article. Mueller, not only is a lifelong Republican, but is an American hero in EVERY sense of the word. He genuinely wants what's good for America and is as honest as they come. | http://twitter.com/79016425/status/942152248121085952 |
| 942152246963392512 | 2017-12-16T21:59:34 | en | Remarkable | http://twitter.com/359536006/status/942152246963392512 |
| 942152245675560960 | 2017-12-16T21:59:34 | en | Senior White House Official says "Thanks, but no thanks" to divesting from her private business interests. Must be nice when the rules don't apply to you. | http://twitter.com/348780432/status/942152245675560960 |
| 942152245432463360 | 2017-12-16T21:59:34 | en | #Mueller did an end run around the Trump team. The power of the [illegible] | http://twitter.com/503334928/status/942152245432463360 |

Figure 5. List of anomalous tweets.

References

1. Apache Kafka documentation // https://kafka.apache.org/documentation

2. Big data processing using Apache Spark // https://www.udemy.com/learning-path-spark-data-science-with-apache-spark

3. Chandola V., Banerjee A., Kumar V. Anomaly Detection: A Survey // ACM Computing Surveys, Vol. 41(3), Article 15, 2009.

4. Election analyses // https://mkdev.me/posts/analiz-vyborov-v-ssha-s-pomoschyu-apache-spark-graphx

5. Isolation Forest // https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf

6. Near Real-time data processing // https://blog.cloudera.com/blog/2015/06/architectural-patterns-for-near-real-time-data-processing-with-apache-hadoop/

7. Spark Streaming documentation // https://spark.apache.org/docs/latest/streaming-programming-guide.html

8. Sarcasm Detection // https://arxiv.org/pdf/1607.00976v2.pdf

9. How Twitter Samples Tweets in Streaming API // http://blog.falcondai.com/2013/06/666-and-how-twitter-samples-tweets-in.html

10. Mikolov T., Sutskever I., Chen K. Distributed Representations of Words and Phrases and their Compositionality // https://arxiv.org/pdf/1310.4546.pdf
