

NAÏVE BAYES MODIFICATION FOR TEXT STREAMS CLASSIFICATION

Lomakina L.S., Lomakin D.V., Subbotin A.N.

Nizhny Novgorod State Technical University n.a. R.E. Alekseev

ABSTRACT

In this article a modification of the Naïve Bayes method for text stream classification is considered. A machine that classifies text streams in real time is proposed.

Keywords: classification, text streams, naïve Bayes classifier, tf-idf.

Today, the Internet has become the main source of information. Appropriate structuring of the Internet's content makes working with it more efficient and simple. One of the ways such structuring can be achieved is classification.

Solving the classification problem is important when we deal with large amounts of incoming information that are too hard to process manually, especially when these amounts arrive in streams, as with the text messages we receive from news sites. Thousands of news text messages are generated by news sites on the Internet. In order to present them in an easy-to-read and understandable form, we have to perform thematic classification, i.e. determine the topic of each message. [2]

Fig. 1. Classifier topology.

In this article a text stream is defined as text messages arriving from a source at random times.

Naïve Bayes for text classification

In the Naïve Bayes classifier each document is viewed as a set of terms, and the order of terms is irrelevant. The probability of class c given a test document d equals:

P(c|d) = P(c) · ∏_{k=1..|T_d|} P(t_k|c),

where T_d is the set of terms in document d. [4] For a set of classes c_j and a set of documents d_i, the rule that decides which class each document belongs to is computed as:

c_res = argmax_j P(c_j|d_i) = argmax_j P(c_j) · ∏_{k=1..|T_di|} P(t_k|c_j)

The class with the highest probability measure is chosen as the class to which the current document belongs.

A priori class probabilities P(c_j) are calculated as the fraction of documents of the given class in the whole training set. [6]
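As an illustration, the decision rule and prior estimation above can be sketched as follows. The toy corpus and class names are invented for the sake of a runnable example, add-one smoothing is an assumption not stated in the paper, and log-probabilities are used to avoid numeric underflow:

```python
import math
from collections import Counter, defaultdict

# Toy training data: (document terms, class label). Classes and texts
# are illustrative only, not the corpus used in the experiments.
train = [
    (["match", "goal", "team"], "sport"),
    (["election", "vote", "party"], "politics"),
    (["team", "league", "score"], "sport"),
]

# Priors P(c_j): fraction of training documents in each class.
class_counts = Counter(label for _, label in train)
priors = {c: n / len(train) for c, n in class_counts.items()}

# Term frequencies per class, with add-one smoothing for unseen terms
# (an assumption; the paper does not specify a smoothing scheme).
tf = defaultdict(Counter)
for terms, label in train:
    tf[label].update(terms)
vocab = {t for terms, _ in train for t in terms}

def p_term(t, c):
    return (tf[c][t] + 1) / (sum(tf[c].values()) + len(vocab))

def classify(terms):
    # c_res = argmax_j P(c_j) * prod_k P(t_k|c_j), computed in log space.
    scores = {c: math.log(priors[c]) + sum(math.log(p_term(t, c)) for t in terms)
              for c in priors}
    return max(scores, key=scores.get)

print(classify(["goal", "score"]))  # → sport
```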

This method classifies documents relatively fast, which meets the real-time processing requirement. The training algorithm here is the process of estimating the a priori class probabilities and the per-class probabilities of each term entering the training set. [5]

Proposed modification

In the traditional Naïve Bayes classifier, term frequency is used to estimate a term's a posteriori probability: the number of times the term occurs in documents of the class divided by the total number of terms in the class.

Fig. 2. Text stream structure.

P(t_k|c_j) = tf(t_k, c_j) / ∑_{i=1..|T|} tf(t_i, c_j),

Term frequency is a good indication of a term's importance for the topic, but this approach does not consider the term's frequency in other classes, while there are words that are widely used in more than a single sphere. This actually worsens classification quality. [5]

In the proposed approach a term's weight is calculated according to the following rules:

The more frequently a term is used in the current class's articles, the more important it is for this class.

The more frequently the term is used in all classes' articles, the less important it is for the class.

It is proposed to use the term's tf-idf metric to define its weight for the class, because it satisfies the rules mentioned above. For term t, corpus D and class c_j, tf-idf is calculated as:

tfidf(t, D, c_j) = (tf(t, c_j) / ∑_{i=1..|T|} tf(t_i, c_j)) · log(|D| / |{d ∈ D : t ∈ d}|),

where the right factor is the inverse document frequency, used to decrease the weight of commonly used words. [1]

Algorithm implementation
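A minimal sketch of this weighting scheme, on an invented mini-corpus (document contents and class names are illustrative, not the experimental data):

```python
import math
from collections import Counter, defaultdict

# Hypothetical mini-corpus: (document terms, class label).
docs = [
    (["goal", "team", "goal"], "sport"),
    (["vote", "party", "team"], "politics"),
    (["league", "goal"], "sport"),
]

# tf(t, c): term counts aggregated per class.
tf = defaultdict(Counter)
for terms, label in docs:
    tf[label].update(terms)

# df(t): number of documents in the corpus containing term t.
df = Counter()
for terms, _ in docs:
    df.update(set(terms))

def tfidf(t, c):
    # Normalised in-class frequency times inverse document frequency,
    # matching the modified weight defined above.
    tf_part = tf[c][t] / sum(tf[c].values())
    idf_part = math.log(len(docs) / df[t])
    return tf_part * idf_part

# "goal" is frequent in "sport" and rare elsewhere, so it outweighs
# "team", which occurs in articles of both classes.
print(round(tfidf("goal", "sport"), 4))  # → 0.2433
```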

In this work we developed a program that retrieves news sites' text message streams from the Internet and performs text stream classification.

[Screenshot: the program window listing incoming news items with timestamps, source URLs (rbc.ru RSS feeds) and the classes assigned to them.]
Fig. 3. Program interface.

This program uses the modified Naïve Bayes method to classify incoming news text streams. The text streams are taken straight from the Internet, more precisely from news sites that have RSS feeds. RSS is a family of XML formats used to publish brief information about the site it is placed on.

For news sites, RSS feeds always contain the latest news recently posted by the site. This includes titles, the time and date each message was posted, and an http link to the site's page containing the full text of the message. If a site is in the program's watch list, the program repeatedly downloads the site's RSS file to keep track of the latest posted news. When it finds a new message, it uses the http link to download the full html code of that message. A parsing algorithm is then applied to the html code to retrieve the full plain text of the message. This text is then filtered to get rid of stop words. Finally, the modified Naïve Bayes method is used to determine which thematic class the message belongs to.
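The RSS-processing steps above can be sketched roughly as follows, using a static sample feed in place of the live download step (the feed contents, stop-word list and helper names are invented for illustration):

```python
import re
import xml.etree.ElementTree as ET

# A static sample feed; the actual program repeatedly downloads such
# RSS files from the watched sites. Contents here are invented.
RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item>
    <title>Sample headline</title>
    <link>http://example.com/news/1</link>
    <pubDate>Mon, 01 Jun 2015 10:00:00 GMT</pubDate>
  </item>
</channel></rss>"""

STOP_WORDS = {"the", "a", "and", "of"}  # illustrative stop-word list

def parse_feed(xml_text):
    """Return (title, link, pubDate) for every <item> in the feed."""
    root = ET.fromstring(xml_text)
    return [(i.findtext("title"), i.findtext("link"), i.findtext("pubDate"))
            for i in root.iter("item")]

def tokenize(text):
    """Lower-case the text, split it into words and drop stop words."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]

items = parse_feed(RSS_SAMPLE)
print(items[0][0])                      # → Sample headline
print(tokenize("The sample headline"))  # → ['sample', 'headline']
```

The tokenized terms would then be fed to the classifier from the previous section.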

Empirical results

Algorithm efficiency was tested on a corpus of Russian newspaper message texts. Articles of the training set belong to the same corpus as the test set, but the two sets do not overlap. Classification of the test articles was performed using the traditional Naïve Bayes classifier (NBC) and the modified Naïve Bayes classifier (MNBC). In the first case term probability estimation was calculated using term frequencies; in the second case, using the terms' tf-idf metrics. The experiment was performed for training sets of 500, 1000, 2000, 3000, 4000 and 5000 texts. [4]

Classification quality was estimated using the traditional characteristics: recall, precision and the F1 measure. The results are shown in the following tables.

Table 1.

F1 score, precision and recall for the maximal training set.

        Recall   Precision   F1 score
NBC     0.6      0.6664      0.6348
MNBC    0.675    0.7385      0.7053

Table 2.

F1 score for different training set sizes.

Articles in training set   F1 score (NBC)   F1 score (MNBC)
500                        0.16             0.21
1000                       0.24             0.286
2000                       0.376            0.44
3000                       0.506            0.573
4000                       0.59             0.68
5000                       0.6348           0.7053
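For reference, the quality measures in the tables are computed from per-class true-positive, false-positive and false-negative counts; a minimal sketch (the counts below are invented, not the paper's data):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and the F1 score (their harmonic mean)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f1, 3))  # → 0.8 0.667 0.727
```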

The modified Naïve Bayes classifier has shown better results compared to the classic Naïve Bayes: the F1 score of the modified classifier reached 0.7. Graphs indicating the algorithms' efficiency are shown below.

Fig. 5. Experimental algorithm efficiency estimations.

The modified Naïve Bayes showed better results than the classic Naïve Bayes in all classification characteristics. The next graph shows the F1 score of the classification algorithms depending on training set size.


Fig. 6. F1 score for different training set sizes.

The F1 score of the modified Naïve Bayes classifier is higher than that of the classic Naïve Bayes on all available training set sizes. The maximal F1 score is reached by the modified Naïve Bayes on the maximal training set size (5000 articles) and equals 0.7053, which is 0.07 (about 11% in relative terms) higher than the F1 score of the classic NBC (0.6348) on this training set size.

Conclusion

In this article we proposed a modification of the Naïve Bayes classifier that uses the terms' tf-idf metric to estimate term class probabilities. Empirical results show that the modification increases the efficiency of the classification algorithm for all training set sizes.

As a result, software was developed that retrieves news sites' text streams from the Internet and thematically classifies them in real time.

References

1. Advanced Science and Technology Letters. Classification Scheme of Unstructured Text Document using TF-IDF and Naive Bayes Classifier. http://onlinepresent.org/proceedings/vol111_2015/50.pdf

2. Gaber M. M., Zaslavsky A., Krishnaswamy S. A Survey of Classification Methods in Data Streams // Data Streams / Ed. by Aggarwal C. C. Springer US, 2007. P. 39-59.

3. Ломакин Д.В., Ломакина Л.С., Пожидаева А.С. Вероятность. Информация. Классификация: учеб. пособие; Нижегор. гос. техн. ун-т им. Р.Е. Алексеева. — Н. Новгород, 2014. — 128 с.

4. Машинный фонд русского языка. // http://cfrl.ruslang.ru/

5. Ljunglof P., Wiren M. Syntactic Parsing // Handbook of Natural Language Processing, Second Edition. 2nd ed. / Ed. by Indurkhya N., Damerau F.J. Chapman and Hall/CRC, 2010. P. 59-92.

6. Ломакина Л.С., Суркова А.С. Информационные технологии анализа и моделирования текстовых данных. — Воронеж: Научная книга, 2015. — 208 с.
