CC BY
UDC 004.912: 004.632

V.V. LYTVYNOV*, O.P. MOYSEENKO*

THE MECHANISMS OF TEACHING AND EVALUATION OF THE QUALITY OF PERFORMANCE OF THE TEXT DOCUMENTS CLASSIFIER

*Chernihiv National University of Technology, Chernihiv, Ukraine

Abstract. The paper describes the mechanisms for training the classifier and evaluating the quality of its performance in a system, currently under development, for the automated processing of large volumes of textual information. The classifier is based on the free LibSVM software library and the support vector machine method. The system performs search, classification, categorization and clustering of text documents in response to user queries.

Keywords: classification, categorization, clustering, processing of text documents.

1. Introduction

The classification (thematic categorization) of electronic natural-language documents, i.e. the assignment of a text's content to one or several thematic sections, is an important task today because of the continuous growth of stored and transmitted text data.

In theory, solving the document classification task presupposes a set of electronic documents $D = \{d_i\}$ that has to be divided into several non-intersecting, thematically homogeneous subsets (classes, $C$), and requires determining the class to which each document to be processed should be assigned [1].

$C = \{C_i\}, \quad \bigcup_i C_i = D, \quad C_i \cap C_j = \varnothing \ (i \neq j).$ (1)

2. Problem statement

The objects of the research are:

• a relatively large text collection of several hundred documents, divided in advance by content into thematic groups (classes/sub-collections);

• the mechanisms of text data analysis in a system for processing natural-language documents.

The tools are:

• the developed system for processing multilingual, dynamic flows of text data, based on the support vector machine (SVM) algorithm [2] as implemented in the free LibSVM library;

• the SVM implementation in the Machine Learning (Support Vectors) module of STATISTICA 8.0, a product of the StatSoft company.

© Lytvynov V.V., Moyseenko O.P., 2014

ISSN 1028-9763. Математичні машини і системи, 2014, № 4


The theoretical and practical experiments make it possible to investigate more thoroughly how the classifier is trained and tested in the system "Processing of high-speed information flows of text data".

3. Problem solution

Taking the support vector machine method (Fig. 1) as an example, the model of the text document classifier can be written as:

$R = \langle D, C, F, R_c, f \rangle$, (2)

where $D$ is the set of documents to be classified;

$C$ is the set of thematic rubrics (classes), $C = \{C_i\}$, $i = 1..N_C$, where $N_C$ is the number of possible rubrics;

$F$ is the set of rubric descriptions; each class $C_i$ has its own distinctive description $F_i$;

$R_c$ is a relation on $C \times F$ that ensures each rubric has a single description: $\forall C_i \in C \ \exists F_i \in F : (C_i, F_i) \in R_c$;

$f$ is the function $\exists d \in D : f(d) = C_d \subset C, \ |C_d| > 1$, i.e. the process of classifying the objects $d \in D$, which establishes that a specific document $d$ corresponds to one of the descriptions $F_i$ and assigns it to the rubric $C_i$. According to this function, a document can be assigned to several thematic rubrics at the same time. To minimize the number of such cases, the classifier has to be properly trained before use.

The research uses Reuters-21578 [3], a collection of short English-language financial and stock-market documents from the news agency of the same name that is popular in text classification tasks. As the name suggests, the collection consists of 21 578 documents. Some of the documents are marked as not properly categorized, which is why only 12902 documents are used in practice. The corpus is available as both txt and xml files. The same agency later released a larger, separate collection of categorized documents, RCV1 (Reuters Corpus Volume 1) [4]. In its original form, Reuters-21578 includes 135 thematic rubrics, 56 names of organizations, 267 persons and 175 geographical names. The documents are collected in 21 xml files and are presented in the following way:

<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5545" NEWID="2">
<DATE> date of publication </DATE>
<TOPICS/>
<PLACES>
<D> location </D>
</PLACES>
<PEOPLE/>
<ORGS/>
<EXCHANGES/>
<COMPANIES/>
<UNKNOWN> more information </UNKNOWN>
<TEXT>
<TITLE> topic </TITLE>
<DATELINE> origin </DATELINE>
<BODY> text </BODY>
</TEXT>
</REUTERS>

Fig. 1. General scheme of the SVM classifier, where the dots are the vector representations of two thematically different subsets of classified NL documents; k is a kernel function that makes it possible to separate the thematic classes so that a separating plane can be drawn; w is the support vector built on the bordering documents; ξ is the error variable introduced to assess the classifier; b is the distance between the separating plane and the origin

To train the SVM classifier and evaluate the quality of its performance, the ModApte split was used. It divides the documents of the Reuters-21578 collection into a training subset of 9603 documents (74% of the total) and a subset of 3299 documents (26% of the total) for testing the chosen machine learning method. The ModApte split is recommended for comparing the results of several classifiers.
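The split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-document sample and the `<COLLECTION>` wrapper element are hypothetical, and the selection rule (TOPICS="YES" combined with LEWISSPLIT="TRAIN" or "TEST") is the usual ModApte convention rather than code taken from the system described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the <COLLECTION> wrapper is an assumption made only
# so that the snippet parses as one well-formed XML document.
SAMPLE = """<COLLECTION>
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
  <TOPICS><D>cocoa</D></TOPICS>
  <TEXT><TITLE>t1</TITLE><BODY>cocoa exports rose</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="YES" LEWISSPLIT="TEST" NEWID="2">
  <TOPICS><D>grain</D></TOPICS>
  <TEXT><TITLE>t2</TITLE><BODY>wheat harvest fell</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" NEWID="3">
  <TOPICS/>
  <TEXT><TITLE>t3</TITLE><BODY>unlabelled story</BODY></TEXT>
</REUTERS>
</COLLECTION>"""


def modapte_split(root):
    """Return (train, test) lists of (NEWID, topics, body) tuples."""
    train, test = [], []
    for doc in root.iter("REUTERS"):
        # Assumed ModApte rule: only documents with TOPICS="YES" are used.
        if doc.get("TOPICS") != "YES":
            continue
        topics = [d.text for d in doc.findall("TOPICS/D")]
        body = doc.findtext("TEXT/BODY", default="")
        record = (doc.get("NEWID"), topics, body)
        if doc.get("LEWISSPLIT") == "TRAIN":
            train.append(record)
        elif doc.get("LEWISSPLIT") == "TEST":
            test.append(record)
    return train, test


train_docs, test_docs = modapte_split(ET.fromstring(SAMPLE))
print(len(train_docs), len(test_docs))
```

On the full collection the same rule would be expected to give the 9603 and 3299 documents mentioned above.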

4. Experimental part

The system under development is based on one of the many existing implementations of the support vector machine method, namely the free LibSVM library, which supports nonlinear kernels [8, 9]. This library was preferred to LibLinear, a fast linear SVM classifier by the same developers, because the system works with small text corpora and because linearly inseparable cases may arise when the document collections are changed to include documents in other natural languages. How the SVM algorithm is implemented in STATISTICA is not known. The available software version contains one module that implements this method for classification and categorization tasks on arbitrary text corpora.

The quality of the classifier's work depends on the correct representation of the processed documents in the form of a vector model [10, 11]. In this model, each document of the collection is represented as a set of terms (words, word combinations, numbers and other elements of which the document consists). In accordance with Zipf's laws, each term of the collection can be given a weight that shows how important the term is for characterizing the document. To represent a document in the vector space, the weights of all terms of the collection with respect to this document are written out. The dimension of the document vector is equal to the total number of terms extracted from the collection.

$d_j = (w_{1j}, w_{2j}, \ldots, w_{nj})$, (3)

where $d_j$ is the vector representation of document $j$, $w_{ij}$ is the weight of term $i$ in document $j$, and $n$ is the total number of terms.

With this representation, documents can be compared by computing the distance between their vectors (Euclidean or Mahalanobis distance). The smaller the distance, the more likely the documents are thematically similar.
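As a small illustration of this comparison, the sketch below computes the Euclidean distance between document vectors. The three vectors are made-up weights over a hypothetical three-term vocabulary, not data from the experiments.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two document vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical term-weight vectors over the same three terms.
d1 = [1.0, 0.0, 2.0]
d2 = [1.0, 1.0, 2.0]
d3 = [0.0, 3.0, 0.0]

# d1 is closer to d2 than to d3, so d1 and d2 are the likelier thematic match.
print(euclidean(d1, d2), euclidean(d1, d3))
```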

In the system for automatic processing of text data flows based on the LibSVM library, the following kernel functions, which make linear separation of the classified objects possible, are available:


<Label> - kernel identifier. Examples of functions:

0 - linear: $K(x, w) = \operatorname{sign}(\langle x \cdot w \rangle)$;

1 - polynomial: $K(x, x') = (x \cdot x')^d$;

2 - radial basis function: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, for $\gamma > 0$;

3 - sigmoid: $K(x, x') = \tanh(\kappa \, x \cdot x' + c)$, for $\kappa > 0$, $c < 0$,

where $K$ is the kernel function, $x \cdot x'$ is the scalar product of vectors, $\varphi$ is the mapping of a vector from the feature space $R^n$ into another space, $d$ is the degree, $\kappa$ and $c$ are parameters, and $w$ are the weights of the features.

<Index1>:<Value1>

<Index2>:<Value2>

Index - number of the vector coordinate, Value - value of that coordinate.
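The kernel list and the index-value notation can be illustrated with a short sketch. It is not the LibSVM code itself: the linear kernel is taken here as the plain scalar product, the default parameter values are arbitrary, and the helper `to_sparse` with 1-based indices and skipped zeros is a hypothetical formatter added for illustration.

```python
import math


def dot(x, y):
    """Scalar product of two dense vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear(x, y):
    # Assumption: the linear kernel is the plain scalar product.
    return dot(x, y)


def polynomial(x, y, d=2):
    return dot(x, y) ** d


def rbf(x, y, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid(x, y, kappa=0.1, c=-1.0):
    return math.tanh(kappa * dot(x, y) + c)


# Kernel identifiers 0..3 in the order used in the list above.
KERNELS = {0: linear, 1: polynomial, 2: rbf, 3: sigmoid}


def to_sparse(vector):
    """Hypothetical helper: write a vector as index:value pairs.

    Indices are 1-based and zero coordinates are skipped (an assumption).
    """
    return " ".join(f"{i}:{v:g}" for i, v in enumerate(vector, start=1) if v != 0)


x, y = [1.0, 2.0], [3.0, 4.0]
for ident, kernel in KERNELS.items():
    print(ident, kernel(x, y))
print(to_sparse([0.5, 0.0, 2.0]))
```

Keeping the kernels in a dictionary keyed by identifier makes it easy to apply them one after another to the same data.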

There are several standard ways of determining the weight of a term in a document:

a) Boolean weight - 1 if the term occurs in the document and 0 if it does not;

b) Term Frequency (TF) - the frequency of occurrence of the term in the document;

c) Term Frequency - Inverse Document Frequency (TF-IDF) - the frequency of occurrence of the term in the document multiplied by a quantity inversely related to the number of documents in which the term occurs;

d) Pointwise Mutual Information (PMI) - with all negative weights replaced by zero.

To keep the experiment clean, the developed system determines term weights with the TF-IDF method, in the same form as it is used in the STATISTICA software [8]:
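The first three weighting schemes can be sketched as follows. The toy collection is hypothetical, and the TF-IDF variant shown is the common textbook form tf · log(N / df), which is an assumption and not necessarily the exact formula used in the system; PMI is omitted.

```python
import math
from collections import Counter

# Hypothetical toy collection of tokenized documents.
docs = [
    "oil prices rise".split(),
    "oil exports fall".split(),
    "wheat prices rise rise".split(),
]
vocab = sorted({term for doc in docs for term in doc})


def boolean_weights(doc, vocab):
    """1 if the term occurs in the document, 0 otherwise."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]


def tf_weights(doc, vocab):
    """Raw number of occurrences of each term in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def tfidf_weights(doc, vocab, collection):
    """Textbook TF-IDF: tf * log(N / df); an assumed variant."""
    counts = Counter(doc)
    n_docs = len(collection)
    weights = []
    for term in vocab:
        df = sum(1 for other in collection if term in other)
        weights.append(counts[term] * math.log(n_docs / df) if df else 0.0)
    return weights


print(vocab)
print(boolean_weights(docs[2], vocab))
print(tf_weights(docs[2], vocab))
print(tfidf_weights(docs[2], vocab, docs))
```

A term that occurs in a single document, such as "wheat", receives a higher TF-IDF weight than an equally frequent term that occurs in several documents.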

$w_{ki} = \dfrac{(1 + \log N_{ki}) \log(|D| / N_k)}{\sqrt{\sum_{s \neq k} (\log N_{si} + 1)^2}}$, (4)

where $N_{ki}$ is the number of occurrences of term $k$ in document $i$, $N_k$ is the number of occurrences of term $k$ in all documents, and $|D|$ is the number of documents in the collection.

To take into account possible linear inseparability of the classified objects, an error variable $\xi_i \ge 0$ is introduced into the equation of the hyperplane that separates the classes of documents in the space $D$:

$y_i (w \cdot d_i - b) \ge 1 - \xi_i$, (5)

where $y_i$ is equal to 1 if the vector $d_i$ belongs to the rubric of interest and to -1 if it does not;

$w$ is the support vector;

$b$ is the boundary value of the distance between the separating hyperplane and the origin;

$w \cdot d_i > b \Rightarrow y_i = 1$; $w \cdot d_i < b \Rightarrow y_i = -1$.

If $\xi_i = 0$, there is no error on the document $d_i$. If $\xi_i > 1$, the document is classified with an error. If $0 < \xi_i < 1$, the object $d_i$ falls within the band of the separating plane.
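The three cases can be checked numerically. In the sketch below the error variable is taken as the smallest non-negative value that satisfies the constraint, i.e. max(0, 1 - y(w·d - b)); the weight vector, the offset and the documents are hypothetical.

```python
def slack(w, b, d, y):
    """Smallest error variable satisfying y * (w.d - b) >= 1 - xi, xi >= 0."""
    margin = y * (sum(wi * di for wi, di in zip(w, d)) - b)
    return max(0.0, 1.0 - margin)


def describe(xi):
    """Map the error variable to the three cases described in the text."""
    if xi == 0:
        return "no error"
    if xi < 1:
        return "inside the band of the separating plane"
    # xi == 1 (a point exactly on the hyperplane) is not covered by the
    # text; it is grouped with errors here as an assumption.
    return "error"


w, b = [1.0, 0.0], 0.0  # hypothetical hyperplane
for d, y in [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([-1.0, 0.0], 1)]:
    xi = slack(w, b, d, y)
    print(d, xi, describe(xi))
```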

The task of training the classifier is to optimize the separating plane, which is done using the method of Lagrange multipliers [13]:

if a relative minimum of the original objective function under the constraints is attained at a point $x$, then there exists a set of multipliers $\lambda_i$ such that the derivatives of the new objective function with respect to $x$ are equal to 0 at the same point $x$, and the minimum of the new objective function is attained there, now globally over all $x$. For each $\lambda_i$ the following holds:

either $\lambda_i$ is equal to 0 and the corresponding constraint is not active, or $\lambda_i$ is not equal to 0 and the corresponding constraint is satisfied as an equality.

Formulated in terms of the Lagrange method, the task is to find the minimum over $w$, $b$, $\xi$ and the maximum over $\lambda$ of the function:

$\frac{1}{2} w \cdot w + C \sum_i \xi_i - \sum_i \lambda_i \left( \xi_i + y_i (w \cdot d_i - b) - 1 \right)$ for $\xi_i \ge 0$, $\lambda_i \ge 0$. (6)
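The step from this function to the final hyperplane equation can be written out explicitly. The short derivation below is a sketch: it assumes the function has the standard soft-margin form, restated in the first line, and uses only the stationarity conditions with respect to $w$ and $b$.

```latex
\begin{aligned}
L(w, b, \xi, \lambda) &= \tfrac{1}{2}\, w \cdot w + C \sum_i \xi_i
  - \sum_i \lambda_i \bigl( \xi_i + y_i (w \cdot d_i - b) - 1 \bigr), \\
\frac{\partial L}{\partial w} &= w - \sum_i \lambda_i y_i d_i = 0
  \;\Longrightarrow\; w = \sum_i \lambda_i y_i d_i, \\
\frac{\partial L}{\partial b} &= \sum_i \lambda_i y_i = 0, \\
w \cdot d - b = 0 &\;\Longrightarrow\; \sum_i \lambda_i y_i \, d_i \cdot d - b = 0 .
\end{aligned}
```

Only documents with $\lambda_i > 0$ contribute to the last sum, so the separating hyperplane is determined by those documents alone.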

If $\lambda_i > 0$, the document $d_i$ of the training collection is called a support vector. After these manipulations, the equation of the optimized separating hyperplane looks as follows:

$\sum_i \lambda_i y_i \, d_i \cdot d - b = 0$, (7)

where $d$ is the document to be categorized.
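A minimal sketch of how a new document could be categorized with this equation is given below. The multipliers, labels, support documents and offset are hypothetical values chosen by hand for a two-dimensional toy case, and the rule "positive side of the hyperplane means the rubric of interest" is an assumption.

```python
def dot(u, v):
    """Scalar product of two dense vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(support, d, b, kernel=dot):
    """Left-hand side of the hyperplane equation for document d.

    support is a list of (lambda_i, y_i, d_i) triples with lambda_i > 0.
    """
    return sum(lam * y * kernel(d_i, d) for lam, y, d_i in support) - b


def classify(support, d, b):
    """Return 1 if d lies on the positive side of the hyperplane, else -1."""
    return 1 if decision_value(support, d, b) > 0 else -1


# Hypothetical toy model: two support documents, one per class.
support = [(0.25, 1, [2.0, 2.0]), (0.25, -1, [0.0, 0.0])]
b = 1.0

print(decision_value(support, [2.0, 2.0], b))  # on the margin of class +1
print(classify(support, [3.0, 1.0], b))
print(classify(support, [0.5, 0.5], b))
```

Passing a different `kernel` function in place of the scalar product gives the nonlinear variants without changing the rest of the code.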

For a numerical evaluation of the classification by both systems, the traditional set of metrics for this task was used: Accuracy (A), Precision (P) and Recall (R).

The first metric gives the general picture of the classifier's performance: it is the ratio of the documents correctly assigned by the classifier to the total number of documents.

$A = \frac{M}{N} \cdot 100\%$, (8)

where $M$ is the number of correctly classified documents and $N$ is the total number of documents.

Precision is the ratio of the documents correctly assigned to a particular class to all documents assigned to that class.

$P = \frac{TP}{TP + FP} \cdot 100\%$. (9)

Recall is the ratio of the documents correctly assigned to a particular class to all documents of the test sample that belong to that class.

$R = \frac{TP}{TP + FN} \cdot 100\%$. (10)

The precision and recall formulas are based on contingency tables compiled for each of the possible classes.

Table 1. Variant of the classifier evaluation

Classifier's evaluation | Expert's evaluation: True | Expert's evaluation: False
True | TP (true-positive) | FP (false-positive)
False | FN (false-negative) | TN (true-negative)

Recall and precision are calculated separately and are not combined into the popular F-measure (11), which gives a generalized assessment of the classifier's performance.


$F = 2 \, \frac{P \cdot R}{P + R} \cdot 100\%$. (11)
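The metrics above can be computed from a list of expert labels and a list of classifier labels as sketched below. The label lists are hypothetical; precision and recall are returned in percent, so the F-measure is computed here as 2PR/(P + R) without a further factor of 100, which is an assumption about how formula (11) is meant to be applied.

```python
def accuracy(expert, predicted):
    """Share of correctly classified documents, in percent."""
    correct = sum(1 for e, p in zip(expert, predicted) if e == p)
    return 100.0 * correct / len(expert)


def contingency(expert, predicted, cls):
    """Return (TP, FP, FN, TN) for one class."""
    tp = fp = fn = tn = 0
    for e, p in zip(expert, predicted):
        if p == cls and e == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif e == cls:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure, all in percent."""
    p = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    r = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical labels for six test documents.
expert = ["grain", "grain", "oil", "oil", "oil", "grain"]
predicted = ["grain", "oil", "oil", "oil", "grain", "grain"]

tp, fp, fn, tn = contingency(expert, predicted, "grain")
print(accuracy(expert, predicted))
print(tp, fp, fn, tn)
print(precision_recall_f(tp, fp, fn))
```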

5. Results

After training and test categorization, the classifiers of the tested systems showed the following results.

For the test corpus of 3299 documents from the Reuters-21578 collection, the developed system based on the free LibSVM library and the STATISTICA software product produced the scores given in Table 2.

Table 2. Results of the evaluation

System | Accuracy, % | Precision, % | Recall, %
Developed | 93 | 80 | 94
STATISTICA | 89 | 75 | 75

The table gives the average values of the metrics for the developed system, obtained by applying one after another the kernel functions listed above and implemented in the LibSVM library. The classifier based on the support vector machine algorithm in the StatSoft product automatically determines the most suitable kernel function for the objects being classified, so the figures obtained for it are regarded as both average and optimal for this classifier.

6. Conclusions

The classifier of the developed system for processing text data flows, based on the free LibSVM library, showed better results than the Machine Learning (Support Vectors) module of STATISTICA. This may be due both to differences in the approaches to text processing (markup, normalization) and to the choice of kernel function. It is planned to improve the evaluation of the classifier's performance on mixed collections.

REFERENCES

1. Joachims T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features / T. Joachims // Proc. of ECML-98, 10th European Conference on Machine Learning. - Dortmund, 1998. - P. 137 - 142.

2. Vapnik V.N. Estimation of Dependences Based on Empirical Data / V.N. Vapnik. - Moscow: Nauka, 1979. - 448 p.

3. Reuters document collection [Electronic resource]. - Available at: http://ronaldo.cs.tcd.ie/esslli07/data/reuters21578-xml.

4. Reuters document collection [Electronic resource]. - Available at: http://www.ai.mit.edu/proiects/imlr/papers/volume5/lewis04a/lvrl2004 rcv1v2 README.htm.

5. ROMIP news collection [Electronic resource]. - Available at: http://romip.ru/ru/collections/news-collection.html.

6. Zipf's empirical laws [Electronic resource]. - Available at: http://artprom.net/article/read/zakon Zipf.html.

7. Kunyaev N.N. Confidential Records Management and Secure Electronic Document Flow / N.N. Kunyaev, A.S. Demushkin, A.G. Fabrichnov. - Moscow: Logos, 2011. - 452 p.

8. LibSVM library [Electronic resource]. - Available at: http://www.csie.ntu.edu.tw/~cilin/libsvm.

9. Lytvynov V.V. SVM in the classification of multilingual texts / V.V. Lytvynov, O.P. Moyseenko // Vestnik ChNTU. - 2013. - N 4. - P. 59 - 64.

10. Vector model of a document collection [Electronic resource]. - Available at: http://www.machineleaming.ru/wiki/index.php?title=Векторная модель.


11. Naylor C. How to Build Your Own Expert System / C. Naylor. - Moscow: Energoatomizdat, 1991. - 286 p.

12. Borovikov V.P. The STATISTICA Program for Students and Engineers / V.P. Borovikov. - [2nd ed.]. - Moscow: KompyuterPress, 2001. - 301 p.

13. Lifshits Yu. Support vector machines (slides): lecture N 7 of the course "Algorithms for the Internet" [Electronic resource]. - Available at: vurv.name/intemet/07iah,pdf.

14. Krulkevich M.I. Information Activity in Organizations / M.I. Krulkevich, E.M. Synkova. - Donetsk: DonNU of Ukraine, 2001. - 176 p.

Received by the editorial office on 20.08.2014
