COMPARATIVE STUDY OF NAIVE BAYES CLASSIFIERS IN BREAST CANCER DATABASE

Scientific article in the specialty "Medical technologies"

CC BY
Keywords
NAÏVE BAYESIAN / CROSS VALIDATION / BREAST CANCER / ACCURACY / SENSITIVITY / SPECIFICITY

Abstract of the scientific article on medical technologies; authors: Ruzibayev O.B., Yaxshibaev D.S.

Breast cancer is the most common invasive cancer among women, so automatic breast cancer detection systems are in demand. Pattern recognition is one of the most important and widely used classification approaches for long-standing problems of this kind across many application areas. This paper presents comparative results on a breast cancer database using two Naïve Bayesian (NB) methods. The experiments were carried out with 10-fold cross-validation. The evaluation values obtained in the experiments and a comparison of the weighted NB and the standard Naïve Bayesian classifiers are described in this article.




© Ruzibayev O.B.¹, Yaxshibaev D.S.²

Tashkent University of Information Technologies, Uzbekistan

¹ Assistant, Department of "Software of Information Technologies".

² Senior Lecturer, Department of "Higher Mathematics".


1. INTRODUCTION

In supervised classification problems, the Naive Bayesian classifier [1] is a simple and efficient probabilistic model based on Bayes' theorem. This paper discusses two well-known classification methods, Naïve Bayesian (NB) and weighted Naïve Bayesian (WNB), implements them in MATLAB on a selected breast cancer database, and analyses the experimental results.

2. NAIVE BAYES CLASSIFIER

Let $x = [x_1, x_2, \ldots, x_n]$ be a data sample with $n$ attributes whose category is unknown. Suppose that the decision attribute takes its value from the set of classes $C = \{c_1, c_2, \ldots, c_m\}$. The Bayes classifier predicts the class $c_j$ for a new instance $x$ according to the evidence provided by a set of training instances for which both the attribute values and the class values are known. The Naïve Bayesian classifier embodies an independence assumption: each attribute $x_i$ is conditionally independent of all other attributes given the class $c_j$. This yields

$$P(x \mid c_j) = P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_{i=1}^{n} P(x_i \mid c_j). \tag{1}$$

Given the conditional independence assumption made by the Naive Bayesian classifier, and since the evidence $P(x)$ is the same for every category, applying Bayes' theorem to (1) gives the decision rule:

$$C_{NB}(x) = \arg\max_{c_j \in C} P(c_j) \prod_{i=1}^{n} P(x_i \mid c_j). \tag{2}$$

2.1. Weighted naive Bayesian classifier

The conditional independence assumption of Naive Bayes is difficult to meet in practice. To relax it, different attributes can be assigned different weights, which extends Naive Bayes to the weighted Naive Bayesian classifier (WNB). The formula of the weighted NB classifier is as follows:

$$C_{WNB}(x) = \arg\max_{c_j \in C} P(c_j) \prod_{k=1}^{n} w_k P(x_k \mid c_j), \tag{3}$$

where $w_k$ denotes the weight of attribute $x_k$. The larger the weight, the greater the effect of the attribute on the classification.
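The decision rules (2) and (3) can be illustrated with a short Python sketch (the authors' own implementation is in MATLAB and is not reproduced here). Several choices below are assumptions, not taken from the paper: probabilities are estimated by counting with Laplace smoothing (`alpha`), scores are summed in log space to avoid underflow, and the weights enter as exponents, $P(x_k \mid c_j)^{w_k}$, a form commonly used for weighted Naive Bayes. The reason for the last choice is that a class-independent multiplicative factor $w_k$ rescales every class score by the same constant $\prod_k w_k$ and therefore cannot change the arg max. The function names `train_nb` and `predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def train_nb(X, y, alpha=1.0):
    """Count class and attribute-value frequencies for categorical data."""
    n_attrs = len(X[0])
    class_counts = Counter(y)
    # value_counts[c][i][v] = rows of class c whose attribute i equals v
    value_counts = {c: [Counter() for _ in range(n_attrs)] for c in class_counts}
    values = [set() for _ in range(n_attrs)]  # distinct values seen per attribute
    for row, c in zip(X, y):
        for i, v in enumerate(row):
            value_counts[c][i][v] += 1
            values[i].add(v)
    return {"class_counts": class_counts, "value_counts": value_counts,
            "values": values, "alpha": alpha, "n": len(y)}

def predict(model, x, weights=None):
    """Return argmax over classes of log P(c) + sum_k w_k * log P(x_k | c).

    With weights=None every w_k is 1, which is the plain rule (2).
    """
    if weights is None:
        weights = [1.0] * len(x)
    best_class, best_score = None, -math.inf
    for c, n_c in model["class_counts"].items():
        score = math.log(n_c / model["n"])  # log prior
        for k, v in enumerate(x):
            # Laplace-smoothed estimate of P(x_k = v | c) (an assumption)
            num = model["value_counts"][c][k][v] + model["alpha"]
            den = n_c + model["alpha"] * len(model["values"][k])
            score += weights[k] * math.log(num / den)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy data with two attributes; classes 2 (benign) and 4 (malignant)
X = [[1, 1], [1, 2], [2, 1], [9, 9], [10, 8], [9, 10]]
y = [2, 2, 2, 4, 4, 4]
model = train_nb(X, y)
print(predict(model, [1, 9]))                      # unweighted rule
print(predict(model, [1, 9], weights=[0.1, 2.0]))  # second attribute emphasised
```

On this toy data the conflicting instance `[1, 9]` is assigned to class 2 by the unweighted rule and to class 4 once the second attribute is weighted more heavily, which shows how the weights shift the decision. How the weights are chosen is not specified in the text, so they are simply passed in by hand here.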

3. DATA NORMALIZATION

To deal with attributes that are measured on different scales, the values of the parameters are first normalized (standardized) and only then are distances computed. The conversion can be done with the following formula [2]:

$$z_{ij} = \frac{x_{ij} - x_{i\,\min}}{x_{i\,\max} - x_{i\,\min}}. \tag{4}$$

Here $z_{ij}$ lies in the range $[0, 1]$, i.e. $0 \le z_{ij} \le 1$. This method of conversion is widely used. In our software, however, we used the statistical method of conversion given below:

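$$z_i = \frac{x_i - \mu}{\sigma}. \tag{5}$$

Here $\mu$ is the mean (average) of the values $x_i$ and $\sigma$ is the standard deviation from the mean:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2},$$

where $n$ is the number of points.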

After standardization the mean of the vector is equal to 0 (zero) and the standard deviation of the vector is equal to 1.
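The two conversions described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' software: the function names are hypothetical, the min-max version assumes the usual denominator (maximum minus minimum), and the standard deviation divides by $n$ rather than $n - 1$, which is also an assumption.

```python
import math

def min_max(values):
    """Rescale values to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    # Assumes hi > lo; a constant attribute would need special handling.
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit standard deviation (z-score)."""
    n = len(values)
    mu = sum(values) / n
    # Population standard deviation (divides by n); assumes sigma > 0.
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]

column = [1, 5, 10]          # one attribute of a toy dataset
print(min_max(column))       # smallest value maps to 0, largest to 1
print(standardize(column))   # result has mean 0 and standard deviation 1
```

Each attribute (column) would be converted separately, so that attributes with large numeric ranges do not dominate the others.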

4. CROSS VALIDATION

Cross-validation is based on the principle that testing an algorithm on data it has not seen during training yields a better estimate of its performance [3]. Most real applications have only a limited amount of data, so the available dataset is split into a training sample and a validation sample.

4.1. Pseudocode of the classifier module using k-fold cross-validation

INPUT: SplitLists (the dataset divided into k splits)
OUTPUT: Accuracy of the classifier
BEGIN
{
    SET noOfSplits to k
    SET correctClassificationCount to 0
    FOR each split from 1 to noOfSplits
    {
        CALL SetTrainingSetAndTestSet with SplitLists
        CALL TrainClassification with TrainingSet
        FOREACH test instance Ti in TestSet
        {
            CALL ClassifyTestData with Ti
            IF classification = Ti.Class THEN
                INCREMENT correctClassificationCount
            ENDIF
        }
    }
    SET Accuracy to correctClassificationCount / total number of test instances
}
END
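A runnable Python counterpart of the pseudocode is sketched below. It is an illustration under stated assumptions rather than the authors' MATLAB code: the splits are formed by a plain seeded shuffle (not stratified), the classifier is passed in as two hypothetical callables `train` and `classify`, and a trivial majority-class classifier stands in for the Naive Bayes models so that the block is self-contained.

```python
import random
from collections import Counter

def k_fold_accuracy(X, y, train, classify, k=10, seed=0):
    """Fraction of instances classified correctly when each of the k
    folds is used exactly once as the test set."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal splits
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        # Train once per fold on everything outside the test fold
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            if classify(model, X[i]) == y[i]:
                correct += 1
    return correct / len(X)

# Placeholder classifier (an assumption): always predicts the majority class
def train_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]

def classify_majority(model, x):
    return model

X = [[v] for v in range(10)]
y = [2] * 8 + [4] * 2
print(k_fold_accuracy(X, y, train_majority, classify_majority, k=5))
```

Any pair of functions with the same signatures, for example a Naive Bayes trainer and predictor, could be substituted for the placeholder.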

5. EXPERIMENTAL RESULTS

Figure 1 shows a block diagram of the experiment on the breast cancer database using Naïve Bayes classifiers. The overall procedure can be viewed as three stages.

Fig. 1. Block diagram of overall Naïve Bayes classifiers

Step 1: The database used for the experiments was downloaded from the repository web site http://archive.ics.uci.edu. It consists of 699 objects described by 9 features and belonging to two classes. Its attributes are presented in Table 1.

Step 2: Two classifiers are applied: (1) the Naive Bayesian algorithm and (2) the weighted Naive Bayesian algorithm.

Step 3: Performance measures are used to evaluate the classification efficiency in terms of correct classification rate (%), sensitivity and specificity.

The comparative performance of the NB and weighted NB classification methods is evaluated with sensitivity, specificity and accuracy. True Negative (TN) is the number of instances correctly identified as benign; False Negative (FN) is the number of instances incorrectly identified as benign; True Positive (TP) is the number of instances correctly identified as malignant; False Positive (FP) is the number of instances incorrectly identified as malignant. Classification accuracy, sensitivity and specificity are defined from the confusion matrix shown in Table 2. A confusion matrix records the numbers of correct and incorrect predictions made by a classification system. Sensitivity is the percentage of actual positives that are correctly classified. Specificity is the percentage of actual negatives that are correctly classified. Accuracy is the percentage of all instances that are correctly classified.

Table 1

Attributes and values of the breast cancer database

№    Attribute                      Value
1    Clump thickness                1-10
2    Uniformity of cell size        1-10
3    Uniformity of cell shape       1-10
4    Marginal adhesion              1-10
5    Single epithelial cell size    1-10
6    Bare nuclei                    1-10
7    Bland chromatin                1-10
8    Normal nucleoli                1-10
9    Mitoses                        1-10
10   Class                          2 (Benign) or 4 (Malignant)

$$\text{Sensitivity (\%)} = \frac{TP}{TP + FN} \cdot 100,$$

$$\text{Specificity (\%)} = \frac{TN}{TN + FP} \cdot 100,$$

$$\text{Accuracy (\%)} = \frac{TP + TN}{TP + FP + TN + FN} \cdot 100.$$

Table 2

Confusion matrix for obtained results

Classified    Actual positive    Actual negative
Positive      TP                 FP
Negative      FN                 TN

The accuracy was obtained with stratified 10-fold cross-validation, which was used for both training and validation of the algorithms. The layout of the confusion matrix is given in Table 2. Table 3 compares the confusion-matrix counts and the calculated sensitivity, specificity and accuracy of the two classifiers.

Table 3

Accuracy comparison results

Classifier              TP     FP    TN     FN    Senst    Spec     Acc
Naïve Bayes             431    13    232    7     98.4     94.69    97.07
Weighted Naïve Bayes    223    8     450    2     99.11    98.25    98.54

Acc - Accuracy (%), Senst - Sensitivity (%), Spec - Specificity (%).
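As a check, the percentages in Table 3 can be recomputed from the four confusion-matrix counts. The short Python sketch below does this; the function name `evaluate` is hypothetical, and accuracy is taken over all instances, TP + FP + TN + FN.

```python
def evaluate(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) in percent, rounded to 2 places."""
    sensitivity = 100.0 * tp / (tp + fn)   # share of actual positives found
    specificity = 100.0 * tn / (tn + fp)   # share of actual negatives found
    accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    return round(sensitivity, 2), round(specificity, 2), round(accuracy, 2)

print(evaluate(431, 13, 232, 7))  # Naïve Bayes row          -> (98.4, 94.69, 97.07)
print(evaluate(223, 8, 450, 2))   # Weighted Naïve Bayes row -> (99.11, 98.25, 98.54)
```

Both rows sum to 683 classified instances, and the recomputed values agree with the figures reported in the table.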

6. CONCLUSIONS

In conclusion, this paper compared two Naïve Bayes algorithms, Naïve Bayesian and Weighted Naïve Bayesian, on the breast cancer database from the University of California repository archive.ics.uci.edu, with the results given in Table 3. In this experiment the Weighted Naïve Bayesian algorithm gave better results than the standard Naïve Bayesian algorithm on all three measures, with an accuracy of 98.54% against 97.07%.

References:

1. Epanechnikov V.A. Non-parametric estimation of a multidimensional probability density // Probability theory and its application. 1969. Vol. 14, No 1. P. 156-161.

2. Gaydyshev I.P. Analysis and data processing: a special directory / I.P. Gaydyshev. - SPb.: Peter, 2001. - 762 p.

3. Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection // The Fourteenth International Joint Conference on Artificial Intelligence: proceedings. San Mateo, CA, 1995. No 2. P. 1137-1143.

4. Orlov A.I. Non-numerical statistics. M.: MZ-Press, 2004.
