Scientific article on the topic "Extracting the hidden regularities on latent features by using interval methods in pattern recognition problems" (specialty: Computer and information sciences)

CC BY


Section 3. Information technology


Mattiev Jamolbek, Urgench State University, Teacher in Informatics, the Faculty of Physics and Mathematics. E-mail: jamolbek_1992@mail.ru

Matlatipov Sanatbek, Urgench State University, Student, the Faculty of Physics and Mathematics. E-mail: mr.sanatbek@gmail.com

Extracting the hidden regularities on latent features by using interval methods in pattern recognition problems

Abstract: This article proposes a numerical algorithm for selecting the optimal boundaries of intervals of feature values of classified objects. The algorithm is invariant to the scale of measurement and can be used to search for latent (not explicitly measurable) features in databases in order to model the intuitive decision-making process.

Keywords: partition into intervals, the optimal values of interval boundaries, estimation of complexity of the algorithm.

Introduction

Partitioning the values of quantitative features into intervals is widely applied in various data analysis algorithms. In applied statistics, the values of a quantitative feature are usually divided into equal intervals, with the number of intervals given in advance. The task of partitioning into intervals has been considered in the theory of pattern recognition with supervised learning [5, 2-3].
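For contrast with the class-aware criteria discussed later, the equal-interval partition mentioned above can be sketched as follows. This is only an illustration: the function names and the sample values are hypothetical and are not taken from the article.

```python
def equal_width_edges(values, k):
    """Return k + 1 boundaries that split [min, max] into k equal intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, no partition is possible")
    step = (hi - lo) / k
    return [lo + step * i for i in range(k + 1)]


def interval_index(x, edges):
    """Index of the interval containing x (the last interval includes its right end)."""
    k = len(edges) - 1
    if x >= edges[-1]:
        return k - 1
    width = (edges[-1] - edges[0]) / k
    return int((x - edges[0]) / width)


values = [90, 140, 150, 220]                      # hypothetical feature values
edges = equal_width_edges(values, 2)              # boundaries ignore class labels
print(edges, [interval_index(x, edges) for x in values])
```

Because the boundaries depend only on the range of the values, such a partition takes no account of the class labels of the objects.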

Numerical optimization methods make it possible to select model parameters for which the recognition algorithm makes the fewest errors on a given training set. Increasing the complexity of the model is not always beneficial, because "optimal" algorithms begin to adapt too closely to the specific data, including the measurement errors in the training data and the errors of the model itself.

In the theory of artificial neural networks (ANN), the complexity of a model is expressed in terms of generalization ability. ANN algorithms are required not only to solve the problem on the training set but also to make good decisions on objects that were not seen during training. The development of new data mining methods serves these goals: such methods make it possible to obtain and use new knowledge about the problem and to improve the accuracy of ANN algorithms on any admissible objects [3, 2-4].

In [3, 3-5] the task of splitting the feature values of classified objects into intervals is formulated as a deterministic one. The criterion of the method is based on checking the following hypothesis: "There is a partition in which each interval contains all the feature values of objects of the same class". Under this hypothesis the number of intervals must be equal to the number of classes. The validity of the hypothesis is checked by a computational experiment.

The algorithm described in this article is invariant to the scale of measurement. It can be used for:

- searching for latent (not explicitly measurable) features in databases in order to model the intuitive decision-making process;

- extracting a set of informative features of different types.

Preprocessing of the data is proposed to reduce the number of calculations.

Statement of the problem

We consider the recognition problem in the standard formulation. A set of objects $E_0 = \{S_1, \dots, S_m\}$ is given, containing representatives of $l$ disjoint classes $K_1, \dots, K_l$. The objects are described by a set of $n$ quantitative features $X_n = (x_1, \dots, x_n)$.

It is required to split the quantitative features into intervals according to two criteria and to compare the results.

To solve this problem, the following expression is used as the first criterion for splitting quantitative features into domination intervals [4, 10-12]:

$$\left| \frac{d_t(u,v)}{|E_0 \cap K_t|} - \frac{\bar{d}_t(u,v)}{|E_0 \cap CK_t|} \right| \to \max. \qquad (1)$$

In this criterion, $[u, v]$ is a domination interval; $d_t(u,v)$ is the number of objects of class $K_t$ in the interval $[u, v]$; $\bar{d}_t(u,v)$ is the number of objects of the complement $CK_t$ in the same interval; $|E_0 \cap K_t|$ and $|E_0 \cap CK_t|$ are the numbers of objects in the given data that belong to class $K_t$ and to its complement respectively, $t = 1, \dots, l$. The criterion reaches its maximum when each interval contains objects of one class only. The values of the membership function are found by the following formula:

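$$f(u,v) = \frac{d_1(u,v)/|K_1|}{d_1(u,v)/|K_1| + d_2(u,v)/|K_2|}. \qquad (1')$$

Here $d_1(u,v)$ and $d_2(u,v)$ are the numbers of objects of classes $K_1$ and $K_2$ in the interval $[u, v]$, and $|K_1|$, $|K_2|$ are the numbers of objects of these classes in the data.

It is recommended to use (1'') to estimate the stability with the help of (1'):

$$\frac{1}{m} \sum_{[u,v]} \begin{cases} f(u,v)\,(v - u + 1), & f(u,v) > 0.5, \\ (1 - f(u,v))\,(v - u + 1), & f(u,v) < 0.5, \end{cases} \qquad (1'')$$

where the sum is taken over the domination intervals $[u, v]$ of the feature and $m$ is the number of objects.

The stability takes values in the interval [0.5, 1].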
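The quantities above can be illustrated with a short sketch. It reads criterion (1) as the absolute difference between the share of class $K_t$ objects and the share of all other objects that fall into the interval, and it treats $u$, $v$ as positions in the sequence of objects sorted by the feature. Both readings, the function names and the toy labels are assumptions of this sketch, not the authors' implementation.

```python
from typing import Sequence, Tuple


def domination_score(labels: Sequence[int], u: int, v: int, t: int) -> float:
    """Value of criterion (1) for positions u..v (0-based, inclusive).

    labels[i] is the class of the i-th object after sorting by the feature.
    """
    inside = labels[u:v + 1]
    d_t = sum(1 for c in inside if c == t)      # objects of K_t inside [u, v]
    d_rest = len(inside) - d_t                  # objects of CK_t inside [u, v]
    n_t = sum(1 for c in labels if c == t)      # size of K_t in the data
    n_rest = len(labels) - n_t                  # size of CK_t in the data
    if n_t == 0 or n_rest == 0:
        raise ValueError("both K_t and its complement must be non-empty")
    return abs(d_t / n_t - d_rest / n_rest)


def membership(d1: int, d2: int, k1: int, k2: int) -> float:
    """Membership value (1') of a non-empty interval with respect to class K1."""
    a, b = d1 / k1, d2 / k2
    return a / (a + b)


def stability(counts: Sequence[Tuple[int, int]], k1: int, k2: int) -> float:
    """Stability (1'') from per-interval class counts (d1, d2).

    Assumes the intervals together cover all m = k1 + k2 objects, so that
    d1 + d2 plays the role of v - u + 1. The case f == 0.5 is not covered
    by (1''); here it contributes 0.5 * size (an assumption).
    """
    total = 0.0
    for d1, d2 in counts:
        f = membership(d1, d2, k1, k2)
        total += max(f, 1.0 - f) * (d1 + d2)
    return total / (k1 + k2)


labels = [1, 1, 1, 2, 2, 1, 2, 2]               # toy data, sorted by the feature
print(domination_score(labels, 0, 2, t=1))      # interval of the first three objects
print(stability([(3, 0), (1, 4)], k1=4, k2=4))  # first three objects vs. the rest
```

With perfectly separated classes every interval has membership 0 or 1, and the stability equals 1; mixed intervals pull it towards 0.5.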

The following expression is used as the second criterion for splitting quantitative features into intervals according to the number of classes [4, 40-42].

The ordered set of values of a feature $x_j$, $j \in I$, is divided into two disjoint intervals $[c_1, c_2]$, $(c_2, c_3]$. The criterion for determining the boundaries is based on the compactness hypothesis, according to which each interval contains the feature values of objects of one class only.


$$\left( \frac{\sum_{i=1}^{2} \left[ u_i^1 (u_i^1 - 1) + u_i^2 (u_i^2 - 1) \right]}{\sum_{i=1}^{2} |K_i| (|K_i| - 1)} \right) \left( \frac{\sum_{i=1}^{2} \left[ u_i^1 (|K_{3-i}| - u_{3-i}^1) + u_i^2 (|K_{3-i}| - u_{3-i}^2) \right]}{2 |K_1| |K_2|} \right) \to \max. \qquad (2)$$

Criterion (2) makes it possible to calculate the optimal value of the boundary between the intervals $[c_1, c_2]$, $(c_2, c_3]$ and to use it to define the gradations of a quantitative feature in the nominal scale of measurement. The expression in the left brackets is the intraclass similarity, and the expression in the right brackets is the interclass difference.

Here $u_i^1$, $u_i^2$ are the numbers of feature values of objects of class $K_i$, $i = 1, 2$, in the intervals $[c_1, c_2]$ and $(c_2, c_3]$ respectively, and $|K_i|$, $i = 1, 2$, is the number of objects of the $i$-th class in the given data. The compactness hypothesis is checked by means of this criterion.
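A brute-force search for the boundary $c_2$ can be sketched as follows. The sketch assumes two classes labelled 0 and 1, takes the criterion to be the product of the intraclass similarity and the interclass difference described above, and tries every admissible split of the sorted values. The function names and the sample data are hypothetical.

```python
def criterion2(u, k):
    """Product of intraclass similarity and interclass difference.

    u[p][i] is the number of class-i objects in interval p (p, i in {0, 1});
    k[i] is the total number of objects of class i.
    """
    intra = sum(u[p][i] * (u[p][i] - 1) for p in range(2) for i in range(2))
    intra /= sum(ki * (ki - 1) for ki in k)
    inter = sum(u[p][i] * (k[1 - i] - u[p][1 - i]) for p in range(2) for i in range(2))
    inter /= 2 * k[0] * k[1]
    return intra * inter


def best_boundary(values, labels):
    """Try every split of the sorted values into two intervals.

    Returns (c2, score), where c2 is the right end of the first interval.
    """
    pairs = sorted(zip(values, labels))
    k = [sum(1 for _, c in pairs if c == 0), sum(1 for _, c in pairs if c == 1)]
    best = (None, -1.0)
    left = [0, 0]                               # class counts in [c1, c2]
    for idx in range(len(pairs) - 1):
        value, c = pairs[idx]
        left[c] += 1
        if pairs[idx + 1][0] == value:
            continue                            # equal values cannot be separated
        right = [k[0] - left[0], k[1] - left[1]]
        score = criterion2([left, right], k)
        if score > best[1]:
            best = (value, score)
    return best


values = [100, 110, 120, 130, 150, 160, 170]    # hypothetical feature values
labels = [0, 0, 0, 0, 1, 1, 1]
print(best_boundary(values, labels))
```

When the two classes are perfectly separated by the boundary, both brackets equal 1 and so does the criterion.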

To decrease the complexity of the algorithm, preprocessing by constructing a matrix $D$ is proposed. The purpose of the preprocessing is to represent the ordered sequence of feature values $r_{j_1}, \dots, r_{j_m}$ as an integer-valued matrix.

The elements of the matrix $D = \{d_{pi}\}_{l \times (m+1)}$ are calculated as

$$d_{pi} = \begin{cases} 0, & i = 0, \\ d_{p,i-1} + g(p,i), & i > 0, \end{cases} \qquad \text{where } g(p,i) = \begin{cases} 1, & S \in K_p, \\ 0, & S \notin K_p. \end{cases}$$

Here the column index $i$ of the element $d_{pi}$, $p = 1, \dots, l$, $i = 1, \dots, m$, corresponds to the value of the $j$-th feature of the object $S \in E_0$ that occupies position $i$ in the ordered sequence.
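Read as a table of running class counts, the matrix $D$ lets the number of objects of a class inside any interval of positions be obtained from two of its entries. The sketch below illustrates this reading; it uses 0-based class labels, and the lookup by subtraction is an inference about how the preprocessing reduces the amount of computation, not a statement from the article.

```python
def build_d(sorted_labels, num_classes):
    """D[p][i] = number of class-p objects among the first i sorted objects."""
    m = len(sorted_labels)
    d = [[0] * (m + 1) for _ in range(num_classes)]   # column 0 stays zero
    for i, label in enumerate(sorted_labels, start=1):
        for p in range(num_classes):
            d[p][i] = d[p][i - 1] + (1 if label == p else 0)
    return d


def count_in_interval(d, p, u, v):
    """Number of class-p objects at positions u..v (1-based, inclusive)."""
    return d[p][v] - d[p][u - 1]


labels = [0, 0, 1, 0, 1, 1]          # toy class labels, sorted by the feature
d = build_d(labels, num_classes=2)
print(d[0], d[1])
print(count_in_interval(d, 1, 3, 5))  # class-1 objects at positions 3..5
```

Building the matrix takes one pass over the sorted objects per class, after which each interval count costs a single subtraction instead of a rescan of the data.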

Computational experiment

To illustrate the process, the "Gipertaniya" data set [1, 2-3], taken from the medical field, was used. The set consists of 147 objects described by 29 quantitative features. The objects are divided into two disjoint classes: K1 (healthy people) and K2 (ill people). The results of splitting some features into domination intervals according to (1) are presented in Table 1. The results of splitting some features into intervals according to (2) are presented in Table 2.

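Table 1. - Splitting some features into domination intervals according to (1)

| Name of feature | Domination intervals | Stability |
|---|---|---|
| Blood pressure (high) | [90...140], [150...220] | 0.97 |
| Blood pressure (low) | [60...80], [85...130] | 0.94 |
| RR interval | [0.6...0.7], [0.72...0.88], [0.9...0.1], [1.04...1.08], [1.12...1.28] | 0.81 |
| Age | [17...42], [43...80] | 0.88 |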

Table 2. - Splitting some features into intervals according to (2)

| Name of feature | Intervals | Stability |
|---|---|---|
| Blood pressure (high) | [90...140], [150...220] | 0.93 |
| Blood pressure (low) | [60...80], [85...130] | 0.91 |
| RR interval | [0.6...0.76], [0.78...1.28] | 0.25 |
| Age | [17...45], [46...80] | 0.61 |

Conclusion

As the tables above show, the feature "Blood pressure (high)" was split into two intervals by both (1) and (2), with stabilities of 0.97 and 0.93 respectively, so it can be considered a good feature. The feature "Blood pressure (low)" was also split into two intervals by both criteria, with almost the same stabilities. The feature "RR interval" was split into 5 domination intervals by (1) with a stability of 0.81, and into 2 intervals by (2) with a stability of 0.25. Thus, although (1) split "RR interval" into 5 domination intervals, its stability was better than the stability obtained from (2). The feature "Age" was also split into two intervals by both criteria, and its stability by (1) was greater than by (2). Based on these results, the first criterion is recommended for the "Gipertaniya" data.
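The comparison can be restated compactly from the stability values reported in Tables 1 and 2. The snippet below only re-reads those published numbers; the dictionary layout and the helper name are illustrative.

```python
# Stability values copied from Tables 1 and 2: (criterion (1), criterion (2)).
STABILITY = {
    "Blood pressure (high)": (0.97, 0.93),
    "Blood pressure (low)": (0.94, 0.91),
    "RR interval": (0.81, 0.25),
    "Age": (0.88, 0.61),
}


def better_criterion(s1, s2):
    """Return 1 if criterion (1) is at least as stable as criterion (2), else 2."""
    return 1 if s1 >= s2 else 2


for name, (s1, s2) in STABILITY.items():
    print(f"{name}: criterion ({better_criterion(s1, s2)}), difference {s1 - s2:.2f}")
```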

References:

1. Ignat'ev N. A., Adilova F. T., Matlatipov G. R., Chernush P. P. Knowledge Discovering from Clinical Data Based on Classification Tasks Solving//MEDINFO. - Amsterdam: IOS Press, 2001. - P. 1354-1358.

2. Ignat'ev N. A. Data mining based on nonparametric methods of classification and separation of object samples by surfaces. - Tashkent, 2008. - 108 p. (in Russian).

3. Ignat'ev N. A. Choosing the minimal configuration of neural networks//Vychislitel'nye tekhnologii (Computational Technologies). - Novosibirsk, 2001. - Vol. 6, No. 1. - P. 23-28. (in Russian).

4. Ignat'ev N. A. Generalized estimates and local metrics of objects in data mining//Monograph. - Tashkent: National University of Uzbekistan named after Mirzo Ulugbek, 2014. - 71 p. (in Russian).

5. Wold S. Pattern recognition by means of disjoint principal components models//Pattern Recognition. - 1976. - Vol. 8, No. 3. - P. 127-139.
