
TECHNICAL SCIENCES

MULTI-CRITERIA SELF-ADJUSTING GENETIC PROGRAMMING FOR DESIGNING NEURAL NETWORK MODELS IN THE TASK OF FEATURE SELECTION

Loseva Elena

Master's student, Siberian State Aerospace University named after academician M. F. Reshetnev, Krasnoyarsk

ABSTRACT

A new approach for the selection of informative features is presented. The method is based on multi-criteria «self-adjusting» genetic programming with modeling of neural network classifiers. Each neural network model is formed automatically by an evolutionary procedure, and its effectiveness is estimated by several criteria. The initial database (full set of features) and the new database (reduced set of features) were tested in the task of recognizing a human's gender and age. After applying the proposed method, an improvement in accuracy is observed.


Keywords: feature selection; «self-adjusting» genetic programming; neural network classifiers; correlation; precision; gender and age recognition.

Modern technologies allow a computer to carry on a dialogue with the user in natural language. So-called voice dialogue systems have the following functions: speech recognition and understanding, dialogue management, and formation of the speech flow. Any dialogue system is built on the recognition of sound data from the speaker. A sound wave vector can comprise a different number of features: the angular frequency, oscillation amplitude, oscillation frequency, sound wavelength, sound intensity, sound pressure, etc. The main difficulty for well-working dialogue systems lies in the processing of large amounts of data. Each feature is a vector consisting of a set of points that describe the behavior of the sound wave. All the data may be affected by noise (a natural factor) and voice distortion (a human factor), and the attributes may have a low level of variation. Therefore, an important step in the processing of an acoustical signal is the selection of informative features. Standard methods for extracting informative features sometimes do not show high efficiency if the data have a large size. It is therefore necessary to adapt existing methods or to develop new ones based, for example, on intelligent information technologies (IIT). At the present time, data analysis systems based on IIT have become popular in many sectors of human activity, so the question of developing methods for the automatic design and adaptation of IIT for specific tasks, such as feature selection, has become more urgent. It is necessary to eliminate the expensive design of IIT and to reduce the time required for the development of intelligent systems.

One of the most promising and popular technologies is artificial neural networks (ANN). The range of tasks solved by artificial neural networks is wide (classification, prediction). In this article a new approach using multi-criteria genetic programming for modeling artificial neural network (ANN) classifiers in the task of feature selection (SelfAGP+ANN) is proposed. Using the «self-adjusting» procedure for the evolutionary algorithm allows choosing the optimal combination of evolution operators (EO) automatically and, hence, reducing the computational resources and the requirements to the end user.

For the realization of the acoustical feature selection task, two databases were created in Krasnoyarsk at the recording studio «WAVE» in 2014. For recognition of a human's age the database RSDB-A (Russian Sound Data Base - Age) was created, which consists of voices of people from 14 to 18 and from 19 to 60 years old. For recognition of a human's gender the database RSDB-G (Russian Sound Data Base - Gender) was created, which consists of voices of both genders (man, woman) (Table 1). Feature extraction from the sound recordings was realized with the following software packages: Notepad++, Praat («script») [2, p. 15], and Excel 97-2003. All the described databases have «full» sets of features.

Table 1. Description of the used databases

Database name | Language | Volume of database | Number of features | Names of classes
RSDB-A | Russian | 800 | 50 | Adult, Underage
RSDB-G | Russian | 800 | 50 | Man, Woman

The proposed SelfAGP algorithm is based on the genetic programming (GP) algorithm. Genetic programming is used to solve a wide range of tasks: symbolic regression analysis, optimization, etc. To use GP, it is necessary to code objects in the form of a tree. The tree is a directed graph consisting of nodes and end vertices (leaves). The nodes come from the set F = {+, <}, which consists of two types of mathematical operators, and the leaves are composed from the set T = {IN1, IN2, IN3, …, INn - input blocks (features from the database); F1, F2, F3, F4, …, Fn - activation functions (neurons)} (Figure 1). The operator «+» from the set F indicates the formation of neurons into one layer, and the operator «<» indicates the formation of layers in the ANN. The number of input blocks in the ANN corresponds to the size of the feature set [5, p. 340].

Figure 1. Tree representation of a neural network model structure
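The «+»/«<» tree encoding described above can be sketched as a tiny decoder (an illustrative Python sketch; the function and tree names are assumptions, not the paper's code):

```python
# Minimal sketch (hypothetical names): decoding a GP tree into an ANN layout.
# A tree node is ("+", left, right) to merge neurons into one layer,
# ("<", left, right) to stack layers, and a string leaf such as "IN1"
# (an input block / feature) or "F2" (a neuron with activation function F2).

def decode(node):
    """Return the network as a list of layers, each a list of leaf labels."""
    if isinstance(node, str):          # leaf: input block or neuron
        return [[node]]
    op, left, right = node
    l, r = decode(left), decode(right)
    if op == "+":                      # '+' joins neurons within one layer
        return [l[0] + r[0]]
    if op == "<":                      # '<' concatenates whole layers
        return l + r
    raise ValueError(f"unknown operator {op!r}")

# Example: two input blocks feeding a layer of two neurons, then one neuron.
tree = ("<", ("+", "IN1", "IN2"), ("<", ("+", "F1", "F2"), "F3"))
print(decode(tree))   # [['IN1', 'IN2'], ['F1', 'F2'], ['F3']]
```

Such a decoded layout is what the evolutionary procedure evaluates with the fitness criteria described below.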

For the realization of the feature selection task a multi-criteria approach was used. For the estimation of accuracy, neural network classifiers are used. The general scheme of the algorithm is the following:

Step 1. Creating a population of individuals. Each individual is a tree, i.e. an ANN.

Step 2. Optimization of the neural network weighting factors by a one-criterion genetic algorithm (GA). The criterion for stopping the GA is the maximum value of classification accuracy.

Step 3. Choosing evolutionary operators. All combinations of the EO have equal probabilities of being selected at this step; at the following steps the combinations are recalculated. All combinations of EO were formed from different types of operators: two types of selection operators (tournament, proportional), two types of mutation operators (strong, weak) and one type of recombination (one-point) were used.
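With the operator types listed above, the pool of EO combinations and their initial equal probabilities can be sketched as follows (illustrative Python, not the paper's implementation):

```python
from itertools import product

# Sketch of Step 3 (names are illustrative): every EO combination starts
# with the same probability of being selected.
selections = ["tournament", "proportional"]
mutations = ["strong", "weak"]
crossovers = ["one-point"]

combos = list(product(selections, mutations, crossovers))
probs = {c: 1.0 / len(combos) for c in combos}   # equal at the first step
print(len(combos), probs[("tournament", "strong", "one-point")])  # 4 0.25
```

With two selection types, two mutation types and one recombination type this gives 2 × 2 × 1 = 4 combinations, each with probability 0.25.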

Step 4. Estimation of individuals by fitness functions:

The first criterion is the value of the pair correlation. The fitness function is calculated by formula (1):

Fit_1 = 1 / (1 + measure) → max    (1)

where «measure» is the maximum pair correlation value between the input signals of the ANN, formula (2):

measure = max(corr_1, …, corr_T)    (2)

where T is the number of input signal pairs and corr_t is the pair correlation value, calculated by formula (3):

corr_t = Σ_{i=1..M} (x^n_i − x̄^n)(x^{n+1}_i − x̄^{n+1}) / √( Σ_{i=1..M} (x^n_i − x̄^n)² · Σ_{i=1..M} (x^{n+1}_i − x̄^{n+1})² )    (3)

where (x^n, x^{n+1}) is a pair of input signals; n = 1, …, N; N is the number of input signals; M is the size of the database; T is the number of input signal pairs in the ANN.
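The correlation criterion of formulas (1)-(3) can be sketched as follows (a minimal Python illustration; the function names and the use of the absolute correlation value are assumptions, not the paper's code):

```python
import math

# Sketch of formulas (1)-(3): Fit_1 computed from the maximum pairwise
# Pearson correlation between the ANN input signals (data-set columns).

def pearson(x, y):
    """Pearson pair correlation of two equally long signal vectors."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def fit_1(columns):
    """Fit_1 = 1 / (1 + measure); absolute correlations are an assumption."""
    measure = max(abs(pearson(columns[i], columns[j]))
                  for i in range(len(columns))
                  for j in range(i + 1, len(columns)))
    return 1.0 / (1.0 + measure)

cols = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 1, 3, 2]]  # first two fully correlated
print(round(fit_1(cols), 3))   # 0.5 — a perfectly correlated pair gives measure = 1
```

A redundant (highly correlated) pair of inputs drives «measure» toward 1 and Fit_1 toward 0.5, so maximizing Fit_1 favors less redundant feature sets.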

The second criterion is the precision of classification. The fitness function is calculated by formula (4):

Fit_2 = P / N → max    (4)

where P is the number of correctly classified objects and N is the number of classified objects.

The third criterion is the ANN complexity. The fitness function is calculated by formula (5):

Fit_3 = n·N_1 + Σ_{i=1..L−1} N_i·N_{i+1} + N_L·l    (5)

where n is the number of input signals (input neurons); N_i is the number of neurons in the i-th hidden layer; i is the index of a hidden layer; L is the number of hidden layers in the neural network; l is the number of neurons in the last layer.
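Under the reconstructed reading of formula (5) — complexity as the number of connection weights of a layered network — the criterion can be sketched like this (hypothetical names; the exact summation bounds are an assumption):

```python
# Sketch of formula (5): ANN complexity counted as connection weights
# of a fully connected layered network (assumed interpretation).

def fit_3(n_inputs, hidden_layers, n_out):
    """hidden_layers: neuron counts N_1..N_L; n_out: neurons in last layer."""
    total = n_inputs * hidden_layers[0]                  # n * N_1
    for a, b in zip(hidden_layers, hidden_layers[1:]):   # sum of N_i * N_{i+1}
        total += a * b
    total += hidden_layers[-1] * n_out                   # N_L * l
    return total

# 50 inputs (one per feature), two hidden layers of 5 neurons, 2 outputs:
print(fit_3(50, [5, 5], 2))   # 50*5 + 5*5 + 5*2 = 285
```

Smaller values correspond to more compact networks, which is why this criterion complements the accuracy criterion.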

Step 5. Selection of two parents for recombination by the VEGA (Vector Evaluated Genetic Algorithm) method. The selection is based on the suitability of individuals by each of the K criteria separately; the intermediate population is therefore filled with equal portions of individuals selected by each of the criteria. The VEGA [1, p. 43] method for selecting the intermediate population is the following:

Input: P_t (current population).

Output: P' (intermediate population).

- Set k = 1, k = 1, …, K, and P' = ∅, where K is the number of criteria.

- For each individual i ∈ P_t, i = 1, …, N, calculate the fitness using the k-th criterion.

- For s = 1, …, N/K, select an individual i ∈ P_t from the current population by the selection operator and copy it into the intermediate population P': P' = P' + {i}.

- Set k = k + 1.

- If k ≤ K, go to step 2; otherwise P' is the resulting intermediate population.

- Choose two individuals (parents) from the intermediate population P' by one type of selection operator.
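The VEGA portion-filling above can be sketched as a toy Python function (binary tournament selection is used here for concreteness; all names are assumptions):

```python
import random

# Sketch of Step 5 (VEGA): the intermediate population is filled in K equal
# portions, each portion selected by one criterion.

def vega_intermediate(population, criteria, rng):
    """population: list of individuals; criteria: list of fitness functions."""
    K, N = len(criteria), len(population)
    intermediate = []
    for fit in criteria:                       # one portion per criterion
        for _ in range(N // K):
            a, b = rng.sample(population, 2)   # binary tournament
            intermediate.append(a if fit(a) >= fit(b) else b)
    return intermediate

rng = random.Random(0)
pop = list(range(10))                      # toy individuals
crits = [lambda x: x, lambda x: -x]        # two conflicting criteria
inter = vega_intermediate(pop, crits, rng)
print(len(inter))   # 10: two portions of N/K = 5 each
```

Because each criterion fills its own portion, individuals that are strong on any single criterion survive into the intermediate population.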

Step 6. Recombination of the two selected individuals.

Step 7. Mutation of the descendant.

Step 8. Evaluation of the new descendant.

Step 9. Choosing a new combination of EO. The efficiency of the EO on the previous steps is calculated by formula (6):

Fit_Oper = (1/n_p) · Σ_{ind=1..N_p} (Fit1_ind + Fit2_ind) → max    (6)

where ind = 1, …, N_p; N_p is the number of determined iterations in the algorithm; Fit1_ind and Fit2_ind are the fitness values of the individuals by the two functions; n_p is the number of descendants created by the chosen variant of the EO combination.

The number of fitness functions in the sum may differ; it depends on the algorithm. After comparing the Fit_Oper values of the EO combinations, the variant of EO with the highest value becomes the «priority» option and its probability is increased by 0.05, at the expense of the EO combination with the lowest probability value. The probabilities are recalculated at each iteration of the algorithm. If all combinations have been placed in the «priority» option, all probability values are reset and the «self-adjusting» procedure repeats (step 1): new variants of EO combinations are generated again.
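The probability bookkeeping of Step 9 can be sketched like this (a hypothetical reading of the source: the 0.05 gained by the «priority» combination is taken from the lowest-probability combination):

```python
# Sketch of Step 9 (assumed bookkeeping): the EO combination with the highest
# Fit_Oper becomes the «priority» option and gains 0.05 of selection
# probability, donated by the lowest-probability combination.

def readjust(probs, fit_oper, delta=0.05):
    """probs, fit_oper: dicts keyed by EO combination; probs is mutated."""
    best = max(fit_oper, key=fit_oper.get)   # «priority» combination
    donor = min(probs, key=probs.get)        # lowest-probability combination
    if donor != best and probs[donor] >= delta:
        probs[donor] -= delta
        probs[best] += delta
    return probs

p = {"A": 0.25, "B": 0.25, "C": 0.30, "D": 0.20}
f = {"A": 0.9, "B": 0.6, "C": 0.7, "D": 0.5}
print(round(readjust(p, f)["A"], 2))   # 0.3 — "A" gains 0.05 taken from "D"
```

Keeping the total probability mass constant means the self-adjusting step only shifts preference toward operators that recently produced fit descendants.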

Step 10. If the algorithm has reached the predetermined value of accuracy or has exhausted the computational resources, go to step 11; otherwise go to step 2.

Step 11. Choose the «best» (most efficient) individual.

The «best» individual is the ANN with the optimal set of input values (set of features). After the proposed algorithm has run, the initial set of features is updated; the new set of features is called «shortened».

For comparison of the effectiveness on the «full» and «shortened» sets of features the following classifiers were chosen:

- Support Vector Machine (SVM), trained with John Platt's sequential minimal optimization method;

- Simple Logistic;

- Naive Bayes Kernel classifier;

- Sequential Minimal Optimization (SMO);

- Additive Logistic Regression - Rule Induction.

To improve the accuracy of each classifier, the «Optimization Parameter (Evolutionary)» algorithm was applied, which is based on optimizing the classifier parameters with a one-criterion GA. The optimized parameters are presented in Table 2.

Table 2. Parameters of the classifiers

Classifier | Parameters
Naive Bayes Kernel | application_grid_size; number_of_kernels; minimum_bandwidth
SMO | V - the number of folds for internal cross-validation; C - coefficient of retraining; M - the logistic model
Simple Logistic | I - number of iterations
Rule Induction | pureness; sample_ratio

Experimental results

The starting settings for the proposed algorithm are the following: maximum number of layers in the ANN - 8; number of neurons in each layer of the ANN - 5; maximum number of individuals - 80. Each database was divided into training and test sets in the proportion 80% / 20%, respectively. Table 3 contains the relative classification accuracy after 20 runs for all sets of features.

Table 3. Relative classification accuracy after 20 runs

Classifier | RSDB-A, «full», % | RSDB-A, «shortened», % | Difference, % | RSDB-G, «full», % | RSDB-G, «shortened», % | Difference, %
SMO | 95.01 | 96.87 | 1.77 | 95.56 | 95.2 | 0.36
Simple Logistic | 94.12 | 94.92 | 0.8 | 94.16 | 94.3 | 0.14
Rule Induction | 91.22 | 93.16 | 1.94 | 92 | 93.1 | 1.1
Naïve Bayes Kernel | 95.01 | 96 | 0.99 | 94.5 | 94.87 | 0.37

The algorithm was realized in Visual Studio (C#). All tests were done on a laptop with 1 terabyte of memory and a four-core Intel Core i5-2410 processor (2.10 GHz). Also, for the research tests with different types of classifiers, the Rapid Miner v. 5.3 software [3, p. 80] with the additional Weka extension [4, p. 5] was used.

In conclusion it should be said that the described SelfAGP+ANN algorithm showed good results for feature selection. The accuracy of the used classifiers with the «shortened» feature set is in general better than with the «full» feature set; the differences in accuracy are more than 0.14 but less than 2 percentage points. An evolutionary algorithm (GP) with design of neural networks can be applied in different areas of optimization tasks, including the task of feature selection. The number of features was reduced on average from 50 to 30 for RSDB-A and from 50 to 26 for RSDB-G. According to the research it can be concluded that the developed approach shows good results, which is confirmed by Table 3. The algorithm allows finding a relevant feature set by applying compact and accurate neural networks. This approach is useful for such categories of tasks and can also be implemented in dialogue systems to improve their efficiency.

References:

1. Ashish G., Satchidananda D. Evolutionary Algorithms for Multi-Criterion Optimization: A Survey // International Journal of Computing & Information Science. Vol. 2, No. 1.

2. Boersma P. Praat, a system for doing phonetics by computer // Glot International. 2002. Vol. 5, No. 9/10.

3. Fareed Akthar, Caroline Hahne. RapidMiner 5: Operator Reference. Dortmund, 2012.

4. Hall M. [et al.]. The WEKA Data Mining Software: An Update // SIGKDD Explorations. 2009. Vol. 11, iss. 1. http://modis.ispras.ru/seminar/wp-content/uploads/2012/07/Coursework.pdf.

5. Loseva E.D., Lipinsky L.V. Ensembles of neural network classifiers using multi-criteria self-configuring genetic programming // Actual Problems of Aviation and Cosmonautics: coll. of abstracts. 2015. Part 1.
