TECHNICAL SCIENCES
MULTI-CRITERIA SELF-ADJUSTING GENETIC PROGRAMMING FOR DESIGNING NEURAL NETWORK MODELS IN THE TASK OF FEATURE SELECTION
Loseva Elena
Master's student, Siberian State Aerospace University named after academician M. F. Reshetnev, Krasnoyarsk
ABSTRACT
A new approach for the selection of informative features is presented. The method is based on multi-criteria «self-adjusting» genetic programming with the modeling of neural network classifiers. Each neural network model is formed automatically by an evolutionary procedure, and its effectiveness is estimated by several criteria. The initial (full set of features) and the new data base (reduced set of features) were tested in the task of recognizing a human's gender and age. After applying the proposed method, an improvement in accuracy is observed.
Keywords: feature selection; «self-adjusting» genetic programming; neural network classifiers; correlation; accuracy; gender and age recognition.
Modern technology allows computers to carry on a dialogue with the user in natural language. So-called voice dialogue systems have the following functions: speech recognition and understanding, dialogue management, and formation of the speech flow. Any dialogue system is built on the recognition of sound data from the speaker. A sound wave-vector can comprise different features: the angular frequency, oscillation amplitude, oscillation frequency, sound wavelength, sound intensity, sound pressure, etc. The main difficulty in building well-working dialogue systems lies in the processing of large amounts of data. Each feature is a vector consisting of a set of points that describe the behavior of the sound wave. All data may be affected by noise (a natural factor) and voice distortion (a human factor), and the attributes may have a low level of variation. Therefore, an important step in the processing of an acoustical signal is the selection of informative features. Standard methods for extracting informative features sometimes do not show high efficiency when the data has a large size. It is therefore necessary to involve existing methods or to develop new ones, for example, based on intelligent information technologies (IIT). At present, data analysis systems based on IIT have become popular in many sectors of human activity, so the question of developing methods for the automatic design and adaptation of IIT for specific tasks, such as feature selection, has become more urgent. It is necessary to eliminate the expensive design of IIT and reduce the time required for the development of intelligent systems. One of the most perspective and popular technologies is artificial neural networks (ANN). The range of tasks solved by artificial neural networks is wide (classification, prediction). In this article, a new approach using multi-criteria genetic programming for modeling artificial neural network (ANN) classifiers in the task of feature selection (SelfAGP+ANN) is proposed. The «self-adjusting» procedure for the evolutionary algorithm allows choosing the optimal combination of evolution operators (EO) automatically, and hence reduces the computational resources and the requirements to the end user.
For the acoustical feature selection task, two data bases were created in Krasnoyarsk at the recording studio «WAVE» in 2014. For recognizing a human's age, the data base RSDB-A (Russian Sound Data Base - Age) was created, which consists of voices of people from 14 to 18 and from 19 to 60 years old. For recognizing a human's gender, the data base RSDB-G (Russian Sound Data Base - Gender) was created, which consists of voices of both genders (man, woman) (Table 1). The extraction of features from the sound records was realized with the following software packages: Notepad++, Praat («script») [2, p. 15], and Excel 97-2003. All described data bases have «full» sets of features.
Table 1.
Description of the used data bases

Data base name | Language | Volume of data base | Amount of features | Names of classes
RSDB-A | Russian | 800 | 50 | Adult, Underage
RSDB-G | Russian | 800 | 50 | Man, Woman
The proposed algorithm SelfAGP is based on the genetic programming (GP) algorithm. Genetic programming is used to solve a wide range of tasks: symbolic regression analysis, optimization, etc. To use GP technology, it is necessary to encode objects in the form of a tree. The tree is a directed graph consisting of nodes and end vertices (leaves). The nodes come from the multiplicity F = {+, <}, which consists of two types of mathematical operators, and the leaves are composed from the multiplicity T = {IN1, IN2, IN3, ..., INn - input blocks (feature set from the data base); F1, F2, F3, F4, ..., Fn - activation functions (neurons)} (Figure 1). The operator «+» from multiplicity F indicates the formation of neurons into one layer, and the operator «<» indicates the formation of layers in the ANN. The amount of input blocks in the ANN corresponds to the size of the feature set [5, p. 340].
Figure 1. Tree representation of a neural network model structure
For the realization of the feature selection task, a multi-criteria approach was used. Neural network classifiers are used for the estimation of accuracy. The general scheme of the algorithm is the following:
Step 1. Creating a population of individuals. Each individual is a tree - an ANN.
Step 2. Optimization of the neural network weighting factors by a one-criterion genetic algorithm (GA). The stopping criterion for the GA is the maximum value of classification accuracy.
Step 3. Choosing evolutionary operators. All combinations of the EO have equal probabilities of being selected in this step; in the following steps, the probabilities of the EO combinations are recalculated. All combinations of EO were formed with different types of operators: two types of selection operators (tournament, proportional), two types of mutation operators (strong, weak), and one type of recombination (one-point).
Step 4. Estimation of individuals by fitness functions:
The first criterion is the value of the pair correlation. The fitness function is calculated by formula (1):

Fit_1 = \frac{1}{1 + measure} \to \max  (1)

where «measure» is the maximum pair correlation value between input signals in the ANN, formula (2):

measure = \max(corr_1, \ldots, corr_T)  (2)
where T is the amount of input signal pairs and corr_t is the pair correlation value, calculated by formula (3):

corr_t = \frac{\sum_{i=1}^{M} (x_i^{n} - \bar{x}^{n})(x_i^{n+1} - \bar{x}^{n+1})}{\sqrt{\sum_{i=1}^{M} (x_i^{n} - \bar{x}^{n})^2 \cdot \sum_{i=1}^{M} (x_i^{n+1} - \bar{x}^{n+1})^2}}  (3)

where (x^n, x^{n+1}) is a pair of input signals; n = 1, ..., N, N - amount of input signals; M - size of the data base; T - amount of input signal pairs in the ANN.
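Formulas (1)-(3) can be sketched in code as follows. This is an illustrative reading, not the authors' implementation: it assumes `signals` is the list of feature columns (each of length M) fed into the ANN, and that «measure» is taken over all distinct pairs of input signals.

```python
# Sketch of criterion (1)-(3): Fit_1 rewards a network whose selected
# input features are weakly pairwise correlated.
import math

def pair_corr(x, y):
    """Pearson pair correlation of two input signals, formula (3)."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                    * sum((yi - my) ** 2 for yi in y))
    return num / den if den else 0.0

def fit_1(signals):
    """Formula (1): 1 / (1 + measure), where measure is the maximum
    absolute pair correlation over all input-signal pairs, formula (2)."""
    measure = max(abs(pair_corr(signals[i], signals[j]))
                  for i in range(len(signals))
                  for j in range(i + 1, len(signals)))
    return 1.0 / (1.0 + measure)

# Two perfectly correlated signals dominate the measure, so Fit_1
# drops to its minimum for this set: 1 / (1 + 1) = 0.5.
print(fit_1([[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]]))  # 0.5
```

A feature subset with redundant (highly correlated) inputs thus receives a low Fit_1, pushing the search toward less redundant subsets.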
The second criterion is the classification accuracy. The fitness function is calculated by formula (4):

Fit_2 = \frac{P}{N} \to \max  (4)

where P is the amount of correctly classified objects and N is the total amount of classified objects.
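Formula (4) reads directly as code; in this minimal sketch the argument names `predictions` and `labels` are illustrative.

```python
# Formula (4): classification accuracy, to be maximized.
def fit_2(predictions, labels):
    correct = sum(p == y for p, y in zip(predictions, labels))  # P
    return correct / len(labels)                                # P / N

# Three of four objects classified correctly.
print(fit_2([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```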
The third criterion is the ANN complexity. The fitness function is calculated by formula (5):

Fit_3 = n \cdot N_1 + \sum_{i=1}^{L-1} N_i \cdot N_{i+1} + N_L \cdot l \to \min  (5)

where n - amount of input signals (neurons); N_i - amount of neurons in the i-th hidden layer; i - number of the hidden layer; L - amount of hidden layers in the neural network; l - amount of neurons in the last layer.
Step 5. Selection of two parents for recombination by the VEGA (Vector Evaluated Genetic Algorithm) method. The selection is based on the suitability of individuals by each of the K criteria separately: the intermediate population is filled with equal portions of individuals selected by each criterion. The VEGA [1, p. 43] method for selecting the intermediate population is the following:
Input: P_t (current population).
Output: P' (intermediate population).
- Set k = 1 and P' = ∅, where K is the amount of criteria, k = 1, ..., K.
- For each individual i ∈ P_t, i = 1, ..., N, calculate the fitness using the k-th criterion.
- For s = 1, ..., N/K, select an individual i ∈ P_t from the current population by the selection operator and copy it into the intermediate population P': P' = P' + {i}.
- Set k = k + 1.
- If k ≤ K, go to step 2; otherwise P' is the resulting intermediate population.
- Choose two individuals (parents) from the intermediate population P' by one of the types of selection operator.
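The VEGA fill described above can be sketched as follows. This is a minimal illustration, not the authors' code; tournament selection is assumed as the selection operator, and the function names are illustrative.

```python
# Sketch of the VEGA intermediate-population fill: equal portions of
# N/K individuals, each portion selected by one of the K criteria.
import random

def vega_intermediate(population, fitness_fns, tournament_size=2):
    """population: list of individuals; fitness_fns: K functions,
    each mapping an individual to a fitness to be maximized."""
    K, N = len(fitness_fns), len(population)
    intermediate = []
    for fit in fitness_fns:                  # k = 1 .. K
        for _ in range(N // K):              # s = 1 .. N/K
            # tournament selection by the k-th criterion only
            contenders = random.sample(population, tournament_size)
            intermediate.append(max(contenders, key=fit))
    return intermediate

# Usage: two criteria over toy individuals encoded as (a, b) pairs;
# the result has N = 4 members, half chosen by each criterion.
pop = [(1, 9), (5, 5), (9, 1), (4, 4)]
inter = vega_intermediate(pop, [lambda x: x[0], lambda x: x[1]])
```

Parents for recombination are then drawn from `inter`, so individuals strong on any single criterion keep a chance to reproduce.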
Step 6. Recombination of the two selected individuals.
Step 7. Mutation of the descendant.
Step 8. Evaluation of the new descendant.
Step 9. Choosing a new combination of EO. The efficiency of each EO combination on the previous steps is calculated by formula (6):

Fit\_Oper = \frac{1}{N_p} \sum_{ind=1}^{N_p} (Fit1_{ind} + Fit2_{ind}) \to \max  (6)

where N_p - amount of descendants created by the chosen variant of EO combination over the determined number of iterations of the algorithm; Fit1_ind, Fit2_ind - fitness values of the individual by the two fitness functions. The number of fitness functions added in the sum may differ, depending on the algorithm. After comparing the Fit_Oper values of the EO combinations, the variant of EO with the highest value becomes the «priority» option and its probability increases by 0.05; the combination of EO with the lowest probability value gives this probability to the «priority» variant. The probabilities are recalculated at each iteration of the algorithm. If all combinations have been placed in the «priority» option, all probability values are cleared and the «self-adjusting» procedure repeats (step 1): new variants of EO combinations are generated again.
Step 10. If the algorithm has reached the predetermined accuracy value or has exhausted the computational resources, go to step 11; otherwise go to step 2.
Step 11. Choose the «best» (most efficient) individual.
The «best» individual is the ANN with the optimal set of input values (set of features). After the proposed algorithm has finished, the initial set of features is updated; the new set of features is called the «shortened» set.
To compare the effectiveness on the «full» and «shortened» sets of features, the following classifiers were chosen:
- Support Vector Machine (SVM), trained with J. Platt's sequential minimal optimization method;
- Simple Logistic;
- Naive Bayes Kernel classifier;
- Sequential Minimal Optimization (SMO);
- Additive Logistic Regression - Rule Induction.
To improve the accuracy of each classifier, the «Optimization Parameter (Evolutionary)» algorithm was applied, which is based on the optimization of the classifier parameters by a one-criterion GA. The optimized parameters are presented in Table 2.
Table 2.
Optimized parameters of the classifiers

Classifier | Parameters
Naive Bayes Kernel | application_grid_size; number_of_kernels; minimum_bandwidth
SMO | V - number of folds for internal cross-validation; C - complexity (overfitting) coefficient; M - the logistic model
Simple Logistic | I - amount of iterations
Rule Induction | pureness; sample_ratio
Experimental results
The starting settings of the proposed algorithm are the following: maximum number of layers in the ANN - 8, number of neurons in each layer of the ANN - 5, maximum number of individuals - 80. Each data base was divided into test and training parts in the proportion 80% / 20%, respectively. Table 3 contains the relative classification accuracy after 20 runs for all sets of features.
Table 3.
Classification accuracy on the «full» and «shortened» sets of features

Classifier | RSDB-A: Accuracy, % («full») | RSDB-A: Accuracy, % («shortened») | RSDB-A: Difference, % | RSDB-G: Accuracy, % («full») | RSDB-G: Accuracy, % («shortened») | RSDB-G: Difference, %
SMO | 95.01 | 96.87 | 1.77 | 95.56 | 95.2 | 0.36
Simple Logistic | 94.12 | 94.92 | 0.8 | 94.16 | 94.3 | 0.14
Rule Induction | 91.22 | 93.16 | 1.94 | 92 | 93.1 | 1.1
Naïve Bayes Kernel | 95.01 | 96 | 0.99 | 94.5 | 94.87 | 0.37
The algorithm was realized in Visual Studio C#. All tests were done on a laptop with 1 terabyte of memory and a four-core Intel Core i5-2410 processor (2.10 GHz). Also, for the research tests with different types of classifiers, the Rapid Miner v. 5.3 [3, p. 80] software with the additional Weka extension [4, p. 5] was used.
In conclusion, it should be said that the described algorithm SelfAGP+ANN showed good results for feature selection. The accuracy of the used classifiers with the «shortened» feature set is in general better than with the «full» feature set; the difference between the accuracy results is more than 0.14% but less than 2%. An evolutionary algorithm (GP) with the design of neural networks can be applied to different optimization tasks, including the task of feature selection. The reduction of features was on average from 50 to 30 attributes for RSDB-A and from 50 to 26 attributes for RSDB-G. According to the research, it can be concluded that the developed approach shows a good result after testing, which is confirmed by Table 3. The algorithm allows finding a relevant feature set by applying compact and accurate neural networks. This approach can also be useful for implementation in dialogue systems to improve their efficiency.
References:
1. Ghosh A., Dehuri S. Evolutionary Algorithms for Multi-Criterion Optimization: A Survey // International Journal of Computing & Information Sciences. Vol. 2, No. 1.
2. Boersma P. Praat, a system for doing phonetics by computer // Glot International. 2002. Vol. 5, No. 9/10.
3. Akthar F., Hahne C. Rapid Miner 5: Operator Reference. Dortmund, 2012.
4. Hall M. [et al.]. The WEKA Data Mining Software: An Update // SIGKDD Explorations. 2009. Vol. 11, iss. 1.
5. Loseva E.D., Lipinsky L.V. Ensembles of neural network classifiers using multi-criteria self-configuring genetic programming // Actual problems of aviation and cosmonautics: coll. of abstracts. 2015. Part 1.