UDC 004.5
Vestnik SibGAU Vol. 17, No. 1, P. 45-49
MULTI-OBJECTIVE BASED FEATURE SELECTION AND NEURAL NETWORKS ENSEMBLE METHOD FOR SOLVING EMOTION RECOGNITION PROBLEM
I. A. Ivanov
Reshetnev Siberian State Aerospace University 31 Krasnoyarsky Rabochy Av., Krasnoyarsk, 660037, Russian Federation E-mail: ilyaiv92@gmail.com
In this paper we apply a multi-objective optimization approach to find a Pareto optimal ensemble of neural network classifiers for solving the emotion recognition problem. The Pareto set of neural networks is found by optimizing two conflicting criteria: maximizing the emotion classification rate and minimizing the number of neural network neurons. We implemented several ensemble fusion schemes: voting, averaging class probabilities, and adding an auxiliary meta-classification layer. Since the number of audio and video features extracted from raw video sequences is quite large, we also applied the multi-objective approach to find the optimal subset of features, in this case maximizing the classification rate and minimizing the number of features. The multi-objective approach to neural network parameter optimization and feature selection was compared to the classic single-objective approach on several datasets. According to the experimental results, the multi-objective approach to neural network optimization provided an emotion classification rate that was on average 7.1 % higher than that of single-objective optimization. Applying the multi-objective approach to feature selection improved the classification rate by 2.8 % compared to the single-objective approach, by 5.4 % compared to principal components analysis, and by 13.9 % compared to using no dimensionality reduction at all. Given these results, we suggest using the multi-objective approach to machine learning algorithm optimization and feature selection in further research on the emotion recognition problem and other complex classification tasks.
Keywords: ensemble, neural network, multi-objective optimization, emotion recognition.
Introduction. The problem of configuring machine learning algorithms is crucial for finding effective solutions to practical machine learning and data analysis problems, and much research has been done on algorithms for configuring the parameters and structure of machine learning methods. In this work we apply a multi-objective optimization method to neural network parameter optimization and compare it to the single-objective method, using emotion recognition from audio-visual features as a benchmark problem. There are two conflicting optimized criteria: the emotion classification rate (maximized) and the number of neural network neurons (minimized). In this formulation we end up with a Pareto optimal set of neural networks, some of which may have a complex structure (more neurons) and a better classification rate on the train set (overfitting), while others may have a simpler structure and a worse classification rate, but a lower generalization error (underfitting). The idea proposed in this work is to combine such diverse Pareto optimal neural networks into an ensemble that would, hopefully, yield a better classification rate on the test set.
Another important step in machine learning is feature space dimensionality reduction. In this work we also apply a multi-objective optimization method to find the optimal subset of features. The optimized criteria are the emotion classification rate (maximized) and the number of features chosen for further analysis (minimized). The proposed multi-objective approach to feature selection is compared to the single-objective approach and to principal components analysis (PCA).
The problem of emotion recognition is part of the more global problem of human-machine interaction (HMI). Systems that provide the means of HMI are called dialogue systems (DS). Dialogue systems consist of several modules: speech analysis, intelligence gathering, and taking actions. The gathered intelligence includes a person's gender, age, emotional state, ethnicity and other information that might be valuable for deciding on the actions. In this work we focus on classifying a person's emotional state from the available audio and video information of the person's face.
Significant related work. The paper by Rashid et al. [1] explores the problem of human emotion recognition and proposes combining audio and visual features. First, the audio stream is separated from the video stream. Feature detection and 3D patch extraction are applied to the video streams, and the dimensionality of the video features is reduced by PCA. From the audio streams, prosodic and mel-frequency cepstrum coefficients (MFCC) are extracted. After feature extraction the authors construct separate codebooks for the audio and video modalities by applying the K-means algorithm in Euclidean space. Finally, multiclass support vector machine (SVM) classifiers are applied to the audio and video data, and decision-level fusion is performed with the Bayes sum rule. The classifier built on audio features achieved an average accuracy of 67.39 %, video features gave an accuracy of 74.15 %, while combining audio and visual features on the decision level improved the accuracy to 80.27 %.
Kahou et al. [2] described their submission to the 2013 Emotion Recognition in the Wild Challenge. The approach combined multiple deep neural networks: deep convolutional neural networks (CNNs) for analyzing facial expressions in video frames, a deep belief net (DBN) to capture audio information, a deep autoencoder to model the spatio-temporal information produced by human actions, and a shallow network architecture focused on features extracted from the mouth of the primary human subject in the scene. The authors used the Toronto Face Dataset, containing 4,178 images labelled with basic emotions and only fully frontal facing poses, and a dataset harvested from Google image search consisting of 35,887 images with seven expression classes. All images were converted to grayscale of size 48×48. Several decision-level integration techniques were used: averaged predictions, SVM and multilayer perceptron (MLP) aggregation, and random search for model weights. The best accuracy they achieved on the competition test set was 41.03 %.
In the work by Cruz et al. [3] the change in features is modelled, rather than their simple combination. First, the faces are extracted from the original images, and Local Phase Quantization (LPQ) histograms are extracted in each n × n local region. The histograms are concatenated to form a feature vector. The derivative of the features is computed by two methods: convolution with a difference of Gaussians (DoG) filter and the difference of feature histograms. A linear SVM is trained to output posterior probabilities, and the changes are modelled with a hidden Markov model. The proposed method was tested on the Audio/Visual Emotion Challenge 2011 dataset, which consists of 63 videos of 13 individuals, with frontal face videos taken while the subject is engaged in an interview conversation. The authors claim a 13 % increase in classification rate on this data.
In [4] the authors use the electroencephalogram, pupillary response and gaze distance to classify a subject's arousal as calm, medium aroused or activated, and valence as unpleasant, neutral or pleasant. The data consists of 20 video clips with emotional content from movies. The valence classification accuracy achieved is 68.5 %, and the arousal classification accuracy is 76.4 %.
Busso et al. [5] researched the fusion of acoustic and facial expression information. They used a database recorded from an actress reading 258 sentences expressing emotions. Separate classifiers based on acoustic data and facial expressions were built, with classification accuracies of 70.9 % and 85 % respectively. The facial expression features cover 5 areas: forehead, eyebrow, low eye, right and left cheeks. The authors covered two data fusion approaches: decision-level and feature-level integration. On the feature level, audio and facial expression features were combined to build one classifier, giving 90 % accuracy. On the decision level, several criteria were used to combine the posterior probabilities of the unimodal systems: maximum - the emotion with the greatest posterior probability in both modalities is selected; average - the posterior probability of each modality is equally weighted and the maximum is selected; product - the posterior probabilities are multiplied and the maximum is selected; weight - different weights are applied to the different unimodal systems. The accuracies of the decision-level bimodal classifiers range from 84 % to 89 %, product combining being the most efficient.
Methodology. The process of solving the emotion recognition problem, as any other classification problem, consists of the following steps:
1. Raw data gathering, in our case obtaining the database of labeled video recordings with emotional content.
2. Feature extraction, since quantitative analysis requires quantitative features to build upon.
3. Dimensionality reduction, performed when the number of features is too large, for higher generalizability and lower computational costs.
4. Training classifier, using the train set.
5. Making predictions, using the test set.
Completion of each step is crucial for successful classification. While gathering valuable data and choosing the right feature extraction technique are the most important and time-consuming steps, researchers often cannot collect their own data, or are forced to work with data provided to them. In such cases what really matters are steps 3 and 4: extracting as much as possible from the available data and implementing the right classifier to make accurate predictions. In this work we focus on these two goals.
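To make steps 3-5 concrete, the following minimal Python sketch wires dimensionality reduction, classifier training and prediction together with scikit-learn; the random placeholder data, the split ratio and the PCA + SVM choice are illustrative assumptions, not this paper's exact setup.

```python
# Minimal sketch of steps 3-5: dimensionality reduction, classifier
# training and prediction. X and y are random placeholders standing in
# for the extracted audio-visual features and emotion labels.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(480, 1883))   # 480 recordings, 1883 audio-visual features
y = rng.integers(0, 7, size=480)   # 7 basic emotion classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = make_pipeline(PCA(n_components=180), SVC())  # step 3 + step 4
model.fit(X_train, y_train)                          # train on the train set
rate = 100.0 * model.score(X_test, y_test)           # step 5: rate in %
print(f"emotion classification rate: {rate:.1f} %")
```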
We use the SAVEE emotional database [6] as the data source for our analysis. It contains 480 video recordings of 4 male individuals reading a set of sentences expressing 7 basic emotions: anger, happiness, disgust, neutral, fear, surprise and sadness. We apply the openSMILE software [7] for extracting audio features, and 3 video feature extraction methods:
1) Quantized Local Zernike Moments (QLZM) [8];
2) Local Binary Patterns (LBP) [9];
3) Local Binary Patterns on Three Orthogonal Planes (LBP-TOP) [10].
Next comes the dimensionality reduction step, which can be performed in several ways. There are two groups of algorithms used for dimensionality reduction:
1. Feature transformation methods, which transform a large number of features into a smaller number of more informative ones. Principal components analysis (PCA) is a popular method of this group.
2. Feature selection methods, which select an optimal (in some sense) subset of the original features.
The classic approach in feature selection is to use single-objective optimization algorithms to find the feature subset that, combined with a proper classifier, provides the highest classification rate. We went further and applied multi-objective optimization algorithms to this problem. The first objective is the same as in the single-objective formulation, the classification rate, defined as follows:
R = (Nc / N) · 100 %, (1)
where Nc is the number of correctly classified instances; N is the total number of dataset instances; R is the classification rate. The second objective is minimizing the number of selected features, since this is the essence of the dimensionality reduction step:

|F| → min, (2)
where F is the subset of selected features. The support vector machine (SVM) algorithm was used for classification. Other objectives may also be optimized during the feature selection procedure, such as intra-class and inter-class distances [11]. A minimal sketch of evaluating the two objectives for a candidate feature subset is given below.
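In this sketch, the binary-mask encoding of the subset F and the 5-fold cross-validated SVM are illustrative assumptions rather than the paper's exact protocol.

```python
# Sketch of the bi-objective fitness for feature selection:
# objective 1 maximizes the classification rate R, eq. (1);
# objective 2 minimizes the subset size |F|, eq. (2).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def feature_subset_objectives(mask, X, y):
    """mask: boolean vector over all features, encoding the subset F."""
    if not mask.any():
        return 0.0, 0  # empty subset: assign the worst possible rate
    rate = 100.0 * cross_val_score(SVC(), X[:, mask], y, cv=5).mean()
    return rate, int(mask.sum())  # (R to maximize, |F| to minimize)
```

A multi-objective GA then evolves a population of such masks, keeping the non-dominated (R, |F|) pairs.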
The next step, training the classifier, involves choosing the classification algorithm and adjusting its parameters. We chose a feed-forward single-layer neural network with a sigmoid activation function. The popular approach to neural network parameter adjustment is to use optimization algorithms to find the optimal parameter values. In our formulation, the input variables are the overall number of network neurons and the number of training iterations, varying within the following bounds: number of network neurons Nn = 2:50, number of network training iterations Nt = 2:200. A sketch of evaluating one such candidate configuration follows.
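In this sketch, scikit-learn's MLPClassifier with a logistic activation stands in for the paper's single-layer sigmoid network, and the held-out validation split is also our assumption.

```python
# Sketch: evaluate one candidate (Nn, Nt) pair from the search space
# Nn in [2, 50] neurons, Nt in [2, 200] training iterations.
from sklearn.neural_network import MLPClassifier

def evaluate_candidate(nn, nt, X_train, y_train, X_val, y_val):
    net = MLPClassifier(hidden_layer_sizes=(nn,),  # single hidden layer
                        activation="logistic",     # sigmoid activation
                        max_iter=nt, random_state=0)
    net.fit(X_train, y_train)
    rate = 100.0 * net.score(X_val, y_val)  # criterion 1: maximize
    return rate, nn                         # criterion 2: minimize neurons
```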
We applied the multi-objective optimization approach to neural network parameter optimization, with the following criteria: maximizing the classification rate and minimizing the number of network neurons. As a result, we obtain a Pareto optimal set of neural networks. To enable a comparison of the single- and multi-objective approaches, we combine the Pareto optimal neural network classifiers into an ensemble, fusing the outputs of the networks by several techniques (a code sketch of all three schemes follows the list):
1. Voting: the class predicted by the majority of the neural nets is chosen as the final prediction.
2. Averaging the class probabilities: the posterior probabilities of each class are averaged over all neural networks in the ensemble.
3. Adding an auxiliary meta-classification layer: the training dataset is divided into two parts, the first of which is used to train the ensemble classifiers. The output posterior class probabilities of all ensemble classifiers are treated as input variables, and the second part of the training dataset is used to train an auxiliary SVM meta-classifier, which outputs the resulting class prediction.
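In the sketch below, `nets` is assumed to be a list of fitted classifiers exposing scikit-learn-style predict and predict_proba methods, with integer class labels; these interface details are our assumptions.

```python
# Sketches of the three ensemble output fusion schemes.
import numpy as np
from sklearn.svm import SVC

def fuse_voting(nets, X):
    # Scheme 1: majority vote over the member predictions
    votes = np.stack([net.predict(X) for net in nets]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

def fuse_average_probabilities(nets, X):
    # Scheme 2: average the posterior class probabilities, pick the max
    proba = np.mean([net.predict_proba(X) for net in nets], axis=0)
    return proba.argmax(axis=1)

def fit_meta_classifier(nets, X_meta, y_meta):
    # Scheme 3: stack the members' class probabilities and train an
    # auxiliary SVM on the second, held-out part of the training set
    Z = np.hstack([net.predict_proba(X_meta) for net in nets])
    return SVC().fit(Z, y_meta)

def fuse_meta(nets, meta, X):
    Z = np.hstack([net.predict_proba(X) for net in nets])
    return meta.predict(Z)
```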
Genetic algorithms were chosen for solving the optimization tasks described above. We used a Co-evolutionary Genetic Algorithm for single-objective optimization, and several algorithms for multi-objective optimization: the Strength Pareto Evolutionary Algorithm (SPEA) [12], the Non-dominated Sorting Genetic Algorithm (NSGA-2) [13], the Vector Evaluated Genetic Algorithm (VEGA) [14] and the Self-configuring Co-evolutionary Multi-objective Genetic Algorithm (SelfCOMOGA) [15].
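Most of these multi-objective algorithms rely on the Pareto dominance relation between candidate solutions (VEGA instead selects via per-objective subpopulations). Below is a minimal sketch of that relation, assuming two objectives: classification rate (maximized) and structure size (minimized).

```python
# Minimal Pareto dominance check and non-dominated filtering for
# (rate, size) pairs; rate is maximized, size is minimized.
def dominates(a, b):
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def pareto_front(candidates):
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Example with (classification rate %, number of neurons):
front = pareto_front([(39.8, 40), (36.0, 8), (33.0, 25), (30.0, 5)])
print(front)  # (33.0, 25) is dominated by (36.0, 8); the rest remain
```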
Experimental results. We performed a series of experiments applying the proposed multi-objective optimization approach to feature selection and to neural network parameter optimization. The experiments were conducted on 5 datasets: an audio features dataset, 3 video features datasets (QLZM, LBP and LBP-TOP), and an audio-visual dataset that is a combination of the audio and video features datasets.
The emotion classification rates, as well as the reduced numbers of features obtained by the different dimensionality reduction techniques, are given in tab. 1. As can be observed, the multi-objective approach to feature selection provides the highest classification rate on 4 out of 5 datasets, leaving second place to the single-objective approach and third place to PCA. The highest achieved emotion classification rate is 45.7 %.
Tab. 2 compares the single- and multi-objective approaches to neural network parameter optimization across different ensemble fusion schemes and multi-objective optimization algorithms. According to the results, the multi-objective approach outperformed the single-objective approach on all 5 datasets. The highest achieved emotion classification rate is 39.8 %, which is fairly high for such a complex problem, taking into account that the baseline model, always predicting the most frequent class label from the train set (neutral, which accounts for 120 of the 480 SAVEE recordings), would yield 25 % accuracy.
Conclusion. In this work we addressed two crucial steps of building an emotion recognition system: dimensionality reduction and classifier training. Applying the multi-objective optimization approach to these two steps helped to achieve a 45.7 % classification rate.
Our research showed that the proposed approach is useful in both the feature selection and neural network parameter optimization procedures, so we recommend using it in further research connected with emotion recognition.
Table 1
Emotion classification rate (%), dimensionality reduction approaches comparison
(cells show classification rate / reduced number of features)

Dataset                               Audio         QLZM          LBP          LBP-TOP      Audio + video
Initial number of features            991           656           59           177          1883
All features                          28.542        10.506        20.486       22.847       19.732
Principal components analysis         35.923 / 131  21.458 / 36   23.75 / 4    32.017 / 10  31.718 / 180
Feature selection, single-objective   38.095 / 476  20.208 / 301  25.972 / 33  40.278 / 77  33.661 / 902
Feature selection, multi-objective    39.702 / 484  24.911 / 319  25.694 / 31  45.694 / 90  35.893 / 885
Table 2
Emotion classification rate (%), neural networks optimization and ensemble forming

Optimization algorithm        Fusion scheme                  Audio    QLZM     LBP      LBP-TOP  Audio + video
(number of objectives)
Co-evolutionary GA (1)        -                              35.923   21.458   23.75    32.917   31.718
SPEA (2)                      Voting                         31.012   16.319   16.667   34.167   27.292
                              Average class probabilities    16.994   10.903   16.458   39.583   14.256
                              SVM meta-classifier            28.631   16.042   18.264   34.583   25.06
NSGA-2 (2)                    Voting                         29.226   21.181   19.236   33.403   24.554
                              Average class probabilities    29.435   14.722   16.667   17.639   23.571
                              SVM meta-classifier            39.762   11.528   17.5     38.125   34.94
VEGA (2)                      Voting                         33.839   17.5     24.514   32.639   22.5
                              Average class probabilities    27.262   24.306   20.069   21.042   15.119
                              SVM meta-classifier            38.899   13.958   29.167   36.736   37.292
SelfCOMOGA (2)                Voting                         26.577   20.347   33.125   36.25    19.94
                              Average class probabilities    23.244   15.935   25.417   22.708   17.768
                              SVM meta-classifier            36.518   26.756   38.333   36.319   29.405
References
1. Rashid M., Abu-Bakar S. A. R., Mokji M. Human emotion recognition from videos using spatio-temporal and audio features. The Visual Computer, 2012, P. 1269-1275.
2. Kahou S. E., Pal C., Bouthillier X., Froumenty P., Gulcehre C., Memisevic R., Vincent P., Courville A., Bengio Y. Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction, Sydney, Australia, December 9-13, 2013, P. 543-550.
3. Cruz A., Bhanu B., Thakoor N. Facial emotion recognition in continuous video. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012), Tsukuba, Japan, November 11-15, 2012, P. 1880-1883.
4. Soleymani M., Pantic M., Pun T. Multimodal emotion recognition in response to videos. IEEE Transactions on affective computing, 2012, Vol. 3, No. 2, P. 211-223.
5. Busso C., Deng Z., Yildirim S., Bulut M., Lee C. M., Kazemzadeh A., Lee S., Neumann U., Narayanan S. Analysis of Emotion Recognition using Facial Expressions, Speech and Multimodal Information. In Proceedings of the 6th international conference on Multimodal interfaces, Los Angeles, 2004, P. 205-211.
6. Haq S., Jackson P. J. B. Speaker-dependent audiovisual emotion recognition. In Proceedings Int. Conf. on Auditory-Visual Speech Processing (AVSP'09), Norwich, UK, September 2009, P. 53-58.
7. Eyben F., Wullmer M., Schuller B. OpenSMILE - the Munich versatile and fast open-source audio feature extractor. In Proceedings ACM Multimedia (MM), Florence, Italy, 2010, P. 1459-1462.
8. Sariyanidi E., Gunes H., Gokmen M., Cavallaro A. Local Zernike moment representation for facial affect recognition. Proc. of British Machine Vision Conference, 2013, P. 1-13.
9. Ojala T., Pietikainen M., Harwood D. A comparative study of texture measures with classification based on feature distributions. Pattern Recognition 29, 1996, P. 51-59.
10. Zhao G., Pietikainen M. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Analysis and Machine Intelligence 29(6), 2007, P. 915-928.
11. Sidorov M., Brester C., Semenkin E., Minker W. Speaker state recognition with neural network-based classification and self-adaptive heuristic feature selection. In Proceedings International Conference on Informatics in Control, Automation and Robotics (ICINCO), 2014, P. 699-703.
12. Zitzler E., Thiele L. An evolutionary algorithm for multiobjective optimization: the strength Pareto approach. Swiss Federal Institute of Technology, Zurich, Switzerland, TIK-Report No. 43, May 1998, P. 1-40.
13. Deb K., Pratap A., Agarwal S., Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. on Evolutionary Computation, 2002, Vol. 6, No. 2, P. 182-197.
14. Schaffer J. D. Multiple objective optimization with vector evaluated genetic algorithms. Proc. of the 1st International Conference on Genetic Algorithms, 1985, P. 93-100.
15. Ivanov I. A., Sopov E. A. [Self-configuring genetic algorithm for solving multi-objective choice support problems]. Vestnik SibGAU, 2013, No. 1 (47), P. 30-35 (In Russ.).
© Ivanov I. A., 2016