
DEVELOPING A SPEECH EMOTION RECOGNITION SYSTEM USING CNN ENCODERS WITH ATTENTION FOCUS

Valentina Mamutova1, Alpamis Kutlimuratov2, Temur Ochilov3

1Department of Telecommunication Engineering, Nukus branch of Tashkent University of Information Technologies Named after Muhammad Al-Khwarizmi, Nukus city, Uzbekistan

2Department of Information-Computer Technologies and Programming, Tashkent University of Information Technologies Named after Muhammad Al-Khwarizmi, Tashkent 100200, Uzbekistan

3Department of Computer Systems, Tashkent University of Information Technologies Named after Muhammad Al-Khwarizmi, Tashkent 100200, Uzbekistan

https://doi.org/10.5281/zenodo.7864652

Abstract. The study aimed to improve speech emotion recognition (SER) models by developing a new model that can meticulously learn human emotions from speech. The model uses attention-oriented parallel convolutional neural network (CNN) encoders to extract and interpret various crucial speech features from raw speech data, which are then used for emotion classification. Specifically, the model encodes MFCC, paralinguistic, and speech spectrogram features by designing a dedicated CNN architecture for each feature.

Keywords: Speech emotion recognition, MFCC, CNN, Spectrogram

Introduction

Speech emotion recognition (SER) is the process of identifying the emotional state of a speaker based on their speech. Emotions are an important aspect of human communication, and being able to recognize them can have several applications, including in human-computer interaction, speech therapy, and mental health diagnosis. SER [1] involves several steps, including feature extraction, feature selection, and classification. Feature extraction involves extracting relevant features from the speech signal, such as pitch, intensity, and spectral features. Feature selection involves selecting the most relevant features for the classification task. Classification involves using a machine learning model to classify the emotional state of the speaker based on the extracted features. Several machine learning models have been developed for SER, including traditional models such as Support Vector Machines (SVM) [2] and more recent deep learning models [3] such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) [4,5]. SVM works by finding the optimal hyperplane that separates the data into different classes. CNN works by applying convolutional filters to the input speech signal to extract relevant features. RNN works by processing the input speech signal sequentially, allowing it to capture temporal dependencies between features [8].
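As an illustration of this three-stage pipeline, the sketch below extracts pitch-, intensity-, and spectrum-related features, selects the most discriminative ones, and classifies them with an SVM. The use of librosa and scikit-learn, the specific feature choices, and the placeholder data are assumptions made for illustration, not details taken from this paper.

```python
import numpy as np
import librosa
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def extract_features(path: str) -> np.ndarray:
    """Pitch-, intensity-, and spectrum-related summary features for one utterance."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # spectral features
    rms = librosa.feature.rms(y=y)                            # intensity
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectral shape
    return np.concatenate([mfcc.mean(axis=1), rms.mean(axis=1), centroid.mean(axis=1)])

# In a real experiment, X would be built by applying extract_features to every
# labeled utterance; random placeholders are used here so the sketch runs standalone.
X = np.random.randn(100, 15)
labels = np.random.choice(["happy", "sad", "angry", "neutral"], size=100)

# Feature selection (keep the 10 most discriminative features) followed by SVM classification.
clf = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=10), SVC(kernel="rbf"))
clf.fit(X, labels)
print(clf.predict(X[:3]))
```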

SER has several challenges, including variability in speech signals due to different speakers, languages, and dialects, as well as variability in emotional expressions due to cultural and individual differences. SER also requires a large amount of labeled data for training and evaluation, which can be a limitation in some applications [9].

Despite these challenges, SER has several potential applications, including in human-computer interaction, speech therapy, and mental health diagnosis. For example, SER can be used to develop intelligent virtual assistants that can recognize and respond to the emotional state of the user. SER can also be used in speech therapy to help individuals with communication disorders recognize and express emotions. Finally, SER can be used in mental health diagnosis to help identify individuals with depression, anxiety, and other mental health conditions based on their speech patterns [10-11].

Main part

In this research, we aimed to construct a new SER model that employs parallel CNN encoders with an attention-oriented approach. The model (Figure 1) acquires the features important for emotion classification by encoding the speech spectrogram and the paralinguistic speech features with separate CNN encoders in parallel; the resulting representations are then fed to attention mechanisms for further refinement before classification [12].

This study has made several significant contributions towards improving SER. The following are the key contributions of this research:

> Firstly, the study has enhanced the SER model architecture by utilizing attention-oriented parallel CNN encoders that effectively capture salient speech spectrogram and low-level (paralinguistic) feature representations. This approach allows for more accurate and efficient classification of emotions in speech signals.

> Secondly, the research has successfully incorporated low-level (paralinguistic) features, which are often neglected in traditional SER models. These features can provide valuable information related to the emotional state of the speaker, such as tone, pitch, and intonation.

> Thirdly, the study has improved the generalization capability of SER models. By utilizing parallel CNN encoders with attention mechanisms, the model can better adapt to different emotional expressions and speech patterns, improving its ability to generalize to new and unseen data.

> Fourthly, the study has effectively managed the challenge of processing speech utterances of varying lengths. By incorporating attention mechanisms, the model can focus on the most important features of the speech signal regardless of its length, leading to more accurate and efficient emotion recognition.


Figure 1. The workflow of the developed SER model
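The sketch below illustrates, in simplified form, the workflow shown in Figure 1: two CNN encoders process the speech spectrogram and the MFCC/paralinguistic features in parallel, each encoder output passes through an attention block, and the fused representation feeds an emotion classifier. PyTorch, the layer sizes, and the number of emotion classes are illustrative assumptions rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """2-D CNN that turns a (batch, 1, freq, time) feature map into a sequence of vectors."""
    def __init__(self, out_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.proj = nn.LazyLinear(out_dim)

    def forward(self, x):
        h = self.conv(x)                      # (batch, channels, freq', time')
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, time', channels * freq')
        return self.proj(h)                   # (batch, time', out_dim)

class Attention(nn.Module):
    """Scores each time step and returns the attention-weighted sum."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, seq):
        w = torch.softmax(self.score(seq), dim=1)  # (batch, time, 1)
        return (w * seq).sum(dim=1)                # (batch, dim)

class ParallelSER(nn.Module):
    def __init__(self, n_emotions: int = 7, dim: int = 128):
        super().__init__()
        self.spec_enc, self.mfcc_enc = CNNEncoder(dim), CNNEncoder(dim)
        self.spec_att, self.mfcc_att = Attention(dim), Attention(dim)
        self.classifier = nn.Linear(2 * dim, n_emotions)

    def forward(self, spectrogram, mfcc):
        spec_vec = self.spec_att(self.spec_enc(spectrogram))
        mfcc_vec = self.mfcc_att(self.mfcc_enc(mfcc))
        return self.classifier(torch.cat([spec_vec, mfcc_vec], dim=-1))

# Example forward pass with dummy inputs: a 128-bin spectrogram and 13 MFCCs over 300 frames.
model = ParallelSER()
logits = model(torch.randn(2, 1, 128, 300), torch.randn(2, 1, 13, 300))
print(logits.shape)  # torch.Size([2, 7])
```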

Mel Frequency Cepstral Coefficients (MFCC) is a feature extraction technique commonly used in speech processing and speech recognition tasks, including speech emotion recognition. MFCC is based on the human auditory system's perception of sound, which is more sensitive to changes in frequency at lower frequencies than at higher frequencies.

The MFCC feature extraction process involves several steps. First, the speech signal is divided into short frames, typically 20-30 milliseconds in duration. Each frame is then windowed using a Hamming window to reduce spectral leakage. The power spectrum of each frame is then computed using the Fast Fourier Transform (FFT) [13].

Next, the power spectrum is transformed into the Mel frequency scale, which is a nonlinear scale that approximates the human auditory system's perception of sound. This is done by applying a filterbank of triangular filters that are spaced uniformly on the Mel scale. The output of each filter is then summed to obtain the Mel frequency spectrum.

Finally, the logarithm of the Mel frequency spectrum is taken and transformed into the cepstral domain using the Discrete Cosine Transform (DCT). The resulting coefficients are known as MFCC and are used as features for speech processing and speech recognition tasks.
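A minimal sketch of this extraction chain is shown below, using the librosa library (an implementation choice assumed here, not specified in the paper); framing, Hamming windowing, FFT, Mel filtering, log compression, and DCT are performed internally by librosa.feature.mfcc.

```python
import librosa

y, sr = librosa.load("speech.wav", sr=16000)  # raw waveform (file name is a placeholder)

frame_len = int(0.025 * sr)  # 25 ms frames
hop_len = int(0.010 * sr)    # 10 ms hop between frames

mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=frame_len, hop_length=hop_len,
    window="hamming", n_mels=40,
)
print(mfcc.shape)  # (13, number_of_frames)
```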

MFCC has several advantages over other feature extraction techniques, including its ability to capture the spectral characteristics of speech signals and its robustness to noise and channel distortions. MFCC has been widely used in speech processing and speech recognition tasks, including speech emotion recognition, where it has been shown to be effective in capturing emotional features of speech signals.

A spectrogram is a visual representation of the frequency content of a signal over time. It is commonly used in speech processing and speech recognition tasks, including speech emotion recognition. A spectrogram displays the frequency content of a signal on the y-axis, the time on the x-axis, and the amplitude of the frequency content as a color or grayscale intensity.

The spectrogram provides a detailed view of the frequency content of a signal over time, allowing for the identification of specific frequency components and their temporal characteristics. In speech processing and speech recognition tasks, the spectrogram is often used to identify phonemes, which are the basic units of speech sounds [14].

The spectrogram has several advantages over other feature extraction techniques, including its ability to capture the spectral characteristics of speech signals and its ability to provide a detailed view of the frequency content of a signal over time. However, the spectrogram also has limitations, including its sensitivity to noise and its inability to capture temporal dependencies between features.
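A short sketch of computing a power spectrogram (and its log-scaled version, a common CNN input) is given below; librosa, the FFT size, and the hop length are assumptions made for illustration.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)  # file name is a placeholder

# Short-time Fourier transform: frequency along one axis, time along the other.
stft = librosa.stft(y, n_fft=512, hop_length=160, window="hamming")
spectrogram = np.abs(stft) ** 2              # power spectrogram
log_spec = librosa.power_to_db(spectrogram)  # dB scale, convenient as CNN input

print(log_spec.shape)  # (frequency_bins, time_frames)
```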

The waveform is the graphical representation of the speech signal over time. The waveform displays the amplitude of the speech signal on the y-axis and the time on the x-axis. The waveform is a fundamental representation of the speech signal and is often used in speech processing and speech recognition tasks, including SER.

The waveform can provide important information about the speech signal, including its duration, intensity, and frequency content. In SER, the waveform can be used to identify changes in the fundamental frequency of the speaker's voice, which is often associated with changes in emotional state. For example, an increase in pitch may indicate excitement or happiness, while a decrease in pitch may indicate sadness or depression.
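The sketch below estimates the fundamental frequency track directly from the waveform, the kind of pitch cue described above; the use of librosa.pyin and the chosen frequency range are illustrative assumptions.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)  # file name is a placeholder

# Frame-wise fundamental frequency (pitch) estimation; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Mean pitch over voiced frames; a higher mean may accompany excitement,
# a lower one sadness, as described above.
print(np.nanmean(f0))
```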

The attention mechanism is a deep learning technique that has been applied to SER tasks in recent years. It allows the model to focus on the parts of the input speech signal that are most relevant to the classification task, improving the model's performance and interpretability.

In SER, the attention mechanism works by assigning weights to different parts of the input speech signal based on their relevance to the emotional state of the speaker. The weights are learned during training and are used to compute a weighted sum of the input features, which is then used as input to the classification model.
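The following sketch shows one common way to realize this weighting: a learned score per time step, normalized with a softmax and used to form a weighted sum of the features. PyTorch and the feature dimension are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, feature_dim: int):
        super().__init__()
        self.score = nn.Linear(feature_dim, 1)  # learned relevance score per time step

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time_steps, feature_dim)
        weights = torch.softmax(self.score(features), dim=1)  # (batch, time_steps, 1)
        return (weights * features).sum(dim=1)                # (batch, feature_dim)

pooled = AttentionPooling(128)(torch.randn(4, 200, 128))
print(pooled.shape)  # torch.Size([4, 128])
```

Because the weights are normalized over time, the same layer handles inputs of different lengths, which also underlies the fourth contribution listed above.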

The attention mechanism has several advantages over traditional deep learning models, including its ability to capture temporal dependencies between features and to handle variable-length input sequences. It has been shown to improve the performance of deep learning models on SER tasks, particularly when combined with other techniques such as CNN [6,7] and RNN.

One of the main benefits of the attention mechanism in SER is its interpretability. It allows the model to identify the specific parts of the input speech signal that are most relevant to the emotional state of the speaker, providing insights into the features that matter most for emotion classification. This can be particularly useful in applications such as speech therapy, where understanding which features contribute to emotional expression can help individuals with communication disorders improve how they express emotions.

Finally, the research has introduced a novel SER methodology that has outperformed baseline models in terms of accuracy. The proposed model has demonstrated state-of-the-art performance on benchmark SER datasets, highlighting the potential of this approach for real-world applications.

Overall, the contributions of this study have advanced the field of SER, highlighting the potential of attention-oriented parallel CNN encoders and low-level feature representation for improving the accuracy, efficiency, and generalization capabilities of SER models.


REFERENCES

1. Makhmudov, F.; Kutlimuratov, A.; Akhmedov, F.; Abdallah, M.S.; Cho, Y.-I. Modeling Speech Emotion Recognition via Attention-Oriented Parallel CNN Encoders. Electronics 2022, 11, 4047. https://doi.org/10.3390/electronics1123404

2. Kutlimuratov, A.; Abdusalomov, A.; Whangbo, T.K. Evolving Hierarchical and Tag Information via the Deeply Enhanced Weighted Non-Negative Matrix Factorization of Rating Predictions. Symmetry 2020, 12, 1930.

3. Ilyosov, A.; Kutlimuratov, A.; Whangbo, T.-K. Deep-Sequence-Aware Candidate Generation for e-Learning System. Processes 2021, 9, 1454. https://doi.org/10.3390/pr9081454.

4. Kutlimuratov, A.; Abdusalomov, A.B.; Oteniyazov, R.; Mirzakhalilov, S.; Whangbo, T.K. Modeling and Applying Implicit Dormant Features for Recommendation via Clustering and Deep Factorization. Sensors 2022, 22, 8224. https://doi.org/10.3390/s22218224.

5. Safarov F, Kutlimuratov A, Abdusalomov AB, Nasimov R, Cho Y-I. Deep Learning Recommendations of E-Education Based on Clustering and Sequence. Electronics. 2023; 12(4):809. https://doi.org/10.3390/electronics12040809

6. Abdusalomov, A.; Baratov, N.; Kutlimuratov, A.; Whangbo, T.K. An Improvement of the Fire Detection and Classification Method Using YOLOv3 for Surveillance Systems. Sensors 2021, 21, 6519. https://doi.org/10.3390/s21196519.

7. Abdusalomov, A.B.; Mukhiddinov, M.; Kutlimuratov, A.; Whangbo, T.K. Improved Real-Time Fire Warning System Based on Advanced Technologies for Visually Impaired People. Sensors 2022, 22, 7305. https://doi.org/10.3390/s22197305.

8. Kuchkorov, T., Khamzaev, J., Allamuratova, Z., & Ochilov, T. (2021, November). Traffic and road sign recognition using deep convolutional neural network. In 2021 International Conference on Information Science and Communications Technologies (ICISCT) (pp. 1-5). IEEE. DOI: 10.1109/ICISCT52966.2021.9670228

9. Khamzaev J., Yaxshiboyev R., Ochilov T., Siddiqov B. Driver sleepiness detection using convolution neural network. Central Asian Journal of Education and Computer Sciences. VOLUME 1, ISSUE 4, AUGUST 2022(CAJECS), ISSN: 2181-3213

10. Kuchkarov T. A., Hamzayev J. F., Allamuratova Z. J. Tracking the flow of motor vehicles on the roads with YOLOv5 and deepsort algorithms. Международной научной конференции, Минск, 23 ноября 2022 / Белорусский государственный университет информатики и радиоэлектроники ; редкол.: Л. Ю. Шилин [и др.]. - Минск : БГУИР, 2022. - С. 61-62. https://libeldoc.bsuir.by/handle/123456789/49250

11. Kuchkorov, T. A., Hamzayev, J. F., & Ochilov, T. D. (2021). INTELLEKTUAL TRANSPORT TIZIMI ILOVALARI UCHUN SUN'IY INTELLEKT TEXNOLOGIYALARIDAN FOYDALANISH. Вестник КГУ им. Бердаха, № 2, 107.

12. Allamuratova Z.J., Hamzayev J.F., Kuchkorov T.A. (2022). Avtotransportlar oqimini intellektual tahlil qilishda chorrahalar turlarining tirbandlikka ta'siri. Вестник КГУ им. Бердаха, № 3, 22-26.

13. Усманов Р. Н., Отениязов Р. И., Алламуратова З. Ж., Кучкаров Т. А. Геоинформационное моделирование природно-техногенных объектов на экологически напряженных территориях. Проблемы вычислительной и прикладной математики. 2017. № 6(12). С. 55-62. EDN YUKLSH.

14. Allamuratova Z. J., Raxmonov Sh. M. Telekommunikatsiya transport aloqa tarmog'ini optimallashtirish. INTERNATIONAL CONFERENCE ON LEARNING AND TEACHING, № 1, 83-86.
