INTERNATIONAL SCIENTIFIC AND TECHNICAL CONFERENCE "DIGITAL TECHNOLOGIES: PROBLEMS AND SOLUTIONS OF PRACTICAL IMPLEMENTATION IN THE SPHERES" APRIL 27-28, 2023
CHALLENGES OF SPEECH EMOTION RECOGNITION SYSTEM MODELING AND THEIR SOLUTIONS
Alpamis Kutlimuratov1, Elyor Gaybulloev2
1Department of Information-Computer Technologies and Programming, Tashkent University of Information Technologies Named after Muhammad Al-Khwarizmi, Tashkent 100200, Uzbekistan
2Department of Applied Informatics, KIMYO International University in Tashkent, Tashkent 100121, Uzbekistan
https://doi.org/10.5281/zenodo.7856088
Abstract. This paper presents potential solutions to the challenges of speech emotion recognition (SER) system modeling. Developing accurate and robust SER models requires addressing several challenges, including variations in speech patterns, limited emotional datasets, overfitting, and the integration of contextual and multimodal information. The solutions presented here can aid in developing SER systems that benefit a range of application domains.
Keywords: Speech Emotion Recognition, Challenges, Data collection
Introduction
Speech Emotion Recognition (SER) is an emerging field that aims to develop computational models for automatically recognizing emotions in speech signals. The ability to accurately recognize emotions in speech has many potential applications [1,2], including in fields such as psychology, mental health, and human-computer interaction.
The task of SER is challenging due to the complexity of emotions and the many ways in which they can be expressed through speech. Emotions can be conveyed through various acoustic features, such as pitch, loudness, and duration, as well as through more subtle features like prosody and speaking rate. Additionally, emotions can be expressed differently in different languages and cultures, making cross-lingual and cross-cultural SER a particularly challenging area of research.
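As a minimal illustration of the acoustic features just mentioned (not a component of any specific SER system), the following numpy sketch computes two simple frame-level descriptors on a synthetic tone standing in for recorded speech: short-time energy, a proxy for loudness, and zero-crossing rate, a coarse correlate of pitch and voicing. The signal, sampling rate, and frame settings are illustrative assumptions.

```python
import numpy as np

def frame_features(signal, frame_len=400, hop=160):
    """Compute simple frame-level acoustic features often used in SER:
    short-time energy (a loudness proxy) and zero-crossing rate
    (a coarse correlate of pitch/voicing)."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.mean(frame ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        feats.append((energy, zcr))
    return np.array(feats)

# Synthetic 1-second 220 Hz tone as a stand-in for recorded speech.
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
features = frame_features(tone)
print(features.shape)  # (frames, 2): energy and ZCR per frame
```

In practice such hand-crafted descriptors are typically replaced or supplemented by spectral features (e.g. MFCCs) and learned representations, but the framing-and-aggregation pattern is the same.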
Figure 1. The process of SER system modeling (speech signal → deep neural network → emotions)
Despite these challenges, significant progress has been made in SER in recent years, with researchers developing a range of machine learning and deep learning algorithms [3,4] that can effectively recognize emotions in speech signals. These models are typically trained on large datasets of labeled speech data that have been annotated with emotion labels by human experts [8].
The potential applications of SER are vast and diverse, ranging from developing more natural and empathetic chatbots to improving social communication for individuals with autism or other developmental disorders. However, there are also limitations and challenges associated with SER, including the limited availability of annotated speech data and the variability in emotional expression.
Main part
With the increasing popularity of virtual assistants, smart homes, and human-robot interaction, SER systems have become more critical than ever. However, developing an accurate SER system is a challenging task due to the variability in emotional expression across different languages, cultures, and individuals. Additionally, speech signals can be affected by environmental factors such as background noise, reverberation, and speech disorders, further complicating the modeling process. In this context, this article discusses the challenges associated with SER system modeling and possible solutions to overcome them, emphasizing the latest developments in the field. While significant progress has been made in recent years, there are still several challenges in SER that researchers are trying to overcome. Here are some of the main challenges:
1. Data collection and annotation: The availability of large-scale annotated datasets is critical for training accurate SER models, yet collecting and annotating speech data with emotional labels is a challenging and time-consuming task. This is one of the biggest challenges in SER, but several strategies can be employed to address it:
• Crowdsourcing: Crowdsourcing platforms like Amazon Mechanical Turk can be used to collect and annotate speech data. This approach can help to collect large volumes of data quickly and cost-effectively.
• Data augmentation: Data augmentation techniques can be used to increase the size of existing datasets by generating new samples from existing data. This approach can help to reduce the need for extensive data collection.
• Collaborations with speech therapy clinics: Speech therapy clinics can provide a source of data for emotion recognition research. These clinics often work with patients who have speech disorders, and the recordings from these sessions can be used to build emotion recognition models.
• Multilingual and multicultural data collection: Collecting speech data from multiple languages and cultures can help to increase the diversity of the dataset. This can help to improve the robustness of the emotion recognition system and ensure that it can generalize well across different populations.
• Active learning: Active learning is a machine learning technique that involves iteratively selecting samples for annotation based on their potential to improve the performance of the model. This approach can help to reduce the amount of labeled data required for training an accurate model.
• Transfer learning: Transfer learning is a technique that involves using a pre-trained model as a starting point for training a new model on a different dataset. This approach can help to reduce the amount of labeled data required for training a new model and can be particularly useful when working with limited data.
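To make the data-augmentation strategy above concrete, here is a minimal numpy sketch (an illustrative assumption, not a prescribed pipeline) that generates new training waveforms from one utterance by adding noise at a target signal-to-noise ratio and applying a random time shift:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(signal, snr_db=20.0):
    """Additive Gaussian noise at a target signal-to-noise ratio (dB)."""
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)

def time_shift(signal, max_shift=1600):
    """Randomly shift the waveform in time (circularly)."""
    return np.roll(signal, rng.integers(-max_shift, max_shift))

def augment(signal):
    return time_shift(add_noise(signal))

# One "utterance" (synthetic tone) expanded into 5 training samples.
base = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
augmented = [augment(base) for _ in range(5)]
```

Speed perturbation, pitch shifting, and SpecAugment-style spectrogram masking follow the same idea: cheap label-preserving transformations that multiply the effective dataset size.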
2. Variability in emotional expression: Emotions can be expressed in different ways, and people can be in different emotional states even when expressing the same emotion [9,10]. This variability makes it difficult to develop a SER system that can accurately identify emotions across different speakers and contexts. Multiple methods can be utilized to address this obstacle:
• Multi-modal data fusion: Emotions can be expressed through various modalities, such as speech, facial expressions, and body language. Combining information from multiple modalities can help to improve the accuracy of emotion recognition systems.
• Feature selection: Feature selection involves [5] selecting the most relevant features that can help to distinguish between different emotions. This approach can help to reduce the effects of variability in emotional expression and improve the accuracy of emotion recognition systems.
• Contextual information: Contextual information, such as the topic of conversation or the speaker's background, can provide valuable cues for identifying emotions. Incorporating this information into emotion recognition systems can help to improve their accuracy.
• Adversarial training: Adversarial training involves training emotion recognition models to be robust to adversarial examples, where the input data is intentionally perturbed to mislead the model. This approach can help to improve the robustness of emotion recognition systems to variability in emotional expression.
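The feature-selection strategy above can be sketched with a Fisher-score ranking: features whose between-class variance is large relative to their within-class variance are the ones that separate emotion classes despite speaker variability. The toy data below (3 informative features plus 7 noise features) is an illustrative assumption.

```python
import numpy as np

def fisher_score(X, y):
    """Rank features by between-class vs. within-class variance:
    features that separate emotion classes well score high."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)

rng = np.random.default_rng(1)
# Toy data: 2 emotion classes, 3 informative + 7 noise features.
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 10))
X[y == 1, :3] += 2.0  # class 1 shifted on the first 3 features
scores = fisher_score(X, y)
top3 = set(np.argsort(scores)[-3:])
print(top3)  # recovers the informative features
```

The same ranking can then feed any classifier; keeping only high-scoring features discards dimensions dominated by speaker- or context-specific variation.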
3. Multimodal input: Emotions are expressed not only through speech but also through other modalities, such as facial expressions and body language. Incorporating these additional modalities into SER systems can improve their accuracy, but it also increases the complexity of the system. There are also several solutions that can be used to overcome this challenge:
• Feature extraction: Feature extraction involves extracting relevant features from different modalities, such as speech, facial expressions, and body language. By combining features from multiple modalities, researchers can develop more accurate and robust SER models.
• Data fusion: Data fusion involves combining data from multiple modalities into a single representation. This approach can help to improve the accuracy of emotion recognition systems by leveraging the complementary information provided by different modalities.
• Deep learning: Deep learning approaches [6,7], such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), can be used to learn representations from multimodal data. These representations can then be used to classify emotions, leading to more accurate and robust SER models [11].
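The two fusion strategies above can be contrasted in a few lines. This numpy sketch (the emotion labels, weights, and feature dimensions are illustrative assumptions) shows early fusion as feature concatenation and late fusion as a weighted average of per-modality class probabilities:

```python
import numpy as np

def early_fusion(feat_speech, feat_face):
    """Early (feature-level) fusion: concatenate modality features
    into a single representation for one classifier."""
    return np.concatenate([feat_speech, feat_face], axis=-1)

def late_fusion(prob_speech, prob_face, w_speech=0.6):
    """Late (decision-level) fusion: combine per-modality emotion
    probabilities with a weighted average."""
    return w_speech * prob_speech + (1 - w_speech) * prob_face

# Toy per-modality posteriors over 4 emotions (angry, happy, sad, neutral).
p_speech = np.array([0.1, 0.6, 0.2, 0.1])
p_face = np.array([0.2, 0.5, 0.1, 0.2])
fused = late_fusion(p_speech, p_face)
print(fused)             # still a valid distribution (sums to 1)
print(np.argmax(fused))  # predicted emotion index -> 1 (happy)
```

Early fusion lets one model learn cross-modal interactions but requires aligned inputs; late fusion is simpler and tolerates a missing modality, at the cost of ignoring those interactions.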
4. Cross-lingual and cross-cultural differences: Different languages and cultures have different emotional expressions and norms. Developing a SER system that can accurately identify emotions across different languages and cultures is a challenging task. Several strategies can be used to overcome this challenge:
• Multilingual and multicultural data collection: Collecting speech data from multiple languages and cultures can help to increase the diversity of the dataset. This approach can help to
improve the robustness of the emotion recognition system and ensure that it can generalize well across different populations.
• Feature normalization: Feature normalization involves standardizing the features used by emotion recognition systems to account for cross-lingual and cross-cultural differences. This approach can help to reduce the effects of differences in language and culture on the accuracy of the emotion recognition system.
• Domain adaptation: Domain adaptation involves adapting emotion recognition models to new domains, such as different languages and cultures. This approach can help to improve the accuracy of the emotion recognition system by incorporating the unique characteristics of different domains.
• Multimodal input: Multimodal input can be used to provide additional context that can help to account for cross-lingual and cross-cultural differences. By incorporating information from multiple modalities, researchers can develop more accurate and robust SER models that are better equipped to handle cross-lingual and cross-cultural differences [12].
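The feature-normalization strategy above is often realized as per-corpus z-scoring: standardizing each feature within each language- or culture-specific corpus removes corpus-level offsets and scale differences before training a shared model. A minimal numpy sketch, with two synthetic corpora as an illustrative assumption:

```python
import numpy as np

def per_corpus_zscore(X, corpus_ids):
    """Normalize each feature to zero mean / unit variance *within*
    each corpus, reducing language- and recording-specific offsets."""
    Xn = np.empty_like(X, dtype=float)
    for c in np.unique(corpus_ids):
        mask = corpus_ids == c
        mu = X[mask].mean(axis=0)
        sd = X[mask].std(axis=0) + 1e-12
        Xn[mask] = (X[mask] - mu) / sd
    return Xn

rng = np.random.default_rng(2)
# Two corpora whose features sit at very different offsets/scales.
X = np.vstack([rng.normal(5.0, 2.0, (20, 3)),
               rng.normal(-1.0, 0.5, (20, 3))])
ids = np.repeat(["corpus_a", "corpus_b"], 20)
Xn = per_corpus_zscore(X, ids)
```

The same idea underlies speaker-level normalization; the key is that statistics are computed per group, not over the pooled data.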
5. Limited availability of training data: While there are some publicly available datasets for SER, they are often small and may not be representative of the broader population. This limits the ability to develop robust and accurate SER systems. There are various approaches that can be employed to tackle this difficulty:
• Semi-supervised learning: Semi-supervised learning involves training models on both labeled and unlabeled data. This approach can be useful when training data is limited, as it can help to leverage the unlabeled data to improve the performance of the emotion recognition system.
• Collaborative data sharing: Collaborative data sharing involves sharing datasets across different research groups or organizations. This approach can help to increase the size of the dataset and improve the accuracy of the emotion recognition system when training data is limited [13].
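A common realization of the semi-supervised strategy above is self-training: fit on the few labeled utterances, pseudo-label the unlabeled pool, and refit on the union. The sketch below (the nearest-centroid classifier and two-cluster toy data are illustrative assumptions, not a proposed SER model) shows the loop in pure numpy:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(X, classes, centroids):
    d = ((X[:, None, :] - centroids[None]) ** 2).sum(-1)
    return classes[np.argmin(d, axis=1)]

def self_train(X_lab, y_lab, X_unlab, rounds=3):
    """Self-training: fit on labeled data, pseudo-label the unlabeled
    pool, and refit on the union - a simple semi-supervised loop."""
    X, y = X_lab, y_lab
    for _ in range(rounds):
        classes, cents = nearest_centroid_fit(X, y)
        pseudo = nearest_centroid_predict(X_unlab, classes, cents)
        X = np.vstack([X_lab, X_unlab])
        y = np.concatenate([y_lab, pseudo])
    return nearest_centroid_fit(X, y)

rng = np.random.default_rng(3)
# Two emotion clusters; only 4 utterances labeled, 100 unlabeled.
X_lab = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [2.9, 3.2]])
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.vstack([rng.normal(0, 0.5, (50, 2)),
                     rng.normal(3, 0.5, (50, 2))])
classes, cents = self_train(X_lab, y_lab, X_unlab)
```

With only four labels, the refit centroids are pulled toward the true cluster centers by the pseudo-labeled pool; confidence thresholds on the pseudo-labels are a standard refinement.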
In conclusion, SER systems play an essential role in improving human-machine interaction and creating more personalized services. However, building a robust and accurate SER system is still a challenging task due to various factors, such as individual differences in emotional expression, language barriers, and environmental factors. In this article, we have discussed some of the primary challenges associated with SER system modeling and presented several solutions to overcome them, such as deep learning techniques, feature engineering, and multimodal data fusion. Despite these solutions, SER research remains an active area of investigation, and further studies are necessary to enhance the performance and generalization capabilities of SER systems. Nonetheless, with the latest advancements in SER research and development, we can expect SER systems to become even more reliable and accessible in the future, paving the way for more natural and intuitive human-machine interaction.
REFERENCES
1. Kutlimuratov, A.; Abdusalomov, A.; Whangbo, T.K. Evolving Hierarchical and Tag Information via the Deeply Enhanced Weighted Non-Negative Matrix Factorization of Rating Predictions. Symmetry 2020, 12, 1930.
2. Makhmudov, F.; Kutlimuratov, A.; Akhmedov, F.; Abdallah, M.S.; Cho, Y.-I. Modeling Speech Emotion Recognition via Attention-Oriented Parallel CNN Encoders. Electronics 2022, 11, 4047. https://doi.org/10.3390/electronics1123404
3. Ilyosov, A.; Kutlimuratov, A.; Whangbo, T.-K. Deep-Sequence-Aware Candidate Generation for e-Learning System. Processes 2021, 9, 1454. https://doi.org/10.3390/pr9081454.
4. Abdusalomov, A.B.; Mukhiddinov, M.; Kutlimuratov, A.; Whangbo, T.K. Improved RealTime Fire Warning System Based on Advanced Technologies for Visually Impaired People. Sensors 2022, 22, 7305. https://doi.org/10.3390/s22197305.
5. Kutlimuratov, A.; Abdusalomov, A.B.; Oteniyazov, R.; Mirzakhalilov, S.; Whangbo, T.K. Modeling and Applying Implicit Dormant Features for Recommendation via Clustering and Deep Factorization. Sensors 2022, 22, 8224. https://doi.org/10.3390/s22218224.
6. Safarov F, Kutlimuratov A, Abdusalomov AB, Nasimov R, Cho Y-I. Deep Learning Recommendations of E-Education Based on Clustering and Sequence. Electronics. 2023; 12(4):809. https://doi.org/10.3390/electronics12040809
7. Abdusalomov, A.; Baratov, N.; Kutlimuratov, A.; Whangbo, T.K. An Improvement of the Fire Detection and Classification Method Using YOLOv3 for Surveillance Systems. Sensors 2021, 21, 6519. https://doi.org/10.3390/s21196519.
8. Kuchkorov, T., Khamzaev, J., Allamuratova, Z., & Ochilov, T. (2021, November). Traffic and road sign recognition using deep convolutional neural network. In 2021 International Conference on Information Science and Communications Technologies (ICISCT) (pp. 1-5). IEEE. DOI: 10.1109/ICISCT52966.2021.9670228
9. Kuchkarov T.A., Hamzayev J.F., Allamuratova Z.J. Tracking the flow of motor vehicles on the roads with YOLOv5 and deepsort algorithms. In: Proceedings of the International Scientific Conference, Minsk, November 23, 2022. Belarusian State University of Informatics and Radioelectronics; ed. board: L. Yu. Shilin et al. Minsk: BSUIR, 2022, pp. 61-62. https://libeldoc.bsuir.by/handle/123456789/49250
10. Kuchkorov, T.A., Hamzayev, J.F., & Ochilov, T.D. (2021). Use of artificial intelligence technologies for intelligent transport system applications. Vestnik of Berdakh Karakalpak State University, no. 2, p. 107.
11. Allamuratova Z.J., Hamzayev J.F., Kuchkorov T.A. (2022). The impact of intersection types on congestion in the intelligent analysis of vehicle traffic flows. Vestnik of Berdakh Karakalpak State University, no. 3, pp. 22-26.
12. Usmanov R.N., Oteniyazov R.I., Allamuratova Z.J., Kuchkarov T.A. Geoinformation modeling of natural-technogenic objects in ecologically stressed territories. Problems of Computational and Applied Mathematics, 2017, no. 6(12), pp. 55-62. EDN YUKLSH.
13. Allamuratova Z.J., Raxmonov Sh.M. Optimization of the telecommunications transport network. International Conference on Learning and Teaching, no. 1, pp. 83-86.