
ԳԻՏԱԿԱՆ ԱՐՑԱԽ • SCIENTIFIC ARTSAKH • НАУЧНЫЙ АРЦАХ, № 2(13), 2022

A CONFORMER BASED AUTOMATED SPEECH RECOGNITION FOR ARMENIAN LANGUAGE*

UDC 004.85 + 004.522 DOI: 10.52063/25792652-2022.2.13-224

DAVIT KARAMYAN

Russian-Armenian University,
Faculty of Informatics and Computer Engineering,
Department of Mathematical Modeling, Computational Methods and Software Complexes,
Ph.D. Student, Yerevan, Republic of Armenia
davitkar98@gmail.com

TIGRAN KARAMYAN

Yerevan State University,
Faculty of Economics and Management,
Department of Mathematical Modeling in Economics,
Lecturer, Ph.D. Student,
Yerevan, Republic of Armenia
t.qaramyan@ysu.am

The article aims to present an Armenian automated speech recognition model and its applications in different fields of the economy. Because of the lack of Armenian speech corpora, we fine-tuned the voice recognition and text symbol generation parts using a pre-trained Conformer model and a compact Armenian language model.

The article focuses the readers' attention on the problem of recognizing human speech and transforming it into text, especially for non-mainstream languages.

The paper is prepared using scientific abstraction and a combined analysis of many recent implementations of the discussed approach. The credibility, relevance and authenticity of the sources have been confirmed through extensive research.

To conclude, although it is quite challenging to develop an ASR model for non-mainstream languages, it was shown that the employment of Conformer-based transformers in conjunction with language models is effective for Armenian speech recognition. It was also shown that the technique employed in this article is applicable to other languages, with some adjustments.

Keywords: automated speech recognition, conformer, language model, n-grams, transformers, word error rate, transfer learning, NeMo.

Introduction. For more than 60 years, researchers have been working on automated speech recognition (ASR). During this period, many industrial products employing ASR have been made, and they have been extremely helpful and widespread among users. ASR enables a machine to turn a speech signal into the corresponding text or command after identifying and processing the voice signal. This process includes the extraction and determination of the acoustic features, the acoustic model, and the language model. Speech recognition relies heavily on the extraction and assessment of acoustic features. The extraction and identification of acoustic features is both an information compression and a signal deconvolution operation (Zhongzhi 580-585).

* The article was submitted on 25.05.2022, reviewed on 22.06.2022, and accepted for publication on 10.07.2022.

One of the main challenges in ASR is handling the wide variety of real-world noise and other acoustically distorting conditions. Because of the diversity and complexity of the voice signal, current speech recognition systems perform satisfactorily only under specified conditions or in specific circumstances. Despite this, consumer-oriented applications and products increasingly require accurate detection of spoken words in real-world situations, and this remains a difficulty.

Most of the world's languages are considered non-mainstream languages (Precoda 229-243). One such language is Armenian, which is mainly used in the territories of historical Armenia and within the Armenian Diaspora. Non-mainstream languages require large amounts of resources to create speech recognition models or applications, and such resources can be costly or inaccessible. One of the main issues is collecting enough audio data from different speakers, which is necessary for the speech recognition model to understand all the tones of speech, because people's voices differ from each other in timbre, speed of pronunciation, tone of voice, etc. The second problem is the phonological diversity of sound data. The large number of sound phonemes (diphones, triphones, etc.), combined with the diversity of speakers, leads to the need for thousands of vocabulary expressions, which complicates the data collection process. The third challenge often encountered when working with less common languages is the lack of online glossaries (with appropriate phonetic transliterations). The fourth most common problem is the orthographic and literary variation within languages.

The development of ASR technology and the use of digital assistants have moved quickly from cellphones to homes, and their application in fields such as business, finance, marketing, and healthcare is quickly becoming apparent. Below, we discuss some applications of Armenian ASR in economics.

Literature Review. Recently, Transformer- and convolutional neural network (CNN)-based models have outperformed recurrent neural networks (RNNs) in automatic speech recognition (ASR). Transformer models excel at capturing content-based global interactions, whereas CNNs excel at exploiting local features. In 2020, Gulati, et al. (Gulati, et al.) integrated convolutional neural networks with transformers to model both local and global dependencies of an audio sequence in a parameter-efficient manner, which, according to the authors, gives the best of both techniques. In this context, they proposed the Conformer, a convolution-augmented transformer for speech recognition. Conformer beats prior Transformer- and CNN-based models, attaining state-of-the-art accuracies.
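For illustration, a minimal, simplified PyTorch sketch of a single Conformer block is given below: two half-step ("macaron") feed-forward modules sandwich a self-attention module and a convolution module, followed by a final layer normalization. The dimensions, the kernel size, and the omission of relative positional encoding are illustrative assumptions and do not reproduce the exact configuration of Gulati, et al.

# A simplified Conformer block (PyTorch). Illustrative only: relative positional
# encoding and several other details of Gulati, et al. (2020) are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForwardModule(nn.Module):
    def __init__(self, d_model, expansion=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, d_model * expansion),
            nn.SiLU(),                                  # Swish activation
            nn.Dropout(dropout),
            nn.Linear(d_model * expansion, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class ConvolutionModule(nn.Module):
    def __init__(self, d_model, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)          # feeds a GLU
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.batch_norm = nn.BatchNorm1d(d_model)
        self.pointwise2 = nn.Conv1d(d_model, d_model, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)       # -> (batch, d_model, time)
        y = F.glu(self.pointwise1(y), dim=1)
        y = F.silu(self.batch_norm(self.depthwise(y)))
        y = self.pointwise2(y).transpose(1, 2)
        return self.dropout(y)

class ConformerBlock(nn.Module):
    """Half-step FFN -> self-attention -> convolution -> half-step FFN -> LayerNorm."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.ff1 = FeedForwardModule(d_model)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvolutionModule(d_model)
        self.ff2 = FeedForwardModule(d_model)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)              # "macaron" half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)                   # local feature modeling
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)

# Example: a batch of 2 utterances, 100 frames, 256-dimensional features.
out = ConformerBlock()(torch.randn(2, 100, 256))   # -> shape (2, 100, 256)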

After the introduction of Conformers, many modifications have been applied to them. Neural architecture search (NAS) has been effectively employed in image classification, natural language processing, and automatic speech recognition (ASR) applications to find state-of-the-art (SOTA) structures rather than ones built by humans. Using a search algorithm over a pre-defined search space, NAS can discover a SOTA, data-specific architecture on validation data. Inspired by the success of NAS in ASR tasks, Liu, et al. (Liu, et al.) offer a NAS-based ASR framework with one search space and one differentiable architecture search algorithm (DARTS). The search space described in the article is built on the backbone of the convolution-augmented transformer (Conformer), which is a more expressive ASR architecture than those utilized in other NAS-based ASR frameworks. On the widely known Mandarin benchmark AISHELL-1, the best searched architecture greatly outperforms the baseline Conformer model, with an 11 percent relative improvement in character error rate (CER).

The utilization of a competitive Conformer-based hybrid ASR (Zeineldeen, et al. 7437-7441) has also been shown to be very effective. After employing several approaches for lowering WER and increasing training speed, Zeineldeen, et al. used temporal down-sampling methods for efficient training and transposed convolutions to up-sample the output sequence again. This technique generalizes very well on the Switchboard 300h test set Hub5'01 and significantly outperforms the BLSTM-based hybrid model.

In 2022, Yang, et al. (Yang, et al.) proposed a Conformer-based acoustic model for robust ASR, developed and tested on the CHiME-4 challenge's monaural task. Their model obtains a 6.25 percent word error rate (WER) on real test data when combined with utterance-wise normalization and iterative speaker adaptation, beating the previous best system by approximately 8.4 percent. Furthermore, compared to the baseline system, the proposed model has an 18.3 percent smaller model size and requires 79.6 percent less training time. They fixed the number of Conformer encoder blocks in the proposed system at two, as experiments with extra encoder blocks yielded no better results.

Several companies/individuals have taken up the issue of developing an Armenian ASR. Google's speech-to-text recognition added 21 languages in 2017, bringing the total number of supported languages to 119, with Armenian being one of them. It is available through the Cloud Speech API. In 2018, Ucom created an artificial intelligence-based automated Armenian speech-to-text system. ICAN Development Company began work on the Armenian NeuroNetwork project in 2019.

Dataset. The main dataset used to train the ASR model for the Armenian language is audio and text data collected from about 40 historical Armenian stories. We named this dataset "Stories-15": the number "15" refers to the total duration, in hours, of the annotated audio data. The length of the annotations varies from 1 second to 20 seconds. Annotations of different lengths allow more accurate modeling of sounds, since the same words (same phonemes) appear in different contexts with different pronunciations. The "Stories-15" dataset contains recordings of 12 speakers, 9 of whom are men (approximately 9 hours and 20 minutes of recordings) and 3 are women (approximately 5 hours and 45 minutes). "Stories-15" includes translations of foreign literature as well as various works by famous Armenian writers.

We also used some audio data from Mozilla Common Voice¹, which provides open-source data for improving voice recognition for many languages. In April 2022 the latest version of Common Voice, "Mozilla Common Voice 9.0", was released. The size of its Armenian dataset is 90 MB, the duration is 4 hours, and the number of speakers is 60, but we used only the validated half of that data.

Summing up: the whole dataset ("Stories-15" + Mozilla Common Voice) is 17 hours long, 16 of which are used for training the model, 0.5 hours for validation, and 0.5 hours for testing.
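For illustration, the following sketch shows one common way to prepare such data for training: each annotated segment is written as a JSON-lines manifest record with an audio path, duration and transcription, and the records are split into training, validation and test manifests. The file names, directory layout and split procedure are assumptions made for the example and do not describe the exact scripts used for "Stories-15".

# Sketch: build JSON-lines manifests from (wav, transcript) pairs and split
# them into train/validation/test portions. Paths are illustrative assumptions.
import json
import random
import soundfile as sf
from pathlib import Path

def build_entries(audio_dir: str, transcripts: dict) -> list:
    """transcripts maps an audio file name to its Armenian transcription."""
    entries = []
    for wav_path in Path(audio_dir).glob("*.wav"):
        if wav_path.name not in transcripts:
            continue
        audio, sr = sf.read(wav_path)
        entries.append({
            "audio_filepath": str(wav_path),
            "duration": round(len(audio) / sr, 2),   # seconds
            "text": transcripts[wav_path.name],
        })
    return entries

def write_manifest(entries: list, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for e in entries:
            f.write(json.dumps(e, ensure_ascii=False) + "\n")

def take_hours(pool: list, hours: float) -> list:
    """Remove roughly `hours` worth of entries from the pool and return them."""
    taken, total = [], 0.0
    while pool and total < hours * 3600:
        item = pool.pop()
        taken.append(item)
        total += item["duration"]
    return taken

if __name__ == "__main__":
    transcripts = json.load(open("stories15_transcripts.json", encoding="utf-8"))
    entries = build_entries("stories15_wav", transcripts)
    random.seed(0)
    random.shuffle(entries)

    # Roughly 0.5 h for validation and 0.5 h for test, the rest for training.
    val = take_hours(entries, 0.5)
    test = take_hours(entries, 0.5)
    write_manifest(entries, "train_manifest.json")
    write_manifest(val, "val_manifest.json")
    write_manifest(test, "test_manifest.json")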

ASR Model. To develop the Armenian ASR, we employ the Conformer-CTC-Medium model for English automatic speech recognition, trained on NeMo ASRSet². This collection comprises Conformer-CTC medium-size versions (about 30M parameters) trained on NeMo ASRSet with around 16,000 hours of English speech. The model transcribes speech using the lowercase English alphabet, spaces, and apostrophes. Since the phonetic structures of Armenian and English are similar, we decided to employ the abovementioned model. The Conformer-CTC model for automatic speech recognition is a non-autoregressive variation of the Conformer model described earlier in this article that employs CTC³ loss/decoding instead of a Transducer. SpecAugment (Park, et al.) and SpecCutout are also employed for model training. SpecAugment is a straightforward data augmentation technique for voice recognition. It is applied directly to a neural network's feature inputs (i.e., filter bank coefficients). The augmentation strategy comprises warping the features, masking blocks of frequency channels, and masking blocks of time steps. Regarding SpecCutout (DeVries and Taylor), the authors demonstrate how a basic regularization approach known as cutout, which involves randomly cutting out square portions of the input during training, may be utilized to increase the resilience and overall performance of convolutional neural networks. Not only is this approach simple to implement, but it may also be used in conjunction with existing kinds of data augmentation and other regularizers to boost model performance even further.
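As an illustration of what these augmentations do to the input features, the sketch below applies SpecAugment-style frequency and time masking and a SpecCutout-style rectangular cutout to a spectrogram. The mask widths and counts are illustrative assumptions, and the time-warping step of SpecAugment is omitted for brevity.

# Sketch of SpecAugment-style masking and SpecCutout-style rectangular cutout
# applied to a spectrogram of shape (n_mels, n_frames). Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def spec_augment(spec, n_freq_masks=2, freq_width=10, n_time_masks=2, time_width=25):
    """Zero out random frequency bands and time spans (time warping omitted)."""
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, freq_width + 1)
        f0 = rng.integers(0, max(1, n_mels - f))
        spec[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, time_width + 1)
        t0 = rng.integers(0, max(1, n_frames - t))
        spec[:, t0:t0 + t] = 0.0
    return spec

def spec_cutout(spec, n_rects=3, rect_freq=15, rect_time=25):
    """Cut out random rectangles, as in the cutout regularizer."""
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    for _ in range(n_rects):
        f0 = rng.integers(0, max(1, n_mels - rect_freq))
        t0 = rng.integers(0, max(1, n_frames - rect_time))
        spec[f0:f0 + rect_freq, t0:t0 + rect_time] = 0.0
    return spec

# Example: 80 mel bins x 300 frames of (synthetic) filter-bank features.
features = rng.standard_normal((80, 300))
augmented = spec_cutout(spec_augment(features))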

Results. After freezing the first 14 blocks of the Conformer and employing SpecAugment and SpecCutout, we trained the Armenian ASR model by fine-tuning the remaining blocks. WER⁴ was used as the evaluation metric.
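A minimal sketch of this fine-tuning setup with the NVIDIA NeMo toolkit is given below; the corresponding results are summarized in Table 1. The sketch assumes the public checkpoint name "stt_en_conformer_ctc_medium", that the encoder exposes its Conformer blocks as encoder.layers, and that an Armenian subword tokenizer and manifests have been prepared; exact configuration fields may differ between NeMo versions, so this is an outline rather than the exact training script used here.

# Sketch (NVIDIA NeMo): load the pretrained English Conformer-CTC-Medium model,
# switch to an Armenian tokenizer, freeze the first 14 encoder blocks and
# fine-tune the rest. Config fields and paths are illustrative assumptions.
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_medium")

# Replace the English subword vocabulary with an Armenian one
# (tokenizer directory prepared beforehand, e.g. with SentencePiece).
model.change_vocabulary(new_tokenizer_dir="tokenizer_hy_bpe", new_tokenizer_type="bpe")

# Freeze the first 14 Conformer blocks; only the remaining blocks
# and the decoder are updated during fine-tuning.
for block in model.encoder.layers[:14]:
    for p in block.parameters():
        p.requires_grad = False

model.setup_training_data({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
model.setup_validation_data({
    "manifest_filepath": "val_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
})

trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=100)
trainer.fit(model)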

Table 1. Model results

Model: Conformer-CTC-medium (English) + first 14 blocks frozen + SpecAugment + SpecCutout
WER (Word Error Rate) without LM: 0.68
WER (Word Error Rate) with LM: 0.369

As shown in Table 1, the word error rate for the pure Conformer-CTC model without any language model is 68 percent, whereas with a 10-gram language model we are able to get 37 percent. Here we used Byte Pair Encoding to build a subword language model for Armenian. As a result, we obtained a compact (100 MB) subword language model trained on massive Armenian corpora.
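The language model pipeline can be sketched as follows: train a BPE tokenizer on an Armenian text corpus, re-encode the corpus as subword tokens, and estimate a 10-gram model over the token stream, for example with KenLM. The corpus and model file names and the external KenLM invocation below are assumptions made for the example, not the exact pipeline used in this work.

# Sketch: build a BPE subword 10-gram language model for Armenian.
# File names and the external KenLM calls are illustrative assumptions.
import subprocess
import sentencepiece as spm

# 1. Train a BPE tokenizer on a large Armenian text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="armenian_corpus.txt",
    model_prefix="hy_bpe",
    vocab_size=1024,
    model_type="bpe",
    character_coverage=1.0,
)

# 2. Re-encode the corpus as space-separated subword tokens.
sp = spm.SentencePieceProcessor(model_file="hy_bpe.model")
with open("armenian_corpus.txt", encoding="utf-8") as src, \
     open("armenian_corpus.bpe.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")

# 3. Train a 10-gram KenLM model over the subword stream (KenLM binaries required).
with open("armenian_corpus.bpe.txt") as tokens, open("hy_bpe_10gram.arpa", "w") as arpa:
    subprocess.run(["lmplz", "-o", "10", "--discount_fallback"],
                   stdin=tokens, stdout=arpa, check=True)
subprocess.run(["build_binary", "hy_bpe_10gram.arpa", "hy_bpe_10gram.bin"], check=True)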

Applications of ASR. To give the reader an idea of what to expect from the ASR technology, here we will show some applications in business and marketing.

The banking and financial industries aim to employ voice recognition to eliminate consumer friction. Voice-activated banking has the potential to significantly reduce the need for human customer support while also lowering personnel expenses. As a result, a personalized financial assistant might increase consumer satisfaction and loyalty. With the help of speech recognition in banking, one can:

• Get information about one's balance, transactions, and spending patterns without having to unlock a phone.

• Make payments.

• Get details about transaction history.

In marketing, voice search has the potential to change the way marketers communicate with their customers. Marketers should watch for growing trends in user data and behavior as people's interactions with their gadgets evolve. With voice recognition, marketers will have access to a new form of data to analyze. Accents, speech patterns, and vocabulary can be used to assess a consumer's location, age, and other demographic information, such as cultural affinity. As for behavioral analysis of customers, marketers may need to focus on long-tail keywords and create conversational content, since speaking allows for longer, more conversational searches.

¹ Mozilla Common Voice - https://commonvoice.mozilla.org/en/datasets

² STT En Conformer-CTC Medium (NVIDIA NeMo pretrained model)

³ Connectionist Temporal Classification - https://distill.pub/2017/ctc/

⁴ Word Error Rate (WER) is a metric that measures how accurate an Automatic Speech Recognition (ASR) system is. It calculates the number of "errors" in the transcription text produced by an ASR system when compared to a human transcription.
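To make the definition of WER above concrete, the following minimal sketch computes it as the word-level edit distance (substitutions, deletions and insertions) between a reference and a hypothesis transcription, divided by the number of reference words.

# Sketch: word error rate as word-level Levenshtein distance / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(1, len(ref))

# Example: one substitution and one deletion over six reference words -> WER = 2/6.
print(wer("the cat sat on the mat", "the cat sit on mat"))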


Conclusion. To conclude, in this paper we present an Armenian ASR model. The state-of-the-art Conformer transformer, trained on audio book and Mozilla Common Voice data, was chosen as the basis of that model. In conjunction with the SpecAugment and SpecCutout augmentation techniques, our best model without a language model yields a 68 percent word error rate. With the employment of a 10-gram language model trained on massive Armenian corpora, we were able to improve the performance of the model and get a 37 percent word error rate. It is also planned to enlarge the Armenian speech corpora for further development of Armenian ASR.

WORKS CITED

1. DeVries, Terrance, and Taylor, Graham. “Improved Regularization of Convolutional Neural Networks with Cutout”. 2017, arXiv preprint arXiv:1708.04552.

2. Gulati, Anmol, et al. “Conformer: Convolution-augmented Transformer for Speech Recognition.” 2020, arXiv preprint arXiv:2005.08100.

3. Liu, Yan, et al. “Improved Conformer-based End-to-End Speech Recognition Using Neural Architecture Search.” 2021, arXiv preprint arXiv:2104.05390.

4. Park, Daniel, et al. “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition.” 2019, arXiv preprint arXiv:1904.08779.

5. Yang, Yufeng, et al. “A Conformer Based Acoustic Model for Robust Automatic Speech Recognition.” 2022, arXiv preprint arXiv:2203.00725.

6. Zeineldeen, Mohammad, et al. “Conformer-based Hybrid ASR System for Switchboard Dataset”. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022.

7. Zhongzhi, Shi. Intelligence Science. Tsinghua University Press, Elsevier, Beijing, 2021.

8. Precoda, Kristin. “Non-Mainstream Languages and Speech Recognition: Some Challenges”. The CALICO Journal. Vol. 21, No. 2, 2004, pp. 229-243.

ԿՈՆՖՈՐՄԵՐԻ ՎՐԱ ՀԻՄՆՎԱԾ ԽՈՍՔԻ ԱՎՏՈՄԱՏԱՑՎԱԾ ՃԱՆԱՉՈՒՄ

ՀԱՅԵՐԵՆԻ ՀԱՄԱՐ

ԴԱՎԻԹ ՔԱՐԱՄՅԱՆ

Հայ–ռուսական համալսարանի

ինֆորմատիկայի և համակարգչային տեխնիկայի ֆակուլտետի, մաթեմատիկական մոդելավորման, թվային մեթոդների և ծրագրերի համալիրների ամբիոնի ասպիրանտ, ք. Երևան, Հայաստանի Հանրապետություն davitkar98@gmail.com

ՏԻԳՐԱՆ ՔԱՐԱՄՅԱՆ

Երևանի պետական համալսարանի տնտեսագիտության և կառավարման ֆակուլտետի տնտեսագիտության մեջ մաթեմատիկական մոդելավորման ամբիոնի դասախոս, ասպիրանտ, ք. Երևան, Հայաստանի Հանրապետություն t.qaramyan@ysu.am

Հոդվածը նպատակ ունի հանրությանը ներկայացնել խոսքի ավտոմատացված ճանաչման մոդելը հայերենի համար և այդ մոդելի կիրառությունը տնտեսության տարբեր ոլորտներում: Հայերեն խոսքի տվյալների բազայի բացակայության պատճառով հոդվածում կատարելագործել ենք ձայնի ճանաչման և տեքստի սիմվոլների գեներացման մոդելները' օգտագործելով Conformer նախապես «վարժեցված» մոդելը և հայերենի համար ստեղծված կոմպակտ լեզվի մոդելը (LM):


Հոդվածն ընթերցողների ուշադրությունը կենտրոնացնում է մարդու խոսքի ճանաչման և խոսքը տեքստի վերածելու խնդրի վրա, հատկապես քիչ տարածված լեզուների համար։

Հոդվածը շարադրված է գիտական վերացարկման և քննարկված մոտեցման վերջին կիրառությունների համակցված վերլուծության հիման վրա: Աղբյուրների արժանահավատությունը, համապատասխանությունը և արդիականությունը ստուգվել են գրականության լայնածավալ հետազոտությունների ընթացքում:

Եզրակացությունն այն է, թեև բավականին դժվար է խոսքի ավտոմատացված մոդելի մշակումը քիչ տարածված լեզուների համար, սակայն հոդվածում ապացուցվել է, որ Conformer-ի վրա հիմնված տրանսֆորմեր-մոդելների օգտագործումը լեզվի մոդելների հետ համատեղ արդյունավետ է հայերեն խոսքի ճանաչման համար: Նաև ապացուցվել է, որ այս հոդվածում կիրառված մեթոդը կիրառելի է նաև այլ լեզուների համար՝ որոշակի ճշգրտումներով:

Հիմնաբառեր' խոսքի ավտոմատացված ճանաչում, Կոնֆորմեր, լեզվի մոդել, N-գրամ, տրանսֆորմերներ, բառի սխալի մակարդակ, փոխանցվող ուսուցում, NeMo:

АВТОМАТИЧЕСКОЕ РАСПОЗНАВАНИЕ РЕЧИ НА АРМЯНСКОМ ЯЗЫКЕ НА ОСНОВЕ КОНФОРМЕРОВ

ДАВИД КАРАМЯН

аспирант кафедры математического моделирования, вычислительных методов и программных комплексов факультета информатики и вычислительный техники Российско-Армянского университета, г. Ереван, Республика Армения

ТИГРАН КАРАМЯН

преподаватель и аспирант кафедры математического моделирования в экономике факультета экономики и финансов Ереванского государственного университета, г. Ереван, Республика Армения

Цель статьи - представить модель распознавания армянской автоматизированной речи и ее применения в различных сферах экономики. Из-за отсутствия армянских речевых данных в этой статье мы доработали части распознавания голоса и генерации текстовых символов, используя предварительно обученную модель Конформер и компактную модель армянского языка.

Статья акцентирует внимание на проблеме распознавания человеческой речи и преобразования ее в текст, особенно для неосновных языков.

Статья подготовлена с использованием метода научной абстракции и комбинированным анализом многих последних методов реализации обсуждаемого подхода. Надежность, актуальность и подлинность источников подтверждены их объемными исследованиями.

В ходе исследования мы пришли к заключению, что, хотя разработать модель распознавания речи для неосновных языков довольно сложно, было доказано, что использование преобразователей на основе Конформеров в сочетании с языковыми моделями эффективно для распознавания армянской речи. Также было доказано, что методика, использованная в этой статье, применима и для других языков, с некоторыми корректировками.

Ключевые слова: автоматическое распознавание речи, конформер, языковая модель, n-граммы, трансформеры, процент ошибок в слове, трансферное обучение, NeMo.

