Electronic Journal «Technical Acoustics» http://www.ejta.org
2007, 22
M. Chandrasekar1, M. Ponnavaikko2
1Centre of Excellence in TQM, SRM University, Chennai, India, e-mail: [email protected]
2Bharathidasan University, Tamilnadu, India, e-mail: [email protected]
Spoken Tamil Character Recognition
Received 13.11.2007, published 07.12.2007
Speech is one of the most complex signals and a powerful tool for communication. It has long been a goal of scientists that machines should recognize human speech, either to act on voice commands or to produce a text transcript of what was spoken. Automatic recognition of speech by machine has been a research goal for more than four decades, and speech recognition tools have now become a necessity for busy executives and for industrial applications. From the beginning, research in this direction concentrated on English speech recognition; only in the last few years has work been carried out on recognizing speech in other languages. The Indian languages are structurally and syntactically different from Latin-script languages such as English. This paper presents an approach for recognizing spoken characters in Indian languages, particularly Tamil, using the acoustic features of individual letters. A three-layered back-propagation neural network approach to the problem is presented, and the efficiency of the method is demonstrated by applying it to Tamil character recognition.
INTRODUCTION
Speech is the human’s most efficient method of communication. Natural speech is continuous, and the pronunciation of words and the style of speaking vary from person to person and place to place. It is therefore a complex task to recognize speech in a person-independent and place-independent manner, and development in this area of research starts with constraints and assumptions, such as person and place dependence. Further, the environment where speech occurs is normally polluted with noise, so natural speech-to-text recognition must work with recorded speech signals that contain noise, and the recognition process needs a noise-filtering mechanism. Speech comprises a set of sentences; a sentence consists of words, and a word consists of letters. The speech recognition methods developed for spoken English recognize words with the help of dictionaries. English speech recognition techniques do not attempt to recognize the letters in a word, since the pronunciation of an English word is not a composition of the pronunciations of the letters that form it. But the Indian languages, Tamil in particular, are different: the pronunciation of a Tamil word is a composition of the pronunciations of the letters that form the word. Hence no dictionary is required for Tamil speech processing; the letters in a word can be recognized and the word can be printed in text mode. Thus, the approach for recognizing Tamil speech can differ from that used for English speech-to-text processing. Therefore, it is proposed to develop a method for Tamil speech processing in three stages, without the use of dictionaries.
They are:
(i) Segmentation of sentences from the speech.
(ii) Segmentation of words from the segmented sentences.
(iii) Segmentation of characters from the segmented words.
The pronunciation of a Tamil word is almost a concatenation of the pronunciations of its letters. However, in some cases the sound of a character is either shortened or elongated depending on the context. A complete speech recognition methodology should take care of all these aspects to reach an acceptable level of Tamil speech recognition.
The authors in their research have developed a solution procedure to achieve the above goal. Their initial effort is to recognize stand-alone Tamil characters, and this paper presents a methodology for recognizing stand-alone spoken Tamil characters. The method uses Linear Prediction Cepstral Coefficients derived from Linear Predictive Coding (LPC).
As stated earlier, research in this direction started with recognizing words and phonemes. The system designed by Davis K. H. et al. for isolated word recognition for a single speaker relied on spectral measurements [1]. Olson and Belar recognized a few monosyllabic words of a single speaker by measuring spectral features during vowel regions [2]. Fry built a phoneme recognizer and improved the phoneme accuracy of words consisting of two or more phonemes [3]. Using a filter-bank analyzer, Forgie J. W. and Forgie C. D. constructed a vowel recognizer that recognized vowels in a speaker-independent manner [4].
Suzuki and Nakata built a special hardware system with an elaborate filter-bank analyzer to recognize vowels [5]. A hardware phoneme recognizer was designed by Sakai and Doshita [6]. The digit-recognizing hardware of Nagata et al. from Japan paved the way for highly productive research in speech recognition [7]. Juang B. H. et al. [8] discussed the future challenges in speech processing.
Jiang Minghu et al. put forward improved time-delay neural network methods for phoneme recognition [9]; the improvement increased the convergence speed and hence reduced the training time.
Chandrasekar C. and Sivarama Krishna Rao J. Y. proposed an approach to recognize Consonant-Vowel (CV) units in Indian languages using a modular neural network; this system was used for names only [10]. Ganga Shetty S. V. and Yagnanarayana B. presented neural network models for the recognition of syllable-like units in Indian languages [11], where a multilayer feed-forward neural network was used to classify 145 CV units into 9 subgroups.
Pusateri E. and Thong J. M. employed three-emitting-state Gaussian-mixture HMMs [12], yielding a phoneme recognition accuracy of 69%. Salomon J. et al. used Support Vector Machines (SVMs) for frame-wise classification [13], reporting 70.6% of frames correctly classified.
Mporas I. and Fakotakis N. described an approach that produces an N-best list of hypotheses for recognizing phonemes, with performance competitive with existing recognizers [14]. In a comparative evaluation of various classification methods, the SVM-based phoneme recognizer demonstrated by Mporas I. et al. [15] showed good accuracy, with a recognition rate of 74.2% when the language model was applied.
1. METHODOLOGY
The methodology proposed by the authors uses the Back Propagation Network (BPN) paradigm of neural networks, implemented with the help of Matlab software.
For speech recognition, features such as intensity, pitch, the short-time spectrum, formant frequencies, Linear Predictive Coding (LPC) features, LPC cepstrum features and mel-frequency cepstrum coefficients of the speech wave are used [16]. The feature used in this approach is the set of Linear Prediction Cepstral Coefficients derived from LPC [17]. The cepstral coefficients are extracted from the wave signal and stored in a file. The wave signals for three Tamil characters are shown in figures 1 to 3.
[Figures 1–3. Waveforms of three Tamil characters; Figure 2: waveform of character ‘p’.]
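The paper extracts the LPC cepstral coefficients in Matlab but does not list code. As a minimal sketch of the standard pipeline, the Python/NumPy fragment below computes LPC coefficients by the autocorrelation (Levinson-Durbin) method and converts them to cepstral coefficients with the usual LPC-to-cepstrum recursion. The order-18 fit and the single whole-utterance analysis window are illustrative assumptions; the paper does not specify its exact framing.

```python
import numpy as np

def levinson_lpc(frame, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    Returns the prediction-error filter a = [1, a_1, ..., a_p]."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err                       # reflection coefficient
        a[1 : i + 1] = a[1 : i + 1] + k * a[0:i][::-1]
        err *= 1.0 - k * k                   # prediction-error update
    return a

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients of the all-pole model 1/A(z) via the
    standard recursion c_m = -a_m - sum_k (k/m) c_k a_{m-k}."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for m in range(1, n_ceps + 1):
        acc = a[m] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k]
        c[m] = -acc
    return c[1:]

# Hypothetical usage on an endpoint-trimmed utterance sampled at 22050 Hz:
# lpcc = lpc_to_cepstrum(levinson_lpc(signal, order=18), n_ceps=18)
```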
A BPN structure includes at least three layers: an input layer, an output layer and one or more hidden layers. For this problem the BPN is designed with three layers, i.e. a single hidden layer. The number of neurons in each layer affects the accuracy and efficiency of the solution process.
1.1. Input layer specification
The input layer should have enough neurons to transfer the speech wave signal effectively into the recognition process. A study of the speech waves in figures 1 to 3 shows that the waves are cyclic, with a certain number of wave cycles. As stated earlier, the input to the system is the set of cepstral coefficients extracted from the speech waves using Matlab software. The speech wave of a character can be divided into any number of sections, from one upwards, and each section is represented by one cepstral coefficient; in the input layer, each cepstral coefficient is represented by one neuron. If the wave is divided into n sections, there will be n neurons in the input layer. When n is small the wave sections are large and the inner details of the wave are heavily approximated; when n is large the wave is represented with its inner details, leading to high accuracy, but the computational effort is high. Hence an optimum number of input-layer neurons has to be fixed judiciously. The wave pattern shows about 12 cycles; if each cycle is taken as one section, there will be 12 cepstral coefficients and hence 12 neurons in the input layer. Through trials it was observed that 18 neurons in the input layer give better results.
1.2. Output layer specification
The output layer should have enough neurons for the output to be matched effectively against the stored patterns so that the character can be recognized. It was observed through trials that better results are obtained when the number of neurons equals the number of wave cycles. Therefore, the number of neurons in the output layer is fixed at 12.
1.3. Hidden layer specification
The hidden layer can have any number of neurons. When the number of neurons in the hidden layer is increased, the accuracy of the system may increase, but the computational effort also grows. By trial it was found that 15 neurons in the hidden layer give better results.
1.4. Process modeling
The BPN is a powerful technique used for speech recognition. A BPN has the ability to learn mappings by example; that is, it generates input-output pair relationships [18].
For each character, a selected number of sample training waves and one target wave of the speech are considered for training the system. The input vector for the BPN is thus the set of normalized cepstral coefficients of the training waves, derived from the speech waves; the corresponding target cepstral coefficients are used for comparing the output values. These cepstral coefficients are presented as input to the input layer of the BPN. The algorithm for the BPN is given below.
1) Present the n cepstral coefficients of the first training wave and the m cepstral coefficients of the corresponding target wave as input to the input layer.
2) Obtain the sum of the weighted inputs to the hidden layer, $y_j$, using (1):

$$y_j = \sum_{i=1}^{n} w_{ji} x_i, \qquad (1)$$

where $x_i$ is the input cepstral coefficient that connects to the $j$-th unit of the hidden layer and $w_{ji}$ is the weight associated with that connection; $i = 1$ to $n$, $j = 1$ to $h$, where $n$ is the number of neurons in the input layer and $h$ is the number of neurons in the hidden layer.
3) Generate the output of the hidden layer, $z_j$, by transferring $y_j$ through a non-linear activation function $g(\cdot)$:

$$g(y_j) = \frac{1}{1 + \exp(-y_j)}, \qquad (2)$$

$$z_j = g(y_j), \qquad (3)$$

$$z_j = \frac{1}{1 + \exp(-y_j)}. \qquad (4)$$
4) Obtain the sum of the weighted inputs to the output layer, $a_k$, using (5):

$$a_k = \sum_{j=1}^{h} w_{kj} z_j, \qquad (5)$$

where $z_j$ sends a connection to the $k$-th unit of the output layer and $w_{kj}$ is the weight associated with that connection; $k = 1$ to $m$, where $m$ is the number of neurons in the output layer.
5) Generate the final output of the output layer, $O_k$, by transferring $a_k$ through the non-linear activation function $g(\cdot)$:

$$g(a_k) = \frac{1}{1 + \exp(-a_k)}, \qquad (6)$$

$$O_k = g(a_k), \qquad (7)$$

$$O_k = \frac{1}{1 + \exp(-a_k)}. \qquad (8)$$
6) Compare the output activations $O_k$ with the target values $T_k$ for the pattern and calculate the error $E$ for the given pattern using (9):

$$E = \frac{1}{2} \sum_{k=1}^{m} (T_k - O_k)^2. \qquad (9)$$
7) Now, to propagate the error backwards, calculate the error signal $\delta_k$ of the output-layer neurons using (10):

$$\delta_k = O_k (1 - O_k)(T_k - O_k). \qquad (10)$$

8) Calculate the error signal $\delta_j$ of the hidden-layer neurons using (11):

$$\delta_j = z_j (1 - z_j) \sum_{k=1}^{m} w_{kj} \delta_k. \qquad (11)$$
9) Calculate the delta weights $\Delta w_{kj}$ and update the network weights $w_{kj}$ between the hidden and output layers using (12) and (13):

$$\Delta w_{kj} = \eta \delta_k z_j + \alpha \Delta w_{kj}(\text{old}), \qquad (12)$$

$$w_{kj}(\text{new}) = w_{kj}(\text{old}) + \Delta w_{kj}, \qquad (13)$$

where $\eta$ is the learning rate and $\alpha$ is the momentum term.
The learning-rate coefficient determines the size of the weight adjustments made during each iteration and hence influences the rate of convergence. A poor choice of the coefficient can result in a failure to converge. If the learning-rate coefficient is too large, the search path will oscillate and converge more slowly; if it is too small, the convergence time increases significantly. If the learning coefficient is zero, no learning takes place, so the coefficient must be greater than zero. If it is greater than 1.0, the weight vector will overshoot its ideal position and oscillate. Hence the learning coefficient must lie between zero and one.
Adding the momentum term $\alpha$ also improves the rate of convergence of the algorithm. This is accomplished by adding a fraction of the previous weight change to the current weight change; such a term helps smooth the descent path by preventing extreme changes in the gradients due to local anomalies. The value of the momentum term must be positive but less than 1.0.
10) Calculate the delta weights $\Delta w_{ji}$ and update the network weights $w_{ji}$ between the input and hidden layers using (14) and (15):

$$\Delta w_{ji} = \eta \delta_j x_i + \alpha \Delta w_{ji}(\text{old}), \qquad (14)$$

$$w_{ji}(\text{new}) = w_{ji}(\text{old}) + \Delta w_{ji}. \qquad (15)$$
11) Present the n cepstral coefficients of the second training wave and the cepstral coefficients of the target wave as input to the input layer.
12) Repeat steps 1 to 11 with the updated weights until all the training patterns have been presented.
13) Sum the errors of all the training patterns, obtain the average error, and compare it with the specified tolerance.
14) Repeat the entire process from steps 1 to 13 until the convergence criterion is met.
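As a concrete illustration of steps 1 to 14, the Python/NumPy sketch below implements equations (1)-(15) for the 18-15-12 network described above. The weight initialization range, learning rate $\eta = 0.25$, momentum $\alpha = 0.9$ and tolerance are illustrative assumptions; the paper does not report the values it actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hid, n_out = 18, 15, 12          # layer sizes from the paper
eta, alpha = 0.25, 0.9                   # learning rate / momentum (assumed values)

w_ji = rng.uniform(-0.5, 0.5, (n_hid, n_in))   # input -> hidden weights
w_kj = rng.uniform(-0.5, 0.5, (n_out, n_hid))  # hidden -> output weights
dw_ji = np.zeros_like(w_ji)                    # previous weight changes (momentum)
dw_kj = np.zeros_like(w_kj)

def g(v):
    """Sigmoidal activation, eqs (2) and (6)."""
    return 1.0 / (1.0 + np.exp(-v))

def train_pattern(x, t):
    """One forward and backward pass for a single (input, target) pair."""
    global w_ji, w_kj, dw_ji, dw_kj
    y = w_ji @ x                                  # eq (1)
    z = g(y)                                      # eqs (3)-(4)
    a = w_kj @ z                                  # eq (5)
    o = g(a)                                      # eqs (7)-(8)
    E = 0.5 * np.sum((t - o) ** 2)                # eq (9)
    delta_k = o * (1.0 - o) * (t - o)             # eq (10)
    delta_j = z * (1.0 - z) * (w_kj.T @ delta_k)  # eq (11), old weights
    dw_kj = eta * np.outer(delta_k, z) + alpha * dw_kj   # eq (12)
    w_kj = w_kj + dw_kj                           # eq (13)
    dw_ji = eta * np.outer(delta_j, x) + alpha * dw_ji   # eq (14)
    w_ji = w_ji + dw_ji                           # eq (15)
    return E

def train(waves, target, tol=1e-3, max_epochs=10000):
    """Steps 12-14: cycle over all training waves of one character until
    the average pattern error falls below the tolerance."""
    for _ in range(max_epochs):
        if np.mean([train_pattern(x, target) for x in waves]) < tol:
            break
```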
2. SOLUTION PROCEDURE
2.1. Input process
A corpus of 247 letters was recorded in an environment without background noise, with 20 recordings of each character. The speech was recorded at 22050 Hz with a resolution of 16 bits, mono, using a Visual Basic sound-recorder program, and loaded directly onto the computer.
The voice signal consists of three parts: the speech segment, silence segments and background noise. In these recordings the background noise is almost zero. There are blank spaces before and after the utterance of each character. The amplitude of the speech signal varies appreciably with time; in particular, the amplitude of unvoiced sound is much lower than that of voiced sound. The short-time energy of the speech signal provides a convenient representation for detecting the voiced and unvoiced portions of the speech, and the blank spaces before and after the utterances are eliminated by measuring the short-time energy of the wave [16].
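A minimal sketch of this endpoint trimming is given below. The frame length and energy threshold are illustrative choices; the paper does not state the exact values it used.

```python
import numpy as np

def trim_silence(signal, fs=22050, frame_ms=20, threshold_ratio=0.02):
    """Remove low-energy frames before and after the utterance using
    short-time energy. frame_ms and threshold_ratio are assumed values."""
    n = int(fs * frame_ms / 1000)
    frames = signal[: len(signal) // n * n].astype(float).reshape(-1, n)
    energy = np.sum(frames ** 2, axis=1)            # short-time energy per frame
    active = np.nonzero(energy > threshold_ratio * energy.max())[0]
    if active.size == 0:
        return signal                               # nothing above threshold
    return frames[active[0] : active[-1] + 1].ravel()
```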
2.2. Training
The cepstral coefficients extracted from the training speech waves (for input) and from the target waves using a Matlab program are stored in text files.
The cepstral coefficients of one of the training waves of the character ‘^’ before normalization are given below:
0.102 -1.292 0.162 -0.053 -0.279 0.193 0.152
0.094 0.179 0.225 0.179 0.094 0.152 0.193
-0.279 -0.053 0.162 -1.292
The corresponding target pattern values for ‘^’ are given below:
0.009 -0.983 0.502 0.017 -0.458 0.025 0.354 0.025
-0.458 0.017 0.502 -0.983
These cepstral coefficients are normalized to lie between 0.05 and 0.95, within the range of the activation function, as discussed below.
The activation function used in the BPN algorithm is the sigmoidal function. The sigmoid output $z_j$ approaches 0 and 1 as $y_j \to -\infty$ and $y_j \to +\infty$ respectively; hence it is desirable to normalize the input and target values to lie between 0.05 and 0.95. The eighteen cepstral coefficients of each of the 20 patterns for each of the 247 Tamil characters are normalized, and the twelve cepstral coefficients of the standard target waves for the 247 characters are normalized and stored.
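The paper does not state its normalization formula explicitly, but the affine min-max map sketched below reproduces the printed sample values (e.g. the raw coefficient 0.102, with vector minimum −1.292 and maximum 0.225, maps to 0.877027):

```python
import numpy as np

def normalize(coeffs, lo=0.05, hi=0.95):
    """Min-max scale a coefficient vector into [lo, hi]."""
    c = np.asarray(coeffs, dtype=float)
    return lo + (hi - lo) * (c - c.min()) / (c.max() - c.min())
```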
The normalized cepstral coefficients of the input wave are given below:
0.877027 0.050000 0.912624 0.785069 0.650989 0.931015 0.906691 0.872281 0.922709 0.950000 0.922709 0.872281 0.906691 0.931015 0.650989 0.785069 0.912624 0.050000
Corresponding normalized values of the target wave of the character ‘^’:
0.651212 0.050000 0.950000 0.656061 0.368182 0.660909 0.860303 0.660909 0.368182 0.656061 0.950000 0.050000
The inputs to the 15 neurons in the hidden layer, $y_j$, are computed using (1), with random weights $w_{ji}$ selected for the first iteration. The computed inputs $y_j$ to the hidden layer are given below:
2.434063 0.198616 1.083838 0.588452 -0.016964
-0.861201 1.134364 0.467219 0.630558 -0.651665
0.438409 -0.407622 -1.496799 -2.070665 -1.054792
The outputs of the hidden layer, $z_j$, are computed using (4); the computed values of $z_j$ are given below:
0.919388 0.549491 0.747220 0.643010 0.495759
0.297088 0.756643 0.614725 0.652616 0.342614
0.607880 0.399482 0.182903 0.111981 0.258306
The inputs to the 12 neurons in the output layer, $a_k$, are computed with randomly selected weights $w_{kj}$ using (5). The computed values of $a_k$ are given below:
0.189151 -0.152733 0.016499 0.654867 0.642569
-0.128540 0.638012 0.697150 -0.535250 0.668006
0.903000 -0.424632
The outputs of the output layer, $O_k$, are computed using (8); the computed values of $O_k$ are given below:
0.547147 0.461891 0.504125 0.658106 0.655334
0.467909 0.654304 0.667556 0.369293 0.661056
0.711566 0.395409
The error $E$ calculated using (9) is $E = 0.327065$, and it is stored.
Now, to propagate the error backwards, the error signal $\delta_k$ of the output-layer neurons, the delta weights $\Delta w_{kj}$ between the output and hidden layers, and the new weights $w_{kj}$ are calculated using equations (10), (12) and (13) respectively.
The error signal $\delta_j$ of the hidden-layer neurons, the delta weights $\Delta w_{ji}$ between the hidden and input layers, and the new weights $w_{ji}$ are calculated using equations (11), (14) and (15) respectively.
The cepstral coefficients of the second training wave are then presented as input to the BPN and the above procedure is repeated with the updated weights; the resulting error is accumulated with the error of the first training wave. This is repeated until the cepstral coefficients of the 20th training wave have been presented. The average error over all 20 training waves of the given pattern is then calculated and compared with the specified tolerance, and the entire process is repeated until the desired error value is reached.
The final weights between the input and hidden layers and between the hidden and output layers from the last iteration are called the optimized weights, and these are stored. The normalized cepstral coefficients of the target waves $T_k$ are stored as patterns.
3. APPLICATION
Testing is carried out by presenting the cepstral coefficients of the testing wave, extracted using the Matlab software, together with the optimized weights between the input and hidden layers and between the hidden and output layers, and the cepstral coefficients of the corresponding target wave.
Steps 1 to 5 of the processing module are used to calculate the final output values.
The eighteen cepstral coefficients of the testing wave before normalization are given below:
0.152 -1.219 0.285 -0.006 -0.275 0.150 0.183 0.081 0.135 0.133 0.135 0.081 0.183 0.150 -0.275 -0.006 0.285 -1.219
The normalized cepstral coefficients of the input wave are given below:
0.870412 0.050000 0.950000 0.775864 0.614894
0.869215 0.888963 0.827926 0.860239 0.859043
0.860239 0.827926 0.888963 0.869215 0.614894
0.775864 0.950000 0.050000
Following the procedure described in sections 1 and 2, the outputs of the output layer, $O_k$, are computed using (8); the computed values of $O_k$ are given below:
0.650957 0.051107 0.949791 0.657495 0.369336 0.659827 0.860326 0.659714 0.367090 0.654864 0.949792 0.051207
The normalized cepstral coefficients of the target pattern Tk as stored in the system are given below:
0.651212 0.050000 0.950000 0.656061 0.368182 0.660909 0.860303 0.660909 0.368182 0.656061
0.950000 0.050000
The corresponding values of the calculated output are compared with the desired target values. Here the error in each of the 12 cepstral coefficients is less than 0.0015; hence the character is printed on the screen as ‘^’. In general, whenever the difference is less than the desired tolerance, the character of the corresponding wave pattern is printed on the screen.
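A minimal sketch of this matching step is given below, assuming the stored target patterns are held in a dictionary keyed by character; the dictionary layout and the `recognize` helper are illustrative, not the paper's actual implementation.

```python
import numpy as np

def recognize(output, stored_targets, tol=0.0015):
    """Match the network output against the stored target patterns.
    Returns the character whose 12 stored target coefficients all differ
    from the output by less than tol, or None if no pattern matches."""
    for char, target in stored_targets.items():
        if np.all(np.abs(output - np.asarray(target)) < tol):
            return char
    return None
```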
CONCLUSION
Employing the 247 characters of the Tamil language, cepstral coefficients are extracted and given as input to a three-layer back-propagation network, simulated with 18 neurons in the input layer, 12 neurons in the output layer and 15 neurons in the hidden layer. The 247 characters are recognized with the error in each output coefficient below the tolerance of 0.0015.
The proposed scheme was tested for a given speaker without background noise and has proved to be an effective approach for spoken Tamil character recognition. However, it has been tested only in a speaker-dependent setting; speaker-independent operation has not been tested. It is nevertheless possible to train the system for speaker-independent application and for operation in the presence of noise.
The research candidate is thankful to the Tamil Virtual University, SRM University and Bharathidasan University for providing the facilities for carrying out this work.
REFERENCES
1. Davis K. H., Biddulph R., and Balashek S. Automatic recognition of spoken digits. Journal of the Acoustical Society of America, vol. 24, pp. 637-642, 1952.
2. Olson H. F., and Belar H. Phonetic typewriter. Journal of the Acoustical Society of America, 28(4), 1072-1081, 1956.
3. Fry D. B. Theoretical aspects of mechanical speech recognition. Journal of the British Institution of Radio Engineers, 19(4), 211-218, 1959.
4. Forgie J. W., and Forgie C. D. Results obtained from a vowel recognition computer program. Journal of the Acoustical Society of America, 31(11), 1480-1489, 1959.
5. Suzuki J., and Nakata K. Recognition of Japanese vowels - preliminary to the recognition of speech. J. Radio Res. Lab, 37(8), 193-212, 1961.
6. Sakai T., and Doshita S. The phonetic typewriter. Information Processing 1962, Proc. IFIP Congress, Munich, pp. 445-450, 1962.
7. Nagata K., Kato Y., and Chiba S. Spoken digit recognizer for Japanese language. NEC Res. Develop., 6, 1963.
8. Juang B. H., Childers D., Cox R. V., DeMori R., Furui S., Mariani J. J., Price P., Sagayama S., Sondhi M. M., and Weischedel R. The past, present, and future of speech processing. IEEE Signal Processing Magazine, pp. 24-48, 1998.
9. Jiang Minghu, Yuan Baozong, Tang Xiaofong, and Lin Biqin. Fast learning algorithms for time-delay neural networks phoneme recognition. Proc. ICSP, pp. 730-733, 1998.
10. Chandrasekar C., and Sivarama Krishna Rao J. Y. Recognition of Consonant-Vowel (CV) units of speech in Indian languages. Proc. National Seminar on Information Revolution and Indian Languages, Hyderabad, pp. 22.1-22.6, Nov. 1999.
11. Ganga Shetty S. V., and Yagnanarayana B. Neural network models for recognition of Consonant-Vowel (CV) utterances. INNS-IEEE International Joint Conference on Neural Networks, Washington DC, pp. 1542-1547, July 2001.
12. Pusateri E., and Thong J. M. N-best list generation using word and phoneme recognition fusion. 7th European Conference on Speech Communication and Technology (Eurospeech), Aalborg, Denmark, September 2001.
13. Salomon J., King S., and Osborne M. Framewise phone classification using support vector machines. Proceedings of the International Conference on Spoken Language Processing, Denver, 2002.
14. Mporas I., and Fakotakis N. Least squares support vector machine based phoneme recognition. SPECOM 2005, Patras, Greece, pp. 377-380, Oct. 2005.
15. Mporas I., Ganchev T., Zervas P., and Fakotakis N. Recognition of Greek phonemes using support vector machines. Springer, Berlin/Heidelberg, pp. 290-300, 2006.
16. Schafer R. W., and Rabiner L. R. Digital representations of speech signals. Proceedings of the IEEE, vol. 63, no. 4, pp. 662-677, April 1975.
17. Furui S. Cepstral analysis technique for automatic speaker verification. IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 29, pp. 254-272, April 1981.
18. Lippmann R. P. An introduction to computing with neural nets. IEEE ASSP Magazine, pp. 4-22, April 1987.