A COMPARATIVE ANALYSIS OF SERBIAN PHONEMES: LINEAR AND NON-LINEAR MODELS
Danijela D. Protic
General Staff of the Serbian Army, Department of Telecommunications and Information Technology (J-6),
Centre for Applied Mathematics and Electronics, Belgrade
DOI: 10.5937/vojtehg62-5170
FIELD: Telecommunications
ARTICLE TYPE: Original Scientific Paper
ARTICLE LANGUAGE: English
Summary:
This paper presents the results of a comparative analysis of Serbian phonemes. The characteristics of vowels are quasi-periodicity and clearly visible formants. Non-vowels are short-term quasi-periodic signals with a low-power excitation signal. For the purpose of this work, speech production systems were modelled with linear AR models and the corresponding non-linear models based on feed-forward neural networks with one hidden layer. Sum squared error minimization as well as the back-propagation algorithm were used to train the models. The selection of the optimal model was based on two stopping criteria: the normalized mean squared test error and the final prediction error. The Levenberg-Marquardt method was used for the Hessian matrix calculation. The Optimal Brain Surgeon method was used for pruning. The generalization properties, based on the time domain and the signal spectra of the outputs of the hidden-layer neurons, are presented.
Key words: AR model; neural networks; speech.
Introduction
For several years now, neural network (NN) models have enjoyed wide popularity, being applied to problems of regression, classification, computational science, computer vision, data processing and time series analysis (Haykin, 1994). They have also been successfully used for the identification and the control of dynamical systems, mapping
the input-output representation of an unknown system and, possibly, its control law (Narendra, Parthasarathy, 1990). Perhaps the most popular artificial neural network (ANN) in speech recognition to date is the multilayer perceptron (MLP), which organizes non-linear hidden units into layers and has full weight connectivity between adjacent layers (Sainath et al., 2011). In training, these weights are initialized with small random values, which are then adjusted to the desired task by a learning procedure (Pamucar, Borovic, 2012), (Milicevic, Zupac, 2012). Many training algorithms are based on the gradient descent (GD) or the back-propagation algorithm (BPA), which is one of the most broadly used learning methods (Silva et al., 2008), (Wu et al., 2011), using input data and a target (predicted) output. It uses an objective function E (error/cost/loss function) to assess the deviation of the predicted output values from the observed data. Problems with MLPs relate to the random weight initialization and the non-convex objective function, which can trap training in a poor local minimum. Pre-training provides much better initial weights and thus resolves the first problem for the MLP estimate (Sainath et al., 2011). However, feed-forward neural networks (FNNs) prove to be very successful in solving both of these problems, based largely on the use of the BPA and improved learning procedures, which include better optimization, new types of activation functions, and more appropriate ways to process speech. This also holds for acoustic modelling in speech recognition, sub-word and word level modelling (Mikolov et al., 2012), large vocabulary speech recognition, coding and classification of speech (Collobert, 2008), segmentation and word boundaries (Riecke et al., 2009), (Shahin, Pitt, 2012), as well as the perception of boundaries in acoustic and speech signals (Mesbahi et al., 2011). According to Bojanic and Delic (2009), FNNs can also model the impact of emotion on the variation of speech characteristics at the level of the fundamental frequency of phonation (pitch), the segmental level (changes in articulation quality), and the intra-segmental level (general voice quality, whose acoustic correlates are the glottal pulse shape and the distribution of its spectral energy).
The Serbian language belongs to a small group of tonal languages. For these languages, many successful identification techniques based on FNNs and character-level language models are commonly used. Unlike languages with a stress accent, in which a syllable may simply be stressed or not, and in which minimal pairs of words differ by changes in voice pitch during pronunciation, in the Serbian language a different accent may indicate a difference in morphological categories (Secujski, Pekar, 2014). From 1999 to 2010, scientists from Serbia were engaged in the AlfaNum project in order to resolve some problems related to Serbian speech, such as phoneme-based continuous speech recognition, text-to-speech synthesis, the lack of databases, etc. The results were speech databases and morphological dictionaries of the Serbian language (Delic, 2000), as well
as numerous published papers and books related to speech technologies (Delic et al., 2010), (Pekar et al., 2010), machine learning (Kupusinac, Secujski, 2009), speech synthesis (Secujski et al., 2002), and speech recognition (Pekar et al., 2000). During the same period, Markovic et al. (1999) analyzed Serbian vowels (a, e, i, o, u) and spoken digits (0, 1, 2, ... 9) and detected the abrupt changes in speech signals. They presented the results obtained from natural speech with natural and mixed excitation frames, based on robust recursive and non-recursive approaches and on the non-linear Modified Generalised Likelihood Ratio (MGLR) algorithm for the identification of the non-stationarity of speech. In 2002, Arsenijevic and Milosavljevic explored non-linear models for consonant processing. In 2003, they also presented an MGLR algorithm based on FNNs. As it turned out, labial and dental consonants were significant for the articulation of voice and the understanding of speech, which is essential for synthesized speech, primarily with respect to its intelligibility and naturalness. Protic and Milosavljevic (2005) presented results on the generalization properties of various classes of linear and non-linear models. They also analyzed the variations of test errors caused by the selection of models and modelling mode conditions (Protic, Milosavljevic, 2006). In their research, they used the acoustic model and Gaussian noise to evaluate the impact of noise on speech recognition, the recognition of phonemes of one or more speakers, comprehension, and performance evaluation.
This paper presents a comparative analysis of Serbian vowels (a, e, i, o, u) and non-vowels, i.e. the voiceless and voiced sonants and consonants (labial, dental, anterior and posterior palatal). Men and women pronounced the phonemes, either in the context of words or in isolation. The AR model parameters as well as the specific structure of the FNNs were determined during training, which was based on the BPA. The Levenberg-Marquardt (LM) method was used to calculate the Hessian matrix and the Optimal Brain Surgeon (OBS) was applied to prune the network parameters (Jing, 2012). The stopping criteria were reaching the minima of the normalized sum squared test errors (NSSEtest) and of the final prediction errors (FPEs). A novel method for multidimensional scaling, based on a distance measure, was developed for testing the generalization properties. The results of the spectral analysis are also presented. Speech signals were represented by their spectro-temporal distribution of acoustic energy, the spectrograms. Finally, the NSSEtest and FPE criteria were compared.
The paper is organized as follows. The following chapter deals with models for speech signal prediction. The third chapter describes speech signals. The methodology and the results are presented in Chapter four and the last chapter is the conclusion of the paper. The appendix consists of MATLAB algorithms for processing techniques.
Models
If the voice is the only signal available for speech prediction, a linear two-pole Auto-Regressive (AR) model can be used to minimize the prediction error and to model the speech production system. If the glottal signal is also available, the AR model with an eXtra input (ARX) can be used. In addition, a Moving Average (MA) error correction model may also be taken into account, although it introduces some instability into the learning process, and the instability of the model is possible if the error value is high. However, a fully connected FNN gives the best results, because its parameters may be pruned one by one, down to the partially connected structure which gives the error minimum (Ljung, 1987). Linear models are very suitable for speech signal processing when the structural simplicity of the model is an alternative to the training time or to the minimal processing error. Non-linear models are more complex but also more accurate than linear ones and, consequently, they approximate the transfer functions to a higher degree of accuracy.
Linear AR model
Linear two-pole AR models, with poles at approximately (2n+1)·500 Hz, n = 0, 1, ..., are suitable for modelling the speech production system. For the purpose of this research, a 10-pole AR model was used, which will be presented later in the paper. The AR model is determined by the following expression

$$y(n) + a_1\, y(n-1) + \dots + a_{n_a}\, y(n - n_a) = e(n) \qquad (1)$$

where y(n) is a speech signal sample, a_i (i = 1, ..., n_a) are the AR parameters, and e(n) is an error that contaminates the speech signal with white (temporally independent) or coloured (temporally dependent) noise (Park, Choi, 2008).
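As a brief illustration of (1), the AR parameters of a single speech frame can be fitted by least squares; the sketch below is illustrative only, with assumed variable names, and is not the toolbox routine used later in the paper.

```matlab
% Minimal sketch: least-squares estimation of the AR(na) parameters of
% model (1) for one speech frame y (column vector). Illustrative only,
% not the toolbox routine (nnarx) used in the paper.
na  = 10;                          % model order (AR-10)
N   = length(y);
Phi = zeros(N - na, na);           % matrix of past samples
for k = 1:na
    Phi(:, k) = y(na + 1 - k : N - k);   % column k holds y(n-k)
end
t = y(na + 1 : N);                 % current samples y(n)
a = -(Phi \ t);                    % parameters a_1..a_na (sign as in (1))
e = t + Phi * a;                   % residual e(n) of model (1)
```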
Non-linear model
FNNs with one hidden layer are mathematically expressed in the form

$$y_i(w, W) = F_i\!\left(\sum_{j=1}^{q} W_{ij}\, f_j\!\left(\sum_{l=1}^{m} w_{jl}\, z_l + w_{j0}\right) + W_{i0}\right) \qquad (2)$$

where y_i is the output, z_l is the input, w and W are the synaptic weight matrices, and f_j and F_i are the activation functions of the hidden layer and the output layer, respectively; q and m denote the numbers of elements in the hidden layer and the input layer. In many fundamental network models, the activation functions are of a sigmoid or logistic type, but for the networks used here the activation function is the hyperbolic tangent (tanh)

$$\tanh(z) = \frac{1 - e^{-2z}}{1 + e^{-2z}}$$
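A minimal sketch of the forward pass in (2), with tanh hidden units, a linear output and the bias stored as the last column of each weight matrix (the convention described for W1 and W2 in the appendix); it is illustrative rather than the toolbox implementation.

```matlab
% One-hidden-layer forward pass of (2): tanh hidden units, linear output.
% W1: [q x (m+1)] input-to-hidden weights, W2: [1 x (q+1)] hidden-to-output
% weights, bias as the last column (appendix convention). Illustrative only.
function yhat = fnn_predict(W1, W2, z)
    h    = tanh(W1 * [z; 1]);   % hidden-layer outputs f_j(.)
    yhat = W2 * [h; 1];         % linear output layer F_i(.)
end
```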
Speech signals
The speech production system consists of the lungs, the vocal cords, and the vocal tract. The lungs are the source of airflow and pressure, the vocal cords open and close periodically to produce voiced speech, thus converting the airflow from the lungs into voice (glottal flow), and the vocal tract consists of a set of cavities above the vocal cords. It is an acoustic filter. At the output of this filter, the sound radiates to the surroundings through the lips and the nostrils. The main characteristic of vowels is stationarity over the long term. This feature allows the estimation of models having the minimum of the estimation error. The excitation signal is quasi-periodic and of high power because the airflow from the lungs encounters a small aperture of the vocal cords. For non-vowels, the stationarity is shorter and the model evaluation is more difficult. The excitation signal is noise or a mix containing noise, and it is of less strength because the opening between the vocal cords is wide. It is well known that the analysis of the vibrating vocal cords during phonation presents a challenge because the larynx is not easily accessible. However, a non-invasive method such as electroglottography (EGG) is widely used to determine the glottal signal, and the resulting electroglottogram gives useful information for modelling. The excitation of non-vowels is the same as that of vowels, but the vocal cords do not vibrate. Figure 1 presents the speech and the EGG signal. Figure 2 presents 4000 samples of the vowel 'a' and the corresponding training and testing sets used for the purpose of this research. Figure 3 presents the consonant 's' and the vowel 'a'.
Figure 1 - Speech and the EGG signal Slika 1 - Govorni i EGG signal
Figure 2 - Vowel a, training and test set Slika 2 - Vokal a, trening i test skup
Figure 3 - Consonant s and vowel a Slika 3 - Konsonant s i vokal a
For testing purposes, 10 men and women pronounced all the vowels. The recorded analog signals were afterwards sampled at 8 kHz and 10 kHz. For the purpose of this paper, the results are given for the signals sampled at fs = 8 kHz. The signals consisted of 4,000 samples. Each signal was divided into two equal parts. The training sets (800 samples) were chosen from the first 2,000 samples, while the testing sets (600 samples) were parts of the other 2,000 samples. The resulting sets were normalized by the dscale function (A.1) to have a zero mean and a variance equal to one (Haykin, 1994). This pre-processing removes the offset, the variance and the correlation of the input data. For testing the non-vowels, the analog signals were sampled at fs = 22050 Hz. The phonemes were pronounced either in the context of words or in isolation. The Serbian phonemes were sorted in the following way: (1) voiced sonant (j, l, lj, m, n, nj and r), (2) voiced consonants (f, c, s, t, c, s, h, k, b, p), (3) unvoiced sonant (v), and (4) unvoiced consonants (d, d, dz, z, z, and g).
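The described pre-processing can be sketched with basic MATLAB operations; dscale (A.1) is the routine actually used, and the particular index ranges taken inside each half of the recording are assumptions of this illustration.

```matlab
% Sketch of the pre-processing: scale a 4000-sample recording x to zero
% mean and unit variance (as dscale, A.1), then take an 800-sample
% training set from the first half and a 600-sample test set from the
% second half. The index ranges within each half are illustrative.
x      = (x - mean(x)) / std(x);   % zero mean, variance equal to one
xtrain = x(1:800);                 % from the first 2,000 samples
xtest  = x(2001:2600);             % from the last 2,000 samples
```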
Model learning
Training
For the training sets, the FNN and AR models were trained. Training was carried out by changing the parameters based on the BPA. The LM approximation of the Hessian matrix was used (Svarer, 1995), (Le Cun et al., 1989). The optimal step size of the error change was approximated by a Taylor series (Haykin, 1994), (Svarer, 1995); see (3).
$$E = E_0 + \left(\frac{\partial E}{\partial u}\right)^T \delta u + \frac{1}{2}\,\delta u^T\,\frac{\partial^2 E}{\partial u^2}\,\delta u + \dots \qquad (3)$$

The Gauss-Newton approximation of the error is given by (4)

$$E \approx E_0 + \left(\frac{\partial E}{\partial u}\right)^T \delta u + \frac{1}{2}\,\delta u^T H\,\delta u \qquad (4)$$
E is the approximation of the error function, E_0 is its value at the point of approximation, u is the parameter vector containing the synaptic weights w_{jl} and W_{ij}, δu is the deviation of the parameter vector u, and H is the Hessian matrix.
$$u = \left[u_1, u_2, \dots, u_n\right]^T, \qquad \frac{\partial E}{\partial u} = \left[\frac{\partial E}{\partial u_1}, \frac{\partial E}{\partial u_2}, \dots, \frac{\partial E}{\partial u_n}\right]^T$$

$$H = \begin{bmatrix} \dfrac{\partial^2 E}{\partial u_1^2} & \dfrac{\partial^2 E}{\partial u_1 \partial u_2} & \cdots & \dfrac{\partial^2 E}{\partial u_1 \partial u_n} \\ \dfrac{\partial^2 E}{\partial u_2 \partial u_1} & \dfrac{\partial^2 E}{\partial u_2^2} & \cdots & \dfrac{\partial^2 E}{\partial u_2 \partial u_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 E}{\partial u_n \partial u_1} & \dfrac{\partial^2 E}{\partial u_n \partial u_2} & \cdots & \dfrac{\partial^2 E}{\partial u_n^2} \end{bmatrix}$$
The error minimum and the estimated parameters are given by the formulae

$$\delta u = u^* - u, \qquad \left.\frac{\partial E}{\partial u}\right|_{u=u^*} = 0$$

$$u^* = u - H^{-1}\,\frac{\partial E}{\partial u}$$
The calculation of the inverse of the Hessian matrix is computationally demanding (for an n×n matrix, the number of operations is of the order of n³). The LM algorithm accelerates the process of the matrix estimation, as described by the following expression
$$u^{(i+1)} = u^{(i)} - \left(H\!\left(u^{(i)}\right) + \lambda^{(i)} I\right)^{-1} \left.\frac{\partial E}{\partial u}\right|_{u=u^{(i)}}$$

where λ^{(i)} is the Levenberg-Marquardt parameter (the diagonal damping term), which is adapted during training.
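A minimal sketch of one damped update step, assuming the gradient g and the Gauss-Newton Hessian approximation H have already been computed; the routine actually used in the paper is marq (A.3), which also adapts λ during training.

```matlab
% One Levenberg-Marquardt step: damp the Gauss-Newton Hessian with
% lambda*I and solve for the parameter change. marq (A.3) is the routine
% actually used; this is a bare illustrative step.
du    = -((H + lambda * eye(length(u))) \ g);   % damped Newton direction
u_new = u + du;
% In practice lambda is decreased when the step lowers E and increased
% (with the step rejected) when it does not.
```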
u^{(i)} is the i-th estimate of the parameter vector. For the squared error function, the derivatives are approximated by

$$\frac{\partial E_{train}}{\partial W_j} = -\sum_{p} \varepsilon_p\, \frac{\partial \hat{y}_p}{\partial W_j} \qquad (5)$$

$$\frac{\partial^2 E_{train}}{\partial W_j^2} \approx \sum_{p} \left(\frac{\partial \hat{y}_p}{\partial W_j}\right)^2 \qquad (6)$$

$$\frac{\partial^2 E_{train}}{\partial W_j\, \partial W_k} \approx \sum_{p} \frac{\partial \hat{y}_p}{\partial W_j}\, \frac{\partial \hat{y}_p}{\partial W_k} \qquad (7)$$

i.e. the terms containing the second derivatives of the network output are neglected.
This approximation implies the non-correlation of the input and the error, which may include the non-modelled dynamics (reducible by increasing the order of the model) and may represent the measurement noise in the output data (Jing, 2012). It also ensures the correct direction of the error estimation (Silva et al., 2008). FNN training stops on reaching the minimum of the training error (Etrain) or after 500 parameter changes. FNN training lasts from a few minutes to half an hour, depending on the complexity of the structure. The initial values of the parameters are random. For the FNN training, nnarx (A.2) and marq (A.3) are used. AR-10 is also trained by nnarx. The output y_t is predicted based on its p previous values (8)
$$\hat{y}_t = \sum_{k=1}^{p} a_k\, y_{t-k}, \qquad 1 \le k \le p \qquad (8)$$
The formant characteristics of vowels and the distribution of formant frequencies determine the parameters of the model. The AR model is stable, simple and, in terms of computer resources, not very demanding, so it can model the spectral envelope and make the spectrum of the residuals flat if p is sufficiently large. The prediction optimization is based on the minimum squared error (MSE) criterion; its partial derivatives with respect to the parameters must equal zero.
$$E = \sum_i e_i^2 = \sum_i \left(y_i + \sum_{k=1}^{p} a_k\, y_{i-k}\right)^2$$
In the frequency domain, the modelled signal spectrum tends towards the original signal spectrum as p increases. This becomes computationally demanding and takes a long time, but the results are more accurate. A criterion for the optimization is the threshold criterion
$$1 - \frac{V_p}{V_{p-1}} < \delta$$
V_{p-1} and V_p are the normalized prediction errors for the orders p-1 and p, and δ is the threshold. Typically, the number of coefficients is 8, 10, or 12. These models shift the lower formants by adding biases towards the formants with a high energy value. This creates problems in the analysis of male speech, since the fundamental frequencies of those signals are much lower than the fundamental frequencies of the signals spoken by women or children.
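A sketch of the order-selection loop under the threshold criterion; the least-squares fit reuses the construction from the earlier AR sketch, and the stopping form 1 - V_p/V_{p-1} < δ follows the reconstruction above, so both are assumptions of this illustration.

```matlab
% Sketch: increase the order p until the relative drop of the normalized
% prediction error V_p falls below the threshold delta (form of the
% criterion as reconstructed above). y is a column vector.
delta = 0.01;  Vprev = inf;
for p = 2:20
    Phi = zeros(length(y) - p, p);
    for k = 1:p, Phi(:, k) = y(p + 1 - k : length(y) - k); end
    t  = y(p + 1 : length(y));
    a  = -(Phi \ t);                         % least-squares AR fit
    Vp = mean((t + Phi * a).^2) / var(y);    % normalized prediction error
    if 1 - Vp / Vprev < delta, break; end    % improvement below threshold
    Vprev = Vp;
end
```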
Pruning
The parameters of trained FNNs were OBS pruned (Haykin, 1994), (Norgaard, 2001). The full Hessian matrix is calculated iteratively (Svarer, 1995). The error change is given by the formula
$$\delta E \approx \frac{1}{2}\,\delta u^T H\,\delta u$$

δu is the parameter change. Pruning the parameter u_m to zero requires that

$$\delta u_m + u_m = 0$$

which corresponds to

$$e_m^T\, \delta u + u_m = 0$$

e_m is the unit vector of the same dimension as δu. The goal of this methodology is to prune the parameter u_m which causes the minimum increase in the error E. This gives the Lagrangian

$$L = \frac{1}{2}\,\delta u^T H\,\delta u + \lambda\left(e_m^T\, \delta u + u_m\right)$$

where λ is a Lagrange multiplier. Setting

$$\frac{\partial L}{\partial(\delta u)} = H\,\delta u + \lambda\, e_m = 0$$

and using the fact that the Hessian matrix is positive definite, so that its inverse exists, gives

$$\delta u = -\lambda\, H^{-1} e_m, \qquad \lambda = \frac{u_m}{e_m^T H^{-1} e_m}$$

and finally

$$\delta u = -\frac{u_m}{e_m^T H^{-1} e_m}\, H^{-1} e_m$$
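A sketch of a single OBS elimination step; the saliency u_m^2 / (2 [H^-1]_mm), i.e. the error increase implied by the derivation above, is the standard OBS expression, and nnprune (A.4) is the routine that performs this together with retraining.

```matlab
% One OBS elimination step: compute the saliency of every parameter,
% remove the one with the smallest saliency and adjust the remaining
% parameters with the du expression derived above. Illustrative only;
% nnprune (A.4) adds retraining between eliminations.
Hinv   = inv(H);                           % inverse Hessian
sal    = (u .^ 2) ./ (2 * diag(Hinv));     % saliency of each parameter
[~, m] = min(sal);                         % parameter with least saliency
du     = -(u(m) / Hinv(m, m)) * Hinv(:, m);
u      = u + du;                           % u(m) is driven to zero
```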
The main criterion for stopping the pruning algorithm is to achieve the error minimum. The method that determines the balance between too many and too few parameters takes into account the number of parameters, the size of the training set, the size of the Hessian matrix, the correlation of the input data, etc. It is based on the available data, which enables adjusting the model parameters to the optimum. There are various algorithms for model optimization. The one presented here stops pruning when the generalization error (Egen), the smallest error determined on an independent set of data having the same distribution as the training set, reaches its minimum. The method requires large training and testing sets, but it is widely used and gives good results. Akaike (1969) developed a method for the approximation of the Taylor series expansions of the learning (Elearn) and generalization (Egen) errors (Ljung, 1987), (Larsen, 1993), (Hansen, Rasmusen, 1994), (Kashyap, 1980). The errors are shown in Figure 4 and given by expressions (9) - (10).
Figure 4 - Approximation of Elearn and Egen
Slika 4 - Aproksimacija Elearn i Egen
$$E_{learn} = E_0 + \left(\frac{\partial E}{\partial u}\right)^T \delta u + \frac{1}{2}\,\delta u^T H\,\delta u + o\!\left(\|\delta u\|^3\right) \qquad (9)$$

$$E_{gen} = E^* + \left(\frac{\partial E^*}{\partial u}\right)^T \delta u^* + \frac{1}{2}\,\delta u^{*T} H^*\,\delta u^* + o\!\left(\|\delta u^*\|^3\right) \qquad (10)$$

δu* is the vector of the parameter changes at the minimum of E_gen; ∂E/∂u and ∂E*/∂u are the first derivatives of the given functions, H and H* are the corresponding Hessian matrices, and o(‖·‖³) is the part of the Taylor series which is neglected. The first derivatives at u_0 and u* are equal to zero, so E_learn and E_gen become

$$E_{learn} = E_0 + \frac{1}{2}\,\delta u^T H\,\delta u$$

$$E_{gen} = E^* + \frac{1}{2}\,\delta u^{*T} H^*\,\delta u^*$$
E_test is equal to E_gen(u_0). The problems that arise here are the unknown values of E*, u* and H*. The following assumptions imply that the difference between the values of u_0 and u* is small and that the FNN is well trained. It is also assumed that the second derivative of E_gen can be equated with the second derivative of the training error, so E_gen is given by the formula
$$E_{gen} \approx E^* + \frac{1}{2}\,\delta u^{*T} H\,\delta u^*$$
Akaike's FPE estimate provides a way to estimate E_gen from the given FNN structure, if the number of parameters is known (Akaike, 1969). If the unknown value of the noise variance can be removed from E*, then
$$E_{gen}(u_0) \approx \frac{1 + \frac{N_M}{p}}{1 - \frac{N_M}{p}}\, E_{learn}(u_0)$$
N_M is the dimension of the parameter vector and p is the size of the training set. E_gen can be computed when it is necessary to compare different structures of neural networks, if the training sets are the same (Haykin, 1994), (Svarer, 1995), (Akaike, 1969). It follows that
$$E_{FPE} = K_{FPE}\, E_{learn} = \frac{1 + \frac{N_M}{p}}{1 - \frac{N_M}{p}}\, E_{learn}, \qquad N_M < p$$
K_FPE is the FPE coefficient. As the number of parameters increases, E_learn decreases towards zero. In addition, K_FPE increases from one to infinity when the ratio N_M/p changes from zero to one. Also, the parameter change always leads towards the minimum of E_gen, because it exists. Furthermore, the minimum value of E_FPE exists within the limits determined by the number of parameters, and pruning stops when E_FPE reaches its minimum. It should be noted that Akaike's criterion shows some inconsistency in the determination of AR model orders in the presence of Gaussian noise and when N_M << p. The lower limit of this ratio is 0.156 (Kashyap, 1980).
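The FPE estimate itself is a one-line computation once E_learn, the number of parameters N_M and the training set size p are known; the sketch below merely restates the expression above.

```matlab
% Akaike's FPE estimate of the generalization error from the training
% error Elearn, the number of parameters NM and the training set size p
% (valid for NM < p).
KFPE = (1 + NM / p) / (1 - NM / p);   % FPE coefficient
EFPE = KFPE * Elearn;                 % estimated generalization error
```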
To compare the AR and NNAR models for non-vowels, the FPE gain is introduced (11)
$$G_{FPE} = 10 \log\!\left(\frac{\bar{E}}{FPE}\right) \qquad (11)$$

where Ē is the normalized sum of the squared signal samples and N is the size of the training set

$$\bar{E} = \frac{1}{N}\sum_{i=1}^{N} y^2(i), \qquad FPE = \frac{N+d}{N-d}\cdot\frac{1}{N}\sum_{i=1}^{N}\left(y(i) - \hat{y}(i)\right)^2 \qquad (12)$$
y(i) is the i-th speech sample, ŷ(i) is its estimated value, and d is the number of model parameters.
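A sketch of the FPE gain in (11)-(12) for a model with d parameters and one-step predictions yhat over a training set y; the normalization of Ē as the mean squared signal value follows the reconstruction of (12) above and is therefore an assumption.

```matlab
% FPE gain (11)-(12): ratio of the normalized signal energy to the FPE of
% the model, in dB. y: training signal (column), yhat: its one-step
% prediction, d: number of model parameters. Normalization as assumed above.
N    = length(y);
Ebar = mean(y .^ 2);                               % normalized signal energy
FPE  = (N + d) / (N - d) * mean((y - yhat) .^ 2);  % final prediction error
Gfpe = 10 * log10(Ebar / FPE);                     % gain in dB
```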
Validation
Validation is performed for all the vowels and all speakers on the independent training and test sets. The results presented here are given for the structures that are selected to give the minimum error values for one of the criteria
1) minimum NSSEtest and
2) minimum FPE.
The results are presented in the following chapter.
Results
Vowels
For the training sets of vowels, the linear 10-pole AR model (AR-10) as well as FNNs with 10 inputs, 1 output, and 3, 5, 7, 9 or 11 neurons in the hidden layer were trained. The obtained structures were OBS pruned, with a maximum of 20 retraining iterations at each rejection of a parameter. The MATLAB function nnprune (A.4) was used. The results are the NSSE for the training set (NSSEtrain), the NSSE for the test set (NSSEtest) and the FPE after each parameter elimination. Pruning of the structures presented in this paper lasted from half an hour up to four hours. Validation was carried out by the MATLAB function nnvalid (A.5). Along with the errors of the FNNs, the NSSE for AR-10 (NSSEar) was also calculated. The process of training and testing the AR-10 model lasted a maximum of 10 s. Table 1 shows the minimum error values of the FNN and AR-10 models for all the vowels and all the speakers.
Table 1 - Minimum values of NSSEtest, FPE and NSSEar Tabela 1 - Minimalne vrednosti NSSEtest, FPE i NSSEar
Vowel    NSSEtest    NN        FPE         NN        NSSEar
a2       0,0038      10-7-1    0,0013      10-11-1   0,0605
a3       0,0089      10-13-1   0,0030      10-13-1   0,0152
a6       0,0479      10-11-1   0,0025      10-13-1   0,1057
a7       0,0067      10-5-1    0,0010      10-13-1   0,0109
a8       0,0383      10-9-1    0,0012      10-13-1   0,0817
e2       0,0099      10-13-1   0,0023      10-13-1   0,0509
e3       0,0067      10-7-1    5,12e-04    10-13-1   0,0147
e6       0,0083      10-7-1    0,0011      10-13-1   0,0416
e7       0,0065      10-9-1    0,0016      10-13-1   0,0147
e8       0,0026      10-13-1   0,0040      10-13-1   0,0840
i2       0,0045      10-9-1    4,17e-04    10-13-1   0,0129
i3       0,0044      10-3-1    4,34e-04    10-13-1   0,0132
i6       0,0021      10-13-1   2,44e-04    10-13-1   0,0135
i7       0,0022      10-9-1    6,99e-04    10-13-1   0,0024
i8       0,0047      10-5-1    7,10e-04    10-13-1   0,0162
o2       0,0015      10-13-1   1,85e-04    10-13-1   0,0061
o3       5,10e-04    10-13-1   9,40e-05    10-11-1   0,0015
o6       6,51e-04    10-13-1   1,14e-04    10-13-1   0,0054
o7       9,33e-04    10-13-1   1,41e-05    10-11-1   0,0010
o8       0,0027      10-5-1    3,74e-04    10-7-1    0,0095
u2       9,88e-05    10-11-1   1,98e-04    10-13-1   1,34e-04
u3       2,10e-04    10-9-1    5,07e-05    10-11-1   5,13e-04
u6       3,91e-04    10-7-1    1,15e-04    10-13-1   5,51e-04
u7       3,41e-04    10-7-1    1,13e-04    10-11-1   4,06e-04
u8       1,96e-04    10-7-1    8,34e-05    10-13-1   1,80e-04
Non-vowels (voiceless sonant, voiceless consonant, sonar sonant, and sonar consonant)
Table 2 shows the results of the analysis of non-vowels pronounced by men and women. The FPE gains for AR (Gfpe AR) and NNAR (Gfpe NNAR) are determined.
Table 2 - Gfpe for isolated phonemes Tabela 2 - Gfpe za izolovane foneme
           Gfpe [dB] AR        Gfpe [dB] NNAR      GfpeNNAR - GfpeAR [dB]
Phoneme    Women     Men       Women     Men       Women     Men
B          34,6519   30,167    38,5847   34,3293   3,9328    4,1623
C          2,2075    2,9722    5,6637    6,559     3,4562    3,5868
Č          11,5334   7,0614    15,5047   10,9525   3,9713    3,8911
Ć          8,8546    10,4918   13,1923   14,0135   4,3377    3,5217
D          31,3806   25,9894   37,1691   30,9179   5,7885    4,9285
Đ          29,4399   14,9543   34,0653   19,4483   4,6254    4,494
DŽ         13,6488   12,789    17,5563   16,8545   3,9075    4,0655
F          3,5165    11,365    6,7754    15,1115   3,2589    3,7465
G          28,1664   19,1006   35,7757   24,087    7,6093    4,9864
H          11,064    9,7711    14,2458   12,8508   3,1818    3,0797
J          26,3422   21,7075   30,025    24,9229   3,6828    3,2154
K          19,1537   10,6407   22,8868   14,1928   3,7331    3,5521
L          29,5687   31,0459   32,8997   35,0852   3,331     4,0393
LJ         31,5999   28,8233   35,2027   32,4762   3,6028    3,6529
M          33,4249   36,0233   37,3764   39,9362   3,9515    3,9129
N          33,3766   36,6123   37,4727   40,2027   4,0961    3,5904
NJ         33,6963   34,1175   38,1354   38,2625   4,4391    4,145
P          11,548    34,1175   14,9509   38,4441   3,4029    4,3266
R          24,8785   28,8893   28,4164   33,3224   3,5379    4,4331
S          2,8275    4,9219    6,081     8,1239    3,2535    3,202
Š          10,9265   11,2751   14,222    15,1396   3,2955    3,8645
T          8,351     7,3746    12,3908   11,1201   4,0398    3,7455
V          23,8889   25,068    27,2998   28,3466   3,4109    3,2786
Z          10,1461   14,3512   13,7734   18,4116   3,6273    4,0604
Ž          12,441    12,4569   15,9663   15,808    3,5253    3,3511
The average Gfpe for the NNAR model is approximately 4 dB higher than the Gfpe for the AR model, indicating better properties of NNAR as compared to the same order AR model. Figure 5 shows Gfpe for the phonemes that were pronounced out of the context of words (isolated).
Figure 5 - FPE gain for phonemes pronounced in isolation
Slika 5 - FPE pojacanje za izolovano izgovorene foneme
From Table 2, the following grouping of phonemes can be noticed: sonar sonant (j, l, lj, m, n, nj, and r), which always have a high value of Gfpe, and sonar consonant (f, c, s, t, c, s, h, and k), which do not have it, excluding "b" and "p". The voiceless sonant "v" has a high value of GFPE, while voiceless consonants (d, d, dz, z, z, and g) mostly have an average Gfpe, depending on the model or the gender of a speaker.
Table 3 and Figure 6 show the results of the analysis for non-vowels that were spoken in the context of words.
Table 3 - Gfpe for phonemes pronounced in the context of words Tabela 3 - Gfpe za foneme izgovorene u kontekstu reci
           Gfpe [dB] AR        Gfpe [dB] NNAR      GfpeNNAR - GfpeAR [dB]
Phoneme    Women     Men       Women     Men       Women     Men
B          32,6346   33,0303   36,0158   36,1753   3,3812    3,145
C          2,6291    1,7716    5,7897    5,2355    3,1606    3,4639
Č          12,2942   9,0452    16,2301   12,6387   3,9359    3,5935
Ć          9,442     10,5968   13,6114   14,7664   4,1694    4,1696
D          33,9978   30,7685   37,8104   35,2031   3,8126    4,4346
Đ          14,3192   13,5345   17,7639   16,9531   3,4447    3,4186
DŽ         14,9507   0         17,5563   0         2,6056    0
F          0         0         0         0         0         0
G          33,1103   27,2414   36,7543   30,8057   3,644     3,5643
H          10,365    7,9697    13,5085   11,018    3,1435    3,0483
J          21,1887   27,3908   24,6641   30,9087   3,4754    3,5179
K          11,6569   14,5669   15,9135   18,2566   4,2566    3,6897
L          25,7282   30,6369   30,3882   34,2097   4,66      3,5728
LJ         21,0459   33,7669   26,0215   37,3985   4,9756    3,6316
M          33,4249   31,5708   37,0044   34,8602   3,5795    3,2894
N          26,9488   33,0333   30,6018   36,9415   3,653     3,9082
NJ         27,4645   30,1661   32,5246   35,1878   5,0601    5,0217
P          6,1535    11,4462   11,3976   14,9627   5,2441    3,5165
R          26,3231   25,6694   30,0101   30,2615   3,687     4,5921
S          1,9138    3,5753    5,0039    7,017     3,0901    3,4417
Š          13,9972   11,201    17,5547   14,9197   3,5575    3,7187
T          11,3133   4,3862    15,359    7,5557    4,0457    3,1695
V          29,2793   33,852    32,6862   37,532    3,4069    3,68
Z          13,8724   23,0607   17,2613   27,119    3,3889    4,0583
Ž          0         0         0         0         0         0
Figure 6 - FPE gain for non-vowels pronounced as parts of words
Slika 6 - FPE pojacanje za nevokale izgovorene u delovima reci
Generalization properties
To test the generalization of the given models, whose parameters were estimated on the training set of one speaker, the testing was carried out on the sets of other speakers. The minima of NSSEtest and FPE, as well as the matrices of mean(NSSEtest), were calculated for all the vowels. Figure 7 shows mean(NSSEtest) for the vowel 'a' and the FNN structure 10-3-1. The arrows mark the points of error jumps, which are evident for 5-8, 13-18, and 21-25 parameters remaining in the FNN after pruning.
Figure 7 - Mean(NSSETEST) Slika 7 - Srednja vrednost NSSEtest
To measure the variability, a new distance measure based on FNNs is defined. Two models of the speech signals are given by the formulae (13) and (14)

$$y_{i,t} = g\!\left(y_{i,t-1},\, s_{t-1}(y_i)\right) + \varepsilon_{i,t}, \qquad i = 1, 2 \qquad (13)$$

$$\hat{y}_{i,t} = g_{W_{Y_i}}\!\left(y_{i,t-1},\, s_{t-1}(y_i)\right) + \varepsilon_{i,t}, \qquad i = 1, 2 \qquad (14)$$

For the parameters that are estimated on the corresponding training sets, the variable

$$C\!\left(W_{Y_i}, Y\right) = \sum_{t} \left(y_t - g_{W_{Y_i}}\!\left(y_{t-1},\, s_{t-1}(y)\right)\right)^2 \qquad (15)$$

represents the MSE of the residuals of the signal Y for g_{W_{Y_i}} (the transfer function of the FNN trained with Y_i), giving the expression for the distance between two signals Y_1 and Y_2

$$D\!\left(Y_1, Y_2\right) = \frac{1}{2}\left[\log\frac{C\!\left(W_{Y_1}, Y_2\right)}{C\!\left(W_{Y_2}, Y_2\right)} + \log\frac{C\!\left(W_{Y_2}, Y_1\right)}{C\!\left(W_{Y_1}, Y_1\right)}\right] \qquad (16)$$
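A sketch of the distance (16) between two signals y1 and y2; predict1 and predict2 are hypothetical function handles standing in for the one-step predictors of the two trained FNNs, so the names and the calling convention are assumptions of this illustration.

```matlab
% Distance (16) between signals y1 and y2. predict1/predict2 are
% hypothetical one-step predictors obtained by training a model on y1
% and on y2, respectively; C(.,.) is the sum of squared residuals (15).
C11 = sum((y1 - predict1(y1)) .^ 2);   % model of y1 applied to y1
C12 = sum((y2 - predict1(y2)) .^ 2);   % model of y1 applied to y2
C21 = sum((y1 - predict2(y1)) .^ 2);   % model of y2 applied to y1
C22 = sum((y2 - predict2(y2)) .^ 2);   % model of y2 applied to y2
D12 = 0.5 * (log(C12 / C22) + log(C21 / C11));
```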
Originally developed for regression problems, the MSE function is obtained from the maximum likelihood principle assuming the independence and Gaussianity of the target data (Bishop, 1995), (Silva et al., 2008). Although most classical approaches in speech processing are based on linear techniques, which rely on the source-filter model, these linear approaches cannot capture the complex dynamics of speech. It has been shown that the Gaussian linear prediction analysis cannot extract all the dynamical structures of real speech time series (Khanagha et al., 2012), (Little et al., 2006). However, in this particular case, the difference between a speech signal sample and its predicted value is temporally independent, so e(n) in the model (1) is assumed to be a zero-mean white Gaussian process (Markovic et al., 1999) and the prediction error ε is given by the formula
$$\varepsilon = e_k = y_k + \sum_{i=1}^{p} a_i\, y_{k-i} \qquad (17)$$
In their work, Stanimirovic and Cirovic (2008) describe an adaptive algorithm for the classification of speech and pause, and describe the noise and the residuals, which are Gaussian; in this case this holds for both Y1 and Y2 (Park, Choi, 2008). Table 4 shows the distances D_NSSEtest, D_FPE and D_AR-10.

Table 4 - D_NSSEtest, D_FPE and D_AR-10
Tabela 4 - D_NSSEtest, D_FPE i D_AR-10
Vowel a
       D_NSSEtest                        D_FPE                             D_AR-10
       a2    a3    a6    a7    a8        a2    a3    a6    a7    a8        a2    a3    a6    a7    a8
a2     0     2,58  1,77  1,43  1,24      0     2,29  1,71  1,54  1,01      0     1,26  0,89  0,64  0,58
a3     2,58  0     1,67  2,32  1,85      2,29  0     1,60  2,76  1,75      1,26  0     0,74  1,40  1,37
a6     1,77  1,67  0     2,66  1,51      1,71  1,60  0     2,56  1,44      0,89  0,74  0     1,56  1,17
a7     1,43  2,32  2,66  0     2,19      1,54  2,76  2,56  0     2,11      0,64  1,40  1,56  0     1,28
a8     1,24  1,85  1,51  2,19  0         1,01  1,75  1,44  2,11  0         0,58  1,37  1,17  1,28  0

Vowel e
       e2    e3    e6    e7    e8        e2    e3    e6    e7    e8        e2    e3    e6    e7    e8
e2     0     3,92  2,82  1,93  1,17      0     3,90  1,95  0,47  0,86      0     2,42  2,02  0,59  1,15
e3     3,92  0     1,46  2,24  3,67      3,90  0     2,46  2,60  3,77      2,42  0     0,36  1,29  1,93
e6     2,82  1,46  0     1,96  0,05      1,95  2,46  0     2,18  1,80      2,02  0,36  0     0,97  1,41
e7     1,93  2,24  1,96  0     0,32      0,47  2,60  2,18  0     1,25      0,59  1,29  0,97  0     1,17
e8     1,17  3,67  0,05  0,32  0         0,86  3,77  1,80  1,25  0         1,15  1,93  1,41  1,17  0

Vowel i
       i2    i3    i6    i7    i8        i2    i3    i6    i7    i8        i2    i3    i6    i7    i8
i2     0     1,16  2,20  0,88  1,37      0     1,16  2,20  0,88  1,37      0     0,00  2,63  0,91  1,20
i3     1,16  0     3,04  1,36  2,29      1,16  0     3,04  1,36  2,29      0,00  0     2,59  0,86  1,21
i6     2,20  3,04  0     2,17  2,30      2,20  3,04  0     2,17  2,30      2,63  2,59  0     1,09  1,76
i7     0,88  1,36  2,17  0     2,00      0,88  1,36  2,17  0     2,00      0,91  0,86  1,09  0     1,27
i8     1,37  2,29  2,30  2,00  0         1,37  2,29  2,30  2,00  0         1,20  1,21  1,76  1,27  0

Vowel o
       o2    o3    o6    o7    o8        o2    o3    o6    o7    o8        o2    o3    o6    o7    o8
o2     0     3,44  3,08  1,16  1,74      0     2,83  2,29  2,98  2,00      0     1,10  1,16  0,56  0,54
o3     3,44  0     3,45  1,70  4,00      2,83  0     3,56  4,27  3,42      1,10  0     0,68  1,09  1,48
o6     3,08  3,45  0     2,05  3,53      2,29  3,56  0     4,81  3,33      1,16  0,68  0     1,57  1,71
o7     1,16  1,70  2,05  0     1,68      2,98  4,27  4,81  0     4,33      0,56  1,09  1,57  0     0,98
o8     1,74  4,00  3,53  1,68  0         2,00  3,42  3,33  4,33  0         0,54  1,48  1,71  0,98  0

Vowel u
       u2    u3    u6    u7    u8        u2    u3    u6    u7    u8        u2    u3    u6    u7    u8
u2     0     5,10  1,16  0,23  1,00      0     1,78  1,33  1,18  2,33      0     0,68  0,99  0,18  0,79
u3     5,10  0     4,18  4,13  1,70      1,78  0     0,56  1,22  2,48      0,68  0     0,45  0,62  1,28
u6     1,16  4,18  0     0,63  1,10      1,33  0,56  0     0,25  1,21      0,99  0,45  0     0,34  0,80
u7     0,23  4,13  0,63  0     0,03      1,18  1,22  0,25  0     1,13      0,18  0,62  0,34  0     0,34
u8     1,00  1,70  1,10  0,03  0         2,33  2,48  1,21  1,13  0         0,79  1,28  0,80  0,34  0
For the purpose of this work, the signals within the FNN structure were also analysed. The training and the pruning of the FNN (10-3-1 structure) were based on a joined training set formed in the following way: the signal sets of the vowels a3, a4, a6, a7, and a8 were 'glued' one after another. The testing was done with the corresponding test set. The total length of the joined training set was 4,000 samples and the total length of the joined test set was 3,000 samples. The validation was performed on an independent vowel, in this particular case the vowel 'a' pronounced by the second speaker (a2). Error jumps occurred after 22, 13, and 5 parameters remained after pruning. The NSSEtrain, NSSEtest, and FPE are shown in Figure 8. The graph also shows the NSSEar for the AR-5, AR-10, and AR-15 models.
For each structure, the spectra of the signals at the outputs of the neurons in the hidden layer were also analysed. Figure 9 shows the spectra of the validation signal and of the signals at the outputs of the hidden-layer neurons for 5, 13, and 22 parameters remaining after pruning the 10-3-1 FNN. The spectra were calculated by Burg's method. It is evident that the spectra of the hidden-layer outputs group around the formant frequencies of the validation signal. The signal from one neuron shows strong grouping around one formant frequency and weaker grouping around the others. It should be noted that the 2nd neuron was rejected when 5 parameters remained. The FNNs whose total number of parameters after pruning exceeds 25 show overfitting in the assessment of the validation signal, while the FNNs with 5 parameters remaining make a good assessment, which indicates the existence of a non-linear structure with a minimal number of parameters.
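A sketch of the spectral comparison with Burg's method; pburg requires the Signal Processing Toolbox, and the model order, the FFT length and the variable names (xval, hidden1) are illustrative assumptions.

```matlab
% Burg power spectral density estimates of the validation signal xval and
% of one hidden-neuron output hidden1 (Signal Processing Toolbox).
% Order 16 and the 512-point FFT grid are illustrative choices.
fs = 8000;
[Pxx, f] = pburg(xval,    16, 512, fs);   % spectrum of the validation signal
[Phh, ~] = pburg(hidden1, 16, 512, fs);   % spectrum of a hidden-neuron output
plot(f, 10*log10(Pxx), f, 10*log10(Phh));
xlabel('Frequency [Hz]'); ylabel('PSD [dB/Hz]');
```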
In addition to the above, the cross-correlations up to the shift 30 of the given signals were also determined. The cross-correlation of two signals x1 and x2 is given by the following expression
$$xcorr_{x,y}(k) = \sum_{n} x(n)\, y(n-k) \qquad (18)$$
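A sketch of the cross-correlation analysis for two hidden-neuron outputs h1 and h2; xcorr requires the Signal Processing Toolbox, and the normalization and the exact lag indexing of the averages reported in Table 5 are assumptions of this illustration.

```matlab
% Cross-correlation of two hidden-neuron outputs h1 and h2 up to lag 30,
% cumulative sum of its absolute values and the average over the first
% five lags (Signal Processing Toolbox; normalization is illustrative).
c    = xcorr(h1, h2, 30, 'coeff');   % lags -30..30, normalized
cpos = c(32:end);                    % lags 1..30
csum = cumsum(abs(cpos));            % cumulative sum of absolute values
avg5 = csum(5) / 5;                  % average magnitude over lags 1..5
```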
The results of the cross-correlation analysis are shown in Figure 10.
Figure 10 - Crosscorrelations Slika 10 - Kroskorelacije
Moreover, as shown in Figure 11, the cumulative sum of the absolute values of the cross-correlations was given for the clarity of results.
Figure 11 - Cumulative sum of the absolute values of cross-correlation signals Slika 11 - Kumulativne sume apsolutnih vrednosti kroskorelacionih signala
The average cross-correlation is shown in Table 5. The ratio of the first and the fifth element of the cumulative sums (of the absolute values of the cross-correlations) to the number of signal shifts (1 and 5) gives the average values of the given magnitude. These values are in the range from 0.117 to 0.387 for 13 and 22 parameters remaining, which represents a weak stochastic dependence. The value for five parameters is 0.636, which indicates a medium stochastic dependence.
Table 5 - Cumulative sum of cross-correlations at the outputs of hidden-layer neurons Tabela 5 - Kumulativne sume kroskorelacija na izlazima neurona skrivenog sloja
Parameters    1st and 2nd neuron          1st and 3rd neuron          2nd and 3rd neuron
remaining     1st element  av(xcorr),     1st element  av(xcorr),     1st element  av(xcorr),
                           nmax = 5                    nmax = 5                    nmax = 5
5             -            -              0,6714       0,63618        -            -
13            0,3408       0,37804        0,2690       0,23198        0,1836       0,14376
22            0,4391       0,36744        0,1975       0,23222        0,0696       0,11702
Conclusion
This paper presents a comparative analysis of Serbian phonemes (vowels and non-vowels). The FNN and AR-10 models were trained and tested. The characteristics of vowels are long-term quasi-periodicity and a power spectrum with clearly visible formants. Non-vowels are characterised by short-term quasi-periodicity and a low-power excitation signal. The methodology of generalization enabled the choice of network architectures with improved properties, based on pruning and a significant reduction of the model parameters. The limiting architectures are characterized by a minimal number of parameters within the given margins of errors. In order to review the discriminatory properties of the selected models, a new method for multidimensional scaling based on the measurement of distance was developed. The analysis of the discrimination loss suggests that the FNNs have a much higher discrimination power, which makes them usable in a wide class of speech recognition applications. The spectral analysis shows a good correlation of the signals at the outputs of the hidden-layer neurons with the input signal. The time-domain analysis indicates a weak statistical dependence of these signals for the low orders of cross-correlation (up to the fifth order). The analyses indicate a slight advantage of the NSSEtest criterion compared to the FPE criterion. If the training sets are short, the FPE is an acceptable criterion. The results indicate that the proposed FNN model, as well as the choice of the architecture with the best generalization properties, provides high accuracy and an internally distributed structure that corresponds to the natural time-frequency content of the input signals, as well as high discrimination properties for the same number of parameters, as compared to the traditional linear model.
Appendix
A.1 DSCALE_
[X,Xscale]=dscale(X) scales data to zero mean and variance 1. INPUTS:
X: Data matrix (dimension is # of data vectors in matrix * # of data points) OUTPUTS:
X: Scaled data matrix
Xscale: Matrix containing sample mean (column 1) and standard deviation (column 2) for each data vector in X.
A.2 NNARX_
Determine a nonlinear ARX model of a dynamic system by training a two-layer neural network with the Marquardt method. The function can handle multi-input systems (MISO).
[W1,W2,critvec,iteration,lambda]=nnarx(NetDef,NN,W1 ,W2,trparms,Y,U) INPUTS:
U: Input signal (= control signal) (left out in the nnarma case)
dim(U) = [(inputs) * (# of data)] Y: Output signal. dim(Y) = [1 * # of data] NN: NN=[na nb nk].
na = # of past outputs used for determining prediction nb = # of past inputs used for determining prediction nk = time delay (usually 1)
For multi-input systems nb and nk contain as many columns as there are inputs. W1,W2: Input-to-hidden-layer and hidden-to-output layer weights. If they are passed as [] they are initialized automatically
trparms : Contains parameters associated with the training (see MARQ), if trparms=[] it is reset to trparms = [500 0 1 0]. For time series (NNAR models), NN=na only. See the function MARQ for an explanation of the remaining input arguments as well as of the returned variables.
A.3 MARQ_
Train a two layer neural network with the Levenberg-Marquardt method. If desired, it is possible to use regularization by weight decay. Also pruned (ie. not fully connected) networks can be trained. Given a set of corresponding input-output pairs and an initial network
[W1,W2,critvec,iteration,lambda]=marq(NetDef,W1,W2,PHI,Y,trparms)
trains the network with the Levenberg-Marquardt method. The activation
functions can be either linear or tanh. The network architecture is defined by the
matrix 'NetDef' which has two rows. The first row specifies the hidden layer and
the second row specifies the output layer.
E.g.: NetDef = ['LHHHH'; 'LL---'] (L = Linear, H = tanh)
Notice that the bias is included as the last column in the weight matrices.
INPUT:
NetDef: Network definition
W1: Input-to-hidden-layer weights. The matrix dimension is dim(W1) = [(# of hidden units) * (inputs + 1)] (the 1 is due to the bias)
W2: hidden-to-output layer weights, dim(W2) = [(outputs) * (# of hidden units + 1)] PHI: Input vector. dim(PHI) = [(inputs) * (# of data)] Y : Output data. dim(Y) = [(outputs) * (# of data)] trparms : Vector containing parameters associated with the training trparms = [max_iter stop_crit lambda D] max_iter : max # of iterations. stop_crit : Stop training if criterion is below this value lambda: Initial Levenberg-Marquardt parameter D: Row vector containing the weight decay parameters. If D has one element, a scalar weight decay will be used. If D has two elements, the first element will be used as weight decay for the hidden-to-output layer while the second one will be used for the input-to hidden-layer weights. For individual weight decays, D must contain as many elements as there are weights in the network. Default values are (obtained if left out): trparms = [500 0 1 0]
OUTPUT:
W1, W2 : Weight matrices after training
critvec: Vector containing the criterion evaluated at each iteration iteration: # of iterations
lambda: The final value of lambda. Relevant only if retraining is desired
A.4 NNPRUNE_
This function applies the Optimal Brain Surgeon (OBS) strategy for pruning neural network models of dynamic systems. That is networks trained by NNARX, NNOE, NNARMAX1, NNARMAX2, or their recursive counterparts. [theta_data,NSSEvec,FPEvec,NSSEtestvec,deff,pvec]=... nnprune(method,NetDef,W1,W2,U,Y,NN,trparms,prparms,U2,Y2,skip,Chat)
INPUT:
method: The function applied for generating the model. For example method='nnarx' or method='nnoe' NetDef, W1, W2, U, Y,trparms: See for example the function MARQ
U2,Y2: Test data. This can be used for pointing out the optimal network architecture is achieved. Pass two []'s if a test set is not available. skip (optional): See for example NNOE or NNARMAX1/2. If passed as [] it is set to 0.
Chat (optional): See NNARMAX1 prparms: Parameters associated with the pruning session prparms = [iter RePercent] iter: Max. number of retraining iterations RePercent : Prune 'RePercent' percent of the remaining weights (0 = prune one at a time) if passed as [], prparms=[50 0] will be used.
OUTPUT:
theta_data: Matrix containing the parameter vectors saved after each weight elimination round.
NSSEvec: Vector containing the training error (SSE/2N) after each weight elimination.
FPEvec: Contains the FPE estimate of the average generalization error NSSEtestvec : Contains the normalized SSE evaluated on the test set deff: Contains the "effective" number of weights pvec: Index to the above vectors
A.5 NNVALID_
Validate a neural network input-output model of a dynamic system. I.e., a network model which has been generated by NNARX, NNRARX, NNARMAX1+2, NNRARMX1+2, or NNOE. The following plots are produced: o Observed output together with predicted output o Prediction error
o Auto-correlation function of prediction error and cross-correlation between the prediction error and input
o A histogram showing the distribution of the prediction errors o Coefficients of extracted linear models Network generated by NNARX (or NNRARX):
[Yhat,NSSE] = nnvalid('nnarx',NetDef,NN,W1 ,W2,Y,U) Network generated by NNARMAX1 (or NNRARMAX1):
[Yhat,NSSE] = nnvalid('nnarmax1',NetDef,NN,W1 ,W2,C,Y,U) Network generated by NNARMAX2 (or NNRARMX2):
[Yhat,NSSE] = nnvalid('nnarmax2',NetDef,NN,W1 ,W2,Y,U) Network generated by NNOE:
[Yhat,NSSE] = nnvalid('nnoe',NetDef,NN,W1 ,W2,Y,U) Network generated by NNARXM:
[Yhat,NSSE] = nnvalid('nnarxm',NetDef,NN,W1 ,W2,Gamma,Y,U) NB: For time-series, U is left out!
References
Akaike, H., 1969, Fitting Autoregressive Models for Prediction. Ann. Ins. Stat. Mat.
Arsenijevic, D., Milosavljevic. M., 2002, Analysis of Neural Network Models in Serbian Speech Consonants, Electronic Review, Faculty of Electrical Engineering, Banja Luka.
Bishop, C., 1995, Neural networks for pattern recognition. Oxford University Press.
Bojanic, M., Delic, V., 2009, Automatic Emotion Recognition in Speech: Possibility and Significance. Electronics, Vol.13, No.2, pp.35-40.
Collobert, R., Weston, J., 2008, A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International conference on machine learning, pp.160-167. New York, NY, USA.
Delic, V., 2000, Speech Databases in Serbian Language Recorded with the AlfaNum Project. DOGS conference, pp.29-32, September 21st-22nd, 2000, Novi Sad.
Delic, V., Secujski, M., Jakovljevic, N., Janev, M., Obradovic, R., Pekar, D., 2010, Speech Technologies for Serbian and Kindred South Slavic Languages. Chapter 9 in: Shabtai, N. (ed.), Advances in Speech Recognition, pp.141-165.
Hansen, L.K., Rasmusen, C.E., 1994, Pruning from adaptive regularization. Neural Computation 6(6), pp.1223-1232.
Haykin, S., 1994, Neural networks: A comprehensive foundation. New York: Macmillan.
Kashyap, R.L., 1980, Inconsistency of the AIC Rule for Estimating the Order of AR Models. IEEE Transactions on Automatic Control, AC-25, pp.996-998.
Khanagha, V., Yahia, H., Daoudi, K., 2011, Reconstruction of Speech Signals from Their Unpredictable Points Manifold, Nonlinear Speech Processing, 2011 7015, pp.1-7, Available at
http://hal.inria.fr/docs/00/64/71/97/PDF/KHANAGHA_Reconstruction_of_speech _from_UPM.pdf, Retrieved on January 22, 2014.
Kupusinac, A., Secujski, M., 2009, Part of Speech Tagging Based on Combining Markov Model and Machine Learning. Speech and Language. November 13th-14th, 2009, Belgrade.
Larsen, J., 1993, Design of Neural Networks, Ph.D. Thesis. Electronic Institute, DTH, Lyngby.
Le Cun, Y., Denker, J.S., Solla, S.A., 1989, Optimal Brain Damage. Advances in Neural Information Processing Systems 2, pp.598-605.
Little, M., McSharry, P.E., Moroz, I., Roberts, S., 2006, Testing the assumptions of linear prediction analysis in normal vowels. Journal of the Acoustic Society of America, 119, pp.549-558.
Ljung, L., 1987, System Identification: Theory for the User, Prentice Hall
Inc.
Markovic, M., Milosavljevic, M., Kovacevic, M., Veinovic, M., 1999, Robust AR Speech Analysis Based on MGLR Algorithm and Quadratic Clasifier with Sliding Training Set. In Proceedings of IMACS/IEEECSCC'99, pp.2401-2408.
Mesbahi, L., Jouvet, D., Bonneau, A., Fohr D., Illina, I. Laprie, Y., 2011, Reliability of non-native speech automatic segmentation for prosodic feedback. In SlaTE, 2011, Venice, Italy.
Milicevic, M.R., Zupac, Z.G., 2012, Objektivni pristup odredivanju tezina kriterijuma. Vojnotehnicki glasnik/Military Technical Courier, Vol.60, No.1, pp.39-56.
Mikolov, T., Sutskever, I., Deoras, A., Le, H.S., Kombrink, S., Cernocky, J., 2012, Subword language modelling with neural networks. Unpublished.
Narendra, K.S., Parthasarathy, K., 1990, Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1, p.4.
Norgaard, M., 2001, Neural Network Based System Identification Toolbox, Version 1.2, Technical University of Denmark, Department of Automation Department of Mathematical Modelling, Technical Report 97-E-851.
Pamucar, S.D., Borovic, D.B., 2012, Optimizing models for production and inventory control using genetic algorithm. Vojnotehnicki glasnik/Military technical courier. Vol. 60, (No.1), pp.14-38.
Park, S., Choi, S., 2008, A constrained sequential EM algorithm for speech enhancement, Neural Networks 21, pp.1401-1409.
Pekar, D., Obradovic, R., Delic, V., Krco, S., Senk, V., 2002, Connected Words Recognition. DOGS conference, September 21st-22nd, 2002, pp.21-24, Novi Sad.
Pekar, D., Miskovic, D., Knezevic, D., Vujnovic Sedlar, N., Secujski, M., Delic, V., 2010, Chapter 7 in the Shabtai, N. ed book Advances in Speech Recognition, pp.105-122.
Protic, D., Milosavljevic, M., 2005, Generalizaciona svojstva razlicitih klasa linearnih i nelinearnih modela govornog signala, Festival informatickih dostignuca INFOFEST, Festivalski katalog, pp.247-258, Budva.
Protic, D., Milosavljevic, M., 2006, NNARX Model of Speech Signal Generating System: Test Error Subject to Modeling Mode Selection, Conference MIEL, IEEE Catalog, May 2006, pp.685-688, Belgrade.
Riecke, L., Esposito, F., Bonte, M., Formisano, E. 2009, Hearing illusory sound in noise: the timing of sensory-perceptual transformations in auditory cortex, Neuron 64, pp.550-561.
Sainath, T.N., Kingsbury, B., Ramabhadran, B., Fousek, P., Novak, P., Mohamed, A., 2011, Making deep belief networks effective for large vocabulary continuous speech recognition. In: Automatic Speech Recognition and Understanding, 2011 IEEE Workshop, 11-15 December 2011, pp.30-35, Waikoloa, HI.
Secujski, M., Pekar, D., 2014, Evaluacija razlicitih aspekata kvaliteta sintetizovanog govora. Available at
http://www.savez-slijepih.hr/hr/kategorija/evaluacija-razlicitih-aspekata-kvaliteta-sintetizovanog-govora-452/. Retrieved on February 16, 2014.
Shahin, A.J., Pitt, M.A., 2012, Alpha activity marking word boundaries mediates speech segmentation. European Journal of Neuroscience, Vol.36, pp.3740-3748.
Silva, L., Marques de Sa, J., Alexandre, L.A., 2008, Data classification with multilayer perceptrons using a generalized error function. Neural Networks 21, pp.1302-1310.
Stanimirovic, Lj., Cirovic, Z., 2008, Digitalna obrada govornog signala, Retrieved from www.viser.edu.rs/download/uploads/2371.pdf Accessed January 24, 2013.
Svarer, C., 1995, Neural Networks for Signal Processing, Technical University of Denmark.
Wu, W., Wang, J., Cheng, M., Li., Z., 2011, Convergence analysis of online gradient method for BP neural networks. Neural Networks 24, pp.91-98.
UPOREDNA ANALIZA FONEMA SRPSKOG JEZIKA: LINEARNI I NELINEARNI MODELI
OBLAST: telekomunikacije
VRSTA CLANKA: originalni naucni clanak
JEZIK CLANKA: engleski
Sazetak
U radu je prikazana analiza karakteristika vokala i nevokala srpskog jezika. Vokale karakterise kvaziperiodicnost i spektar snage signala sa do-bro uocljivim formantima. Nevokale karakterise kratkotrajna kvaziperiodicnost i mala snaga pobudnog signala. Vokali i nevokali modelovani su linearnim AR modelima i odgovarajucim nelinearnim modelima koji su generi-sani kao feed-forward neuronska mreza sa jednim skrivenim slojem. U procesu modelovanja koriscena je minimizacija srednje kvadratne greske sa propagacijom unazad, a kriterijum izbora optimalnog modela jeste zau-stavljanje obucavanja, kada normalizovana srednja kvadratna test greska ili finalna greska predikcije dostignu minimalnu vrednost. LM metod kori-scen je za proracun inverzne Hessianove matrice, a za pruning je upotre-bljen Optimal Brain Surgeon. Prikazana su generalizaciona svojstva signala u vremenskom i frekvencijskom domenu, a kroskorelacionom anali-zom utvrden je odnos signala na izlazima neurona skrivenog sloja.
Uvod
Unazad nekoliko godina NN su primenjivane u procesima obrade podataka, pa samim tim i govornog signala. Znacajan napredak u ovoj oblasti krece se u pravcu ubrzanja konvergencije algoritama obucavanja. Pored izbora strukture NN, izbor prenosnih funkcija takode je veoma bitan. Nadzirano obucavanje sa ulaznim podacima i predefinisanim izla-zom zahtevaju koriscenje funkcije gubitaka ili greske za utvrdivanje od-stupanja ocekivane, prediktovane vrednosti od tacnih vrednosti podataka. Od mnogo primenjenih algoritama u radu je koriscen BPA, koji je istovremeno i najrasprostranjeniji algoritam obucavanja u ovoj oblasti. Analizirani su vokali i nevokali koje su izgovarali i muskarci i zene, u kon-tekstu reci ili izolovano. BPA je koriscen uz standardni gradijentni metod, koji je prilagoden LM metodom. U radu je koriscen OBS za pruning. Kriterijum zaustavljanja pruninga su minimizacija NSSEtest i FPE.
Prikazane su vrednosti dobijenih gresaka za vokale i nevokale, pojacanja FPE, kao i rezultati kroskorelacione analize signala na izlazi-ma neurona skrivenog sloja FNN.
Modeli
Ukoliko je u obradi govora dostupan samo govorni signal koriste se AR modeli sa dva pola na priblizno (2n+1 )*500Hz, n = 0, 1,... Ukoliko je na raspolaganju i signal sa glotisa koriste se ARX linearni modeli
sa dodatnim ulazom. Uz to, pokretna srednja vrednost greske koristi se u ARMA(X) modelima, kada je dostupna korekcija greske. Medutim, ta-da postoji problem nestabilnosti u procesu obucavanja ukoliko je vrednost greske velika, sto moze dovesti do nestabilnosti modela. Zbog toga se u modelovanju koristi nelinearna FNN na koju je moguce priment pruning, odnosno proces odbacivanja viska parametara u odnosu na potpuno povezanu strukturu, tako da ukupna greska obucavanja ne prelazi dozvoljenu vrednost. Kriterijum zaustavljanja pruninga je dosti-zanje minimuma NSSEtest, NSSEtrain ili FPE. Nelinearni modeli su, u opstem slucaju, tacniji, ali proces njihovog obucavanja traje duze.
Obucavanje modela
FNN i AR modeli su obucavani trening skupovima. Obucavanje je izvedeno promenom parametara po BPA. Koriscena je LM aproksima-cija za proracun Hessianove matrice. Optimalni korak promene greske aproksimiran je Taylor-ovim nizom. Aproksimacija drugog reda ukazuje na nekorelisanost ulaza sa dobijenom greskom, sto omogucuje ispra-van smer korekcije greske. Koriscene su MATLAB-ove metode nnarx i marq. Treniran je i AR-10 ciji je red jednak broju ulaza u FNN (10), odnosno procenjeni izlaz dobijen je na osnovu 10 prethodnih vrednosti datog signala. Inicijalna vrednost parametara je slucajna. Formantne karakteristike vokala su takve da njihov broj i raspored odreduju parametre modela. AR model je stabilan, jednostavan i racunarski malo zahtevan. Predikcija je bazirana na MSE kriterijumu. Za FNN koriscen je OBS pruning. Za promene greske racuna se puna Hessian-ova ma-trica. Akaike-ova FPE omogucuje da se proceni generalizaciona greska za datu FNN, kada je poznat broj parametara. Da bi bilo moguce uporediti AR i NNAR modele uvedeno je pojacanje FPE, tj. odnos MSE za AR model i FPE za FNN, a validacija je izvedena za sve vokale i sve govornike. Isti proces izveden je i za govornike i nevokale koji su izgovarani u kontekstu reci ili van njih.
Signali govora
Vokalno-nazalni trakt je deo sistema za proizvodenje govora, cija se prenosna funkcija moze aproksimirati akustickim filtrom. Vazduh, pobuda iz pluca, prolazi kroz vokalno-nazalni trakt i, u zavisnosti od toga da li glasne zice vibriraju ili ne, formira se vokal ili nevokal. Zvuk koji se cuje kao govor nastaje zracenjem sa usana i iz nosa. Vokali su kva-ziperiodicni u duzem vremenskom periodu, pobuda je snazna, a glasne zice vibriraju. Kod ostalih fonema kvaziperiodicnost je zanemariva, pobuda je slab signal ili kombinacija takvog signala sa sumom.
Za obucavajuce skupove trenirani su AR-10 i FNN, strukture 103-1. Pruning je izveden OBS metodom sa maksimalno 20 iteracija retreninga po odbacivanju jednog parametra. Koriscen je algoritam
Rezultati
nnprune. Dobijene su NSSE za obucavajuci i test skup, i FPE. U radu su prikazane strukture koje zaustavljaju pruning dostizanjem minimal-nih vrednosti NSSEtEst i FPE. Izracunata je i NSSE za AR-10. Valida-cija je izvedena funkcijom nnvalid. Za nevokale racunato je pojacanje FPE za zene i za muskarce. Uvedena je mera rastojanja dva signala (u spektralnom domenu) i poredeni su spektri snage signala na izlazima neurona skrivenog sloja. Takode, izvedena je kroskorelaciona analiza i kumulativno sumiranje apsolutnih vrednosti kroskorelacionih signala za male distance.
Zakljucak
U radu je analizirana klasa FNN, strukture sa 10 ulaza, promenlji-vim brojem neurona u skrivenom sloju i jednim izlazom, za predikciju govornog signala, tj. fonema srpskog jezika. Metodologija izbora arhi-tektura sa dobrim generalizacionim osobinama, zasnovana na prunin-gu, omogucila je znatno smanjenje broja parametara modela i vecu tacnost, u odnosu na linearne AR modele. Granicne arhitekture odliku-ju se minimalnim brojem parametara u okviru zadate margine greske. Pri analizi vokala uocen je uticaj nevokalizovanih fonema koji su takode prediktovani FNN i AR modelima. Radi sagledavanja diskriminacio-nih osobina izabranih klasa modela razvijena je metoda visedimenzio-nog skaliranja zasnovana na novoj meri rastojanja. Analiza gubitka dis-kriminatornosti ukazuje na cinjenicu da FNN modeli za foneme u srp-skom jeziku imaju znatno vecu diskriminacionu snagu, sto ih cini upo-trebljivim u sirokoj klasi prepoznavanja govornih elemenata. Spektralna analiza pokazuje da su izlazni signali neurona skrivenog sloja dobro korelisani sa dominantnim formantnim karakteristikama ulaznog signala. Vremenska karakteristika ukazuje na slabu statisticku zavisnost ovih signala za niske redove kroskorelacione zavisnosti (do petog reda). Analize ukazuju na blagu prednost kriterijuma NSSEtEst u odnosu na FPE kriterijum, na nezavisnom signalu. U slucaju kratkih obucavaju-cih skupova FPE je prihvatljiv kriterijum.
Rezultati ukazuju na cinjenicu da predlozena klasa FNN modela srpskog jezika i izbor arhitektura sa najboljim generalizacionim svoj-stvima obezbeduju modele visoke tacnosti sa internom distribuiranom strukturom koja odgovara prirodnom vremensko-frekvencijskom sadr-zaju ulaznih signala, i visokih su diskriminaconih svojstava za isti broj parametara u odnosu na tradicionalne linerane modele.
Kljucne reci: AR model, neuronske mreze, govor.
Datum prijema clanka/Paper received on: 18. 12. 2013.
Datum dostavljanja ispravki rukopisa/Manuscript corrections submitted on: 06. 03. 2014.
Datum konacnog prihvatanja clanka za objavljivanje/Paper accepted for publishing on:
08. 03. 2014.