
IMPACT OF THE GLOTTAL SIGNAL ON THE PREDICTION OF SPEECH

Danijela D. Protic

General Staff of the Serbian Army, Department of Telecommunications and Information Technology (J-6), Centre for Applied Mathematics and Electronics, Belgrade, e-mail: adanijela@ptt.rs

DOI: 10.5937/vojtehg63-6357

FIELD: Telecommunications ARTICLE TYPE: Original Scientific Paper ARTICLE LANGUAGE: English

Summary:

In this paper, several linear and nonlinear techniques for speech processing based on AR, ARX, ARMAX, WLS and FNN models are proposed. The impact of the glottal wave on modelling is also shown in detail. GD, BPA and LM approximations are used for model training and optimization. A comparative experimental analysis of the five considered models is carried out based on the prediction of a speech signal. The results of training and testing are presented through the training and test errors for all of the given models.

Key words: Linear models, Prediction, Glottal signal, Feed-forward neural network, Speech.

Introduction

When speech occurs, the air from the lungs propagates along the trachea and the vocal tract to the lips, where it is radiated into the environment. Vibrations of the vocal cords change the airflow that passes through the glottal and vocal tracts, where the shapes of the nasal cavity, the tongue, the teeth and the lips determine the output wave, i.e. speech. Speech is classified as unvoiced or voiced, depending on the nature of the excitation (Burrows, 1996). For unvoiced speech, the vocal cords are wide apart and the air passes freely through the glottal tract, where a noise-like, low-power signal arises. The excitation is due to turbulence generated by the airflow passing through a narrow constriction and tends to be random in nature. For voiced speech, the excitation of the vocal tract originates at the glottis. When the vocal cords are close together, the air pressure causes them to vibrate and thus forms a strong signal, i.e. a vowel. This vibration is periodical, and its frequency (pitch) is controlled by the tension in the vocal cords.

The most popular technique for speech processing is Linear Prediction (LP). LP uses a source-filter arrangement to model the system: it assumes that the source is located at the glottis and that a linear filter can be used to model the frequency properties of the vocal tract. The main disadvantage of LP is that the source and the vocal tract filter are not decoupled in the analysis, so the LP filter combines the effects of the source and the vocal tract. Other approaches interpret the voiced speech signal using the Auto Regressive (AR) model, the Auto Regressive with eXogenous input (ARX) model, and the Auto Regressive Moving Average with eXogenous input (ARMAX) model. The AR model parameters are estimated for time series using variants of the Least-Squares (LS) method, which minimizes the summed squares of the errors that are assumed to be normally distributed. For multivariate data, the ARX model is used: the current output depends on previous outputs, previous and delayed inputs, as well as a white-noise disturbance value. A generalization of the ARX model, the ARMAX model, also includes the output error.

It is usually assumed that the response data are of equal quality and, therefore, have constant variance. If this assumption is violated, the Weighted Least Squares (WLS) algorithm can be used to improve the fitting process by including additional scale factors (weights). The weights determine how much each response value influences the final parameter estimate.

When the input/output dynamics of a system contains a nonlinear component, a common linear modelling procedure has to resort to nonlinear dynamic modelling. The most widely used nonlinear models for the prediction of speech are multilayered networks generally called Multi-Layer Perceptrons (MLPs). They allow nonlinear mappings by a learning procedure that consists of adjusting synaptic weights which are fully connected and arranged in layers (Sainath et al, 2011), (Pamucar, Borovic, 2012), (Milicevic, Zupac, 2012). MLPs have become very popular in solving various problems such as regression, classification, time series processing, identification and control of dynamical systems (Haykin, 1994), (Narendra, Parthasaranthy, 1990). An MLP is a feed-forward neural network (FNN) with one or more hidden layers between the input layer and the output layer. Feed-forward means that data flows in one direction, from the input layer to the output one. For FNNs having differentiable activation functions, there exists a computationally efficient method, called the Back-Propagation Algorithm (BPA), used for finding the derivatives of an error function with respect to the network weights. Typically, the BPA uses the gradient descent (GD) training algorithm. The network weights are moved along the negative of the gradient to find a minimum of the error function (Silva et al., 2008), (Wu et al., 2011). However, the GD is relatively slow and the network solution may become trapped in one of the local minima instead of the global minimum. For these reasons, other procedures, such as the Levenberg-Marquardt (LM) algorithm, are available to improve the standard BPA. The LM gives efficient convergence and better optimization than the GD (Riecke et al., 2009), (Shahin, Pitt, 2012). It combines the advantages of the GD method (that is, minimization along the direction of the gradient) with the Newton method (that is, using a quadratic model to speed up the process of finding the minimum of a function) (Levenberg, 1944), (Marquardt, 1963).

This paper presents the impact of the glottal signal on the prediction of speech based on five different models. The AR model parameters are estimated by a training procedure based on the LS method. The goal is to prove that a high-order model can improve modelling even though the glottal signal is not used for the prediction. Additionally, the WLS is used to demonstrate the influence of the weights on the LS. Furthermore, in order to obtain the vocal tract transfer function and the glottal source parameters, the ARX model is estimated. In this way, the influence of the glottal signal on the evaluation of the model should decrease the error. For the ARMAX model, a sample of the output error is used to improve the prediction; however, it does not influence modelling when a vowel is used for model estimation. Finally, the FNN with one hidden layer and the hyperbolic tangent activation function for all neurons is used for nonlinear modelling. The LM algorithm is applied for the model evaluation. The results show that the mapping function gives better results for the FNN model than for all the other models. The minimum training error is the estimation criterion for the model training. Finally, the models are tested and the test errors are used to compare the quality of prediction.

The article is organized as follows. The second section presents the optimal linear and nonlinear models. Linear prediction, the WLS, the influence of the glottal wave and the FNN learning are shown in detail: the LS and the weighted LS are presented, the GD and the BPA are described, and the principles of the LM method are explained. The results are given in the third section. Finally, the paper ends with some concluding remarks.

Linear and nonlinear parametric models

Although they are two mutually separated and independent processes, speech analysis and speech synthesis are often implemented simultaneously. The analytical process determines the characteristics of the excitation, the glottis and the vocal tract. The synthesis generates signals that can be used for speech or speaker recognition, to simulate or reject the side effects, etc. The analysis involves the phonetic features of the spoken content, but the level of the estimated error is high, and the assessment methodology encompasses a wide range of models with a high degree of freedom. In the synthesis, the excitation signal can be a pulse or noise, or may be generated by the Linear Prediction Coder (LPC), which is applied in order to ensure a high quality of speech, assuming that a speech sample is a linear combination of the previous samples. The LPC is carried out as follows: 1) the new model parameters are estimated, 2) the Mean Squares Error (MSE) is calculated to re-perform the synthesis, and 3) acceptable results are obtained by all-pole models, as will be explained in detail later in the paper. Thereafter, the spectrum of the excitation and the transfer function of the vocal tract are simulated. The main advantages of this technique are the automatic analysis of the original signal and the accuracy of the estimate. Still, there are discontinuities in all-pole modelling, because the models do not take into account the characteristics of nasals, plosives and fricatives, which introduce zeros into the transfer functions.

For that reason, linear and nonlinear modelling is presented here. The influence of the model order, the glottal signal and the disturbance factors on the speech prediction is shown.

Linear prediction

Linear prediction (LP) determines the value of the n-th sample of the signal y(n) based on the all-pole model. It is well known that the assumption of linearity does not exactly match the characteristics of speech. Nevertheless, a high-quality LP model has advantages over complex nonlinear models, such as a simple structure and a minimal prediction error. The AR model, which is the most commonly applied in the LP, is given by formula (1)

$y(n) + a_1 y(n-1) + \ldots + a_{n_a} y(n - n_a) = e(n)$ (1)

where y(n) is the modelled signal, $a_i$ ($i = 1, \ldots, n_a$) are the model parameters, and e(n) is an error. If an extra input, in this case the glottal signal, is also processed, the AR model expands to the ARX model; see (2).

$y(n) + a_1 y(n-1) + \ldots + a_{n_a} y(n - n_a) = b_1 u(n-1) + \ldots + b_{n_b} u(n - n_b) + e(n)$ (2)

where $b_i$ ($i = 1, \ldots, n_b$) are the exogenous parameters. The generalization of the model, known as the ARMAX model, also includes the error propagation; see (3).

$y(n) + a_1 y(n-1) + \ldots + a_{n_a} y(n - n_a) = b_1 u(n-1) + \ldots + b_{n_b} u(n - n_b) + e(n) + c_1 e(n-1) + \ldots + c_{n_c} e(n - n_c)$ (3)

where $c_i$ ($i = 1, \ldots, n_c$) are the MA parameters, which are neglected if vowels are processed, because the disturbances, if present in vowels at all, are insignificant compared to the signal.
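To make the estimation concrete, the following minimal sketch fits the AR model (1) and the ARX model (2) by ordinary least squares. It is written in Python with NumPy rather than the MATLAB functions used in the experiments below; the function name and the synthetic signals are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def fit_arx(y, u=None, na=14, nb=4):
    """Least-squares fit of the AR model (1) (u is None) or the ARX model (2).

    Rearranging (2) as y(n) = -a_1*y(n-1) - ... + b_1*u(n-1) + ... + e(n),
    the regressor rows hold -y(n-i) and u(n-i), so theta = [a_1..a_na, b_1..b_nb].
    """
    start = max(na, nb if u is not None else 0)
    rows, targets = [], []
    for n in range(start, len(y)):
        row = [-y[n - i] for i in range(1, na + 1)]        # AR part
        if u is not None:
            row += [u[n - i] for i in range(1, nb + 1)]    # exogenous (glottal) part
        rows.append(row)
        targets.append(y[n])
    Phi, Y = np.asarray(rows), np.asarray(targets)
    theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)        # minimizes sum of e(n)^2
    e = Y - Phi @ theta                                    # one-step prediction errors
    return theta, e

# Illustrative use with synthetic stand-ins for the speech and glottal signals:
rng = np.random.default_rng(0)
speech = np.sin(0.2 * np.arange(600)) + 0.01 * rng.standard_normal(600)
egg = np.cos(0.2 * np.arange(600))
theta_ar, e_ar = fit_arx(speech, na=25)                    # AR(25), speech only
theta_arx, e_arx = fit_arx(speech, egg, na=14, nb=4)       # ARX, with glottal input
```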

In this experiment, the training was carried out by changing the parameters with the BPA. The optimal step size was reached by the GD method (Haykin, 1994), (Svarer, 1995), based on the second-order expansion of the error given by formula (4)

$E \approx E_0 + \left(\frac{\partial E}{\partial u}\right)^T \delta u + \frac{1}{2}\, \delta u^T H\, \delta u$ (4)

where E is the error, $E_0$ is its approximation, u is a parameter vector, $\delta u$ is the parameter deviation, and H is the symmetric Hessian matrix of the second derivatives of E.

$u = \left[u_1, u_2, \ldots, u_n\right]^T$

$\frac{\partial E}{\partial u} = \left[\frac{\partial E}{\partial u_1}, \frac{\partial E}{\partial u_2}, \ldots, \frac{\partial E}{\partial u_n}\right]^T$

$H = \begin{bmatrix} \frac{\partial^2 E}{\partial u_1^2} & \frac{\partial^2 E}{\partial u_1 \partial u_2} & \cdots & \frac{\partial^2 E}{\partial u_1 \partial u_n} \\ \frac{\partial^2 E}{\partial u_2 \partial u_1} & \frac{\partial^2 E}{\partial u_2^2} & \cdots & \frac{\partial^2 E}{\partial u_2 \partial u_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 E}{\partial u_n \partial u_1} & \frac{\partial^2 E}{\partial u_n \partial u_2} & \cdots & \frac{\partial^2 E}{\partial u_n^2} \end{bmatrix}$

The parameter estimates are obtained in the following way

$\delta u = u^* - u = -H^{-1} \frac{\partial E}{\partial u}$ (5)

$u^* = u - H^{-1} \frac{\partial E}{\partial u}$ (6)

where $u^*$ is the estimated parameter vector. A problem that arises is finding $H^{-1}$. The computation of the inverse of the n-dimensional Hessian matrix requires on the order of $n^3$ operations, which is computationally demanding, so $H^{-1}$ has to be approximated. One robust but very simple method for the matrix approximation is known as the Levenberg-Marquardt algorithm (Le Cun et al., 1989), (Svarer, 1995), which is presented later in the paper.

Weighted Least-Squares

The WLS is a recursive algorithm with slowly decreasing weights, which is found to have a self-convergence property, i.e., it almost certainly converges to a certain random vector, irrespective of the control law design (Childers et al., 1995). This universal convergence result, combined with a method of random regularization, can easily be applied to construct a self-convergent and uniformly controllable estimated model and thus enables a general framework for adaptive control (Guo, 1996). The WLS is an efficient method that makes good use of small data sets, having the ability to provide different types of easily interpretable statistical intervals for estimation, prediction, calibration and optimization. Given a sequence of stochastic observation vectors $\varphi_t \in R^n$, let us consider the scalar process $y_t$ generated according to the following time-varying equation

$y_{t+1} = \theta_t^T \varphi_t + w_{t+1}$

The scalar $w_t$ is a disturbance term, $\theta_t \in R^n$ is a stochastic sequence of unknown parameter vectors, and $\varphi_t$ are the regressors. The LS fitting technique is the most commonly applied way to estimate the parameter $\theta$ by minimizing the sum of the squares of the residuals. The estimate is the minimizer of the following criterion

$J_t(\theta) = \frac{1}{2} \sum_{i=0}^{t} \left(y_{i+1} - \theta^T \varphi_i\right)^2 = \frac{1}{2} \sum_{i=1}^{t} e(i)^2$

When the squares of the residuals are used, outlying points can have a disproportionate effect on the fit. The WLS reflects the behavior of the random errors in the model by incorporating extra nonnegative constants, or weights, associated with each data point, into the fitting criterion. Optimizing the criterion to find the parameter estimates allows the weights to determine the contribution of each observation to the final parameter estimates. The WLS criterion based on the error function e(i) is given by formula (7)

$J_t = \frac{1}{2} \sum_{i=0}^{t} a_i\, e(i)^2$ (7)

where $a_i > 0$ is the weighting sequence, a so-called forgetting factor, which allows different measurements of interest. The forgetting factor is introduced to discount old data in favour of fresh information. The selection of its value is the user's choice, as discussed by Ljung and Soderstrom (1983), Goodwin and Sin (1984), and Campi (1994). The size of the weight indicates the precision of the information contained in the associated observation. The forgetting factor usually takes the exponential form $a_i = \lambda^{t-i}$, $0 < \lambda < 1$. Writing the criterion with an exponential forgetting factor gives

$J_t = \frac{1}{2} \sum_{i=0}^{t} \lambda^{t-i} e(i)^2$

Assuming that the non-stationary signal consists of stationary segments ($\lambda < 1$, $\lambda \approx 1$), the forgetting factor is:

$\lambda^t = e^{t \ln \lambda} = e^{t \ln(1 + (\lambda - 1))} \approx e^{-t(1-\lambda)} = e^{-t/\tau}, \qquad \tau = \frac{1}{1-\lambda}$ (8)

where $\tau$ is the effective memory of the algorithm, i.e. the memory length. For example, $\lambda = 0.95$ gives an effective memory of $\tau = 1/(1-0.95) = 20$ samples.

The WLS parameter estimation can easily be constructed so that the corresponding estimated model is almost surely self-convergent and controllable. Using weights that are inversely proportional to the variance yields the most precise parameter estimates possible (Ljung, Soderstrom, 1983), (Guo, 1996). Consider the following ARMAX model

$A(z) y_t = B(z) u_t + C(z) w_t, \quad t \ge 0$

$A(z) = 1 + a_1 z + \cdots + a_p z^p, \quad p \ge 0$

$B(z) = b_1 z + \cdots + b_q z^q, \quad q \ge 1$

$C(z) = 1 + c_1 z + \cdots + c_r z^r, \quad r \ge 0$

where $y_t$, $u_t$, and $w_t$ are the system output, input, and noise sequence, respectively, and A(z), B(z), and C(z) are polynomials in the backward-shift operator z with unknown coefficients and known upper bounds p, q, and r for the orders. To describe the WLS algorithm for estimating the unknown parameter vector

$\theta = \left[-a_1, \ldots, -a_p,\; b_1, \ldots, b_q,\; c_1, \ldots, c_r\right]^T$

the recursive algorithm is applied. It has the following form

$\theta_{t+1} = \theta_t + L_t \left(y_{t+1} - \theta_t^T \varphi_t\right)$

$L_t = \frac{P_t \varphi_t}{a_t^{-1} + \varphi_t^T P_t \varphi_t}$

$P_{t+1} = P_t - \frac{P_t \varphi_t \varphi_t^T P_t}{a_t^{-1} + \varphi_t^T P_t \varphi_t}$

$\varphi_t = \left[y_t \cdots y_{t-p+1}\;\; u_t \cdots u_{t-q+1}\;\; \omega_t \cdots \omega_{t-r+1}\right]^T$

$\omega_t = y_t - \theta_t^T \varphi_{t-1}, \quad t \ge 0$

where $a_t$ is the weighting sequence, and the initial values $\theta_0$ and $P_0 = \alpha I$ ($0 < \alpha < 1$) are chosen arbitrarily. Various versions of this algorithm have been studied by many authors (Lee et al., 1981), (Ljung and Soderstrom, 1983), (Campi, 1994), (Guo, 1996), (Macchi, 1986), (Widrow et al., 1976), (Kovacevic et al., 2000), (Jing, 2012). Their work aims at studying the performance of the algorithms in a stochastic framework. The following questions motivate almost all the papers pertaining to the performance analysis of adaptive identification algorithms: a) Is the algorithm able to keep the estimation error bounded? b) What does the estimation error depend on, and in what way?
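For illustration only, the following minimal sketch implements one step of the recursion above in Python/NumPy, assuming the reconstructed form of the gain and covariance updates; the function name, dimensions and initialization values are hypothetical.

```python
import numpy as np

def wls_step(theta, P, phi, y_next, a_t):
    """One recursive WLS update: gain L_t, parameter update theta_{t+1},
    and covariance update P_{t+1}, following the recursion above."""
    denom = 1.0 / a_t + phi @ P @ phi       # a_t^{-1} + phi^T P phi
    L = (P @ phi) / denom                   # gain vector L_t
    theta = theta + L * (y_next - theta @ phi)
    P = P - np.outer(P @ phi, phi @ P) / denom
    return theta, P

# Initialization as in the text: theta_0 arbitrary, P_0 = alpha * I, 0 < alpha < 1
dim = 19                                    # p + q + r, hypothetical
theta, P = np.zeros(dim), 0.5 * np.eye(dim)
# at each step t: theta, P = wls_step(theta, P, phi_t, y_next, a_t)
```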

The impact of the glottal signal on modelling

The noninvasive methodology for recording the signal that flows through the glottis, before it modulates into speech, is known as ElectroGlottoGraphy (EGG). The method examines the vibration of the vocal cords by measuring the impedance through the throat of the subject. The electrodes are placed outside, on the larynx. When the vocal cords are closed together, electricity passes through the person's neck and the impedance is low, while open vocal cords make that extremely difficult and the impedance is high. A change of the impedance indicates a change of the glottal flow.

According to Fant (1960), the speech wave is the response of the vocal tract filter system to the sound sources. This rule is known as the source-filter theory of speech production. For vowels, the source of sound is the regular vibration of the vocal cords, and the filter is the vocal tract tube between the larynx and the lips. Regular vibrations of the vocal cords result in the periodic excitation source, which is always in the larynx, usually in the glottis. A period is the duration of one glottal cycle (the opening and closing phase). The waveform of the sound is complex, i.e. its wave shape depends on the relationship between the various frequencies that it contains. In the source-filter theory, these frequencies (formants) are responses of the vocal tract filter. The literature suggests that at least a pair of poles is needed for each formant representation (10-16 poles), which is expected in the frequency range, and another pair of poles for the impact of the glottal flow (Kovacevic et al., 2000). The glottal-flow velocity can be thought of as a low-pass filtering of an impulse stream (Gutierrez-Osuma, 2011). The vowel 'a' and the corresponding glottal signal (egg), for a female subject during normal phonation, are presented in Fig. 1. The sampling frequency for the signals is fs = 10 kHz, so each sample is 0.1 ms apart and n = 300 samples correspond to a time period of 30 ms.

Figure 1 - Upper panel: time-domain speech signal (vowel 'a'). Lower panel: glottal flow waveform of the vowel 'a' (egg)

Direct observation of the glottal behaviour is rather difficult, which implies the development of computational procedures for the estimation of the glottal source directly from the speech signal. Some of the best known and most used models are the Rosenberg (Degottex, 2010), Liljencrants-Fant (de Oliviera Dias, 2012), Klatt (Klatt, Klatt, 1990) and Strube (Kovacevic et al., 2000) models; the latter is given by formula (9).

$u_g(t) = \begin{cases} \sin^2\left(\dfrac{\pi t}{2 T_s}\right), & 0 \le t < T_s \\[4pt] \cos\left(\dfrac{\pi (t - T_s)}{2 T_n}\right), & T_s \le t < T_{og} \\[4pt] 0, & T_{og} \le t < T_0 \end{cases} \qquad T_{og} = T_s + T_n, \quad T_0 = T_{og} + T_{cg}$ (9)

where $u_g(t)$ is the glottal flow, $T_0$ is the fundamental period, $T_{og}$ and $T_{cg}$ are the periods of the open and the closed phase of the glottal wave, respectively, and $T_s$ and $T_n$ indicate the slow growth phase ($T_s$) and the phase of fast decrease ($T_n$), which together make up the open-glottis phase ($T_{og}$). In this paper, however, the influence of the glottal signal obtained by the EGG on the prediction of the corresponding speech signal is examined. The polynomial model of the glottal flow is used as the exogenous part of the ARX and ARMAX models. It is also used to improve the training of the FNN.
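As an illustration, a minimal sketch of a Strube-type glottal pulse following the reconstructed formula (9) is given below; the phase durations are assumed values, not those of the recorded egg signal.

```python
import numpy as np

def strube_pulse(Ts, Tn, Tcg, fs=10_000):
    """One glottal cycle of the Strube-type model (9): a sin^2 rise over Ts,
    a cosine fall over Tn (open phase T_og = Ts + Tn), and zero flow over Tcg."""
    t_rise = np.arange(0.0, Ts, 1.0 / fs)
    t_fall = np.arange(Ts, Ts + Tn, 1.0 / fs)
    t_closed = np.arange(Ts + Tn, Ts + Tn + Tcg, 1.0 / fs)
    rise = np.sin(np.pi * t_rise / (2.0 * Ts)) ** 2
    fall = np.cos(np.pi * (t_fall - Ts) / (2.0 * Tn))
    return np.concatenate([rise, fall, np.zeros_like(t_closed)])

# Hypothetical phase durations for one ~7.5 ms cycle at fs = 10 kHz (as in the paper):
ug = strube_pulse(Ts=3e-3, Tn=1.5e-3, Tcg=3e-3)
```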

Feed-forward neural network learning

Each nonlinear system can be modeled by the dynamic parameter function

$g(y_t, \vartheta_t, t) = e(t)$

where $\vartheta_t^T = [-y_{t-1}, \ldots, -y_{t-n}]$ is the vector of n samples of the sequence y, e(t) is an error, and g is a parametric function known in advance (Svarer, 1995), (Arsenijevic, Milosavljevic, 2000). It has been shown that an FNN with three layers (input, hidden and output) and a sigmoidal-type nonlinearity can approximate any nonlinear function and generate any complex decision region needed for classification and recognition tasks (Azimi-Sadjadi, Liou, 1992), if the choice of the inputs, the dimensionality of the weight space and the transition of learning are properly suited. For the given inputs and weights, the output of the FNN is given by the following expression

$\hat{y}_i(w, W) = F_i\left(\sum_{j=1}^{q} W_{ij}\, f_j\left(\sum_{l=1}^{m} w_{jl}\, z_l + w_{j0}\right) + W_{i0}\right)$

where $\hat{y}_i$ is the output, w and W are the synaptic weight matrices of the hidden and the output layer, $f_j$ and $F_i$ are the activation functions of the hidden and the output layer, respectively, while q and m represent the numbers of nodes in the network (Arsenijevic, 2001).

The problem of the neural network learning can be seen as a function optimization problem. Let us consider the FNN with differentiable activation functions of both input variables and weights. Each unit computes a weighted sum of its inputs

$a_j = \sum_i w_{ji} z_i$

where $z_i$ is the activation which sends a connection to the unit j, and $w_{ji}$ is the weight associated with that connection. The sum is transformed by a nonlinear function g(·) to give the activation $z_j$ of the unit j in the form $z_j = g(a_j)$. The error function, which is a sum over all patterns in the training set, is defined for each pattern separately

$E = \sum_n E^n$

where E" = E"(y1.....yc). The goal is to evaluate derivatives of the error E"

with respect to the weights

$\frac{\partial E^n}{\partial w_{ji}} = \frac{\partial E^n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}$ (10)

where

$\delta_j \equiv \frac{\partial E^n}{\partial a_j}, \qquad \frac{\partial a_j}{\partial w_{ji}} = z_i$ (11)

which gives

$\frac{\partial E^n}{\partial w_{ji}} = \delta_j z_i$

For the output units, the error $\delta_k$ is given by the equation

$\delta_k = \frac{\partial E^n}{\partial a_k} = g'(a_k) \frac{\partial E^n}{\partial y_k}$

where $g'(a_k)$ is the derivative of the activation function and $\partial E^n / \partial y_k$ is the derivative of the error with respect to the output, while for the hidden units

$\delta_j = \frac{\partial E^n}{\partial a_j} = \sum_k \frac{\partial E^n}{\partial a_k} \frac{\partial a_k}{\partial a_j}$

which gives the back-propagation formula:

$\delta_j = g'(a_j) \sum_k w_{kj}\, \delta_k$

The $\delta$'s can be evaluated backward, since the $\delta$'s at the outputs are known.
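A minimal sketch of these formulas for a one-hidden-layer tanh network (the 18-3-1 structure used in the experiments below) is given here in Python/NumPy for a single pattern and a squared error $E^n = (y-t)^2/2$; all names and shapes are illustrative assumptions.

```python
import numpy as np

def backprop_step(x, t, w, w0, W, W0):
    """Deltas and gradients for E = 0.5*(y - t)^2 in a tanh network with one
    hidden layer, using the recursion above (tanh'(a) = 1 - tanh(a)^2)."""
    a_hid = w @ x + w0                      # weighted sums a_j of the hidden units
    z = np.tanh(a_hid)                      # hidden activations z_j = g(a_j)
    y = np.tanh(W @ z + W0)                 # network output
    delta_out = (1.0 - y**2) * (y - t)      # output delta: g'(a_k) * dE/dy_k
    delta_hid = (1.0 - z**2) * (W * delta_out)   # delta_j = g'(a_j) * w_kj * delta_k
    grad_w = np.outer(delta_hid, x)         # dE/dw_ji = delta_j * z_i (inputs here)
    grad_W = delta_out * z                  # dE/dW_kj = delta_k * z_j
    return grad_w, grad_W

# Shapes for the 18-3-1 network: x (18,), w (3, 18), w0 (3,), W (3,), W0 scalar.
```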

The BPA can also be applied for the calculation of other derivatives. Let us consider the evaluation of the Jacobian matrix, whose elements are given by the derivatives of the network outputs yk with respect to the network inputs xi

$J_{ki} = \frac{\partial y_k}{\partial x_i} = \sum_j \frac{\partial y_k}{\partial a_j} \frac{\partial a_j}{\partial x_i} = \sum_j w_{ji} \frac{\partial y_k}{\partial a_j} = \sum_j w_{ji}\, g'(a_j) \sum_l w_{lj} \frac{\partial y_k}{\partial a_l}$ (12)

To evaluate second order derivatives, let us consider the following error derivatives:

$\frac{\partial^2 E}{\partial w_{ji} \partial w_{lk}} = \sum_n \frac{\partial y^n}{\partial w_{ji}} \frac{\partial y^n}{\partial w_{lk}} + \sum_n \left(y^n - t^n\right) \frac{\partial^2 y^n}{\partial w_{ji} \partial w_{lk}}$ (13)

These are the elements of the Hessian matrix. If the network outputs $y^n$ are very close to the target values $t^n$, then the second term in (13) can be neglected, which gives the LM formula:

$\frac{\partial^2 E}{\partial w_{ji} \partial w_{lk}} = \sum_n \frac{\partial y^n}{\partial w_{ji}} \frac{\partial y^n}{\partial w_{lk}}$

The LM algorithm provides a numerical solution to the problem of minimizing a (generally nonlinear) function over the space of the parameters of the function (weights) (Kashyap, 1980), (Ljung, 1987), (Larsen, 1993), (Hansen, Rasmusen, 1994), (Fahlman, 1988); see (5)-(6). The LM basically consists of solving the equation

$(H + \lambda I)\, \delta = J^T E$

where $\lambda$ is the Levenberg damping factor, adjusted at each iteration to guide the optimization process, and $\delta$ is the weight update vector that shows how much the network weights should be changed to achieve a better solution. If the reduction of E is rapid, a smaller value of $\lambda$ brings the algorithm closer to the Gauss-Newton algorithm, whereas if an iteration gives an insufficient reduction in the residual, $\lambda$ can be increased, giving a step closer to the GD direction.
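A minimal sketch of one LM iteration under the outer-product approximation $H \approx J^T J$ is shown below; the sign convention takes e as the residual vector so that delta is a descent step, and the function names and damping factor are assumptions.

```python
import numpy as np

def lm_step(J, e, lam):
    """Solve (H + lam*I) * delta = -J^T e with H = J^T J (outer-product
    approximation of the Hessian), giving the LM weight update delta."""
    H = J.T @ J
    delta = np.linalg.solve(H + lam * np.eye(H.shape[0]), -(J.T @ e))
    return delta

def adapt_damping(lam, improved, factor=10.0):
    """Shrink lam after a successful step (towards Gauss-Newton), grow it
    after a failed one (towards the gradient-descent direction)."""
    return lam / factor if improved else lam * factor
```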

The problem of the parameter adjustment (see (13)) has been solved by Hassibi and Stork (1993). They used the outer product approximation to develop a computationally efficient procedure for approximating the inverse of the Hessian:

$H_N = \sum_{n=1}^{N} g^n \left(g^n\right)^T$

where N is the number of the patterns in the data set, and the vector g is the gradient of the error function. The sequential procedure for building up the Hessian is obtained by separating out the contribution from the data point N+1 to give:

$H_{N+1} = H_N + g^{N+1} \left(g^{N+1}\right)^T$

In order to evaluate the inverse Hessian, let us consider the matrix identity:

$(A + BC)^{-1} = A^{-1} - A^{-1} B \left(I + C A^{-1} B\right)^{-1} C A^{-1}$

where I is the identity matrix. Setting $A = H_N$, $B = g^{N+1}$ and $C = \left(g^{N+1}\right)^T$ gives

$H_{N+1}^{-1} = H_N^{-1} - \frac{H_N^{-1}\, g^{N+1} \left(g^{N+1}\right)^T H_N^{-1}}{1 + \left(g^{N+1}\right)^T H_N^{-1}\, g^{N+1}}$ (14)

The initial matrix $H_0$ is chosen to be $\alpha I$, where $\alpha$ is a small quantity, so that the algorithm actually finds the inverse of $H + \alpha I$.
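A minimal sketch of the sequential update (14), which never inverts a matrix directly, might look as follows; the parameter count and alpha are assumed values.

```python
import numpy as np

def update_inverse_hessian(H_inv, g):
    """Rank-one update (14): the inverse of H_N + g g^T obtained from the
    inverse of H_N via the matrix identity above (Sherman-Morrison form)."""
    Hg = H_inv @ g
    return H_inv - np.outer(Hg, Hg) / (1.0 + g @ Hg)

# Initialization as in the text: H_0 = alpha * I, so H_0^{-1} = (1/alpha) * I
alpha, dim = 1e-4, 19                       # hypothetical values
H_inv = np.eye(dim) / alpha
# for each data point n: H_inv = update_inverse_hessian(H_inv, g_n)
```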

The parameter updating procedure is carried out in the following way:

Step 1: propagate the input signal through the FNN in the forward direction to obtain the actual outputs for each training signal, at each layer.

Step 2: generate the error signal at the output of each layer for each node. At the output layer, this error is simply formed by comparing the actual outputs with the desired signal. For the other layers, the error is propagated backward through those layers with updated weights until the errors at the outputs of the lower layer with weights to be updated are generated.

Step 3: compute the matrices for updating the weights.

Step 4: determine the state of the particular node. If the input to this node is within the ramp region, then proceed; otherwise, there is no need for weight updating, so examine the next node.

Step 5: update the weight vector using the recursion, and repeat steps 4 and 5 for the next node until all the weight vectors in this layer are updated.

These steps are performed for all the layers several times for a given training set until the error converges to within an acceptable range. After the network updating is finished, the pruning of parameters is carried out in the following way

$\delta u_m + u_m = 0, \qquad e_m^T\, \delta u + u_m = 0$

where $u_m$ is the m-th parameter and $e_m$ is the unit vector of the same dimension as $\delta u$. The objective of this methodology is to prune the parameter $u_m$ that would cause the minimum increase of the error, in the following way

$\delta u = -\frac{\lambda}{2} H^{-1} e_m, \qquad \delta u = -\frac{u_m}{e_m^T H^{-1} e_m}\, H^{-1} e_m$

The Hessian matrix inverse is used to identify the least significant weights (Silva et al., 2008).
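A minimal sketch of one such pruning step is given below; the saliency expression $u_m^2 / (2 [H^{-1}]_{mm})$ is the standard Optimal Brain Surgeon criterion of Hassibi and Stork (1993), consistent with the update formula above, and the function name is an assumption.

```python
import numpy as np

def obs_prune_once(u, H_inv):
    """Remove the least significant parameter: pick m minimizing the saliency
    u_m^2 / (2*[H^-1]_mm), then adjust all parameters by
    du = -(u_m / [H^-1]_mm) * H_inv[:, m], which drives u_m to zero."""
    saliency = u**2 / (2.0 * np.diag(H_inv))
    m = int(np.argmin(saliency))            # least significant weight
    du = -(u[m] / H_inv[m, m]) * H_inv[:, m]
    return u + du, m
```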

Results

A comparative analysis of five different models, estimated on the basis of the speech and glottal signals, provides an understanding of the impact of the glottal signal on the estimation of the model parameters. The evaluation criterion is the minimum training error, which is presented graphically and in percentages for all the models. The parameters are estimated on 300 samples of the vowel 'a' and the corresponding glottal signal, pronounced by a female speaker during normal phonation. The sampling frequency is 10 kHz. The training set captures just under five glottal cycles. For the evaluation of the algorithms, MATLAB functions are applied. The hyperbolic tangent (tanh) function is the activation function for all neurons, because it is a rational function of exponentials, i.e. the first and second derivatives of tanh always exist (Wall, 1948). Since the output of tanh is limited to approximately [-1, 1] for all inputs within [-1, 1], the speech and glottal signals were also normalized to the same limits.

The model orders were as follows:

AR: na=25

ARX: na=14, nb=4

ARMAX: na=14, nb=4, nc=1

WLS: na=25, a=0.95

where na corresponds to the speech, nb to the glottal signal, nc to the output error, and a to the initial weight. The FNN with one hidden layer is trained using a training set that consists of 14 samples of speech and 4 samples of the glottal signal for the prediction of one speech sample. The hidden layer contains three neurons and the output layer a single neuron, i.e. the network structure is 18-3-1. Prior to the training, the weights were initialized to small random numbers. The LM training is used to progressively reduce the total network training error.
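For illustration, the following sketch shows how the 18-dimensional input vectors described above could be assembled and normalized to [-1, 1]; the helper names are assumptions, and MATLAB was used in the actual experiments.

```python
import numpy as np

def normalize(x):
    """Scale a signal to the [-1, 1] range matched to the tanh activations."""
    return 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0

def build_set(speech, egg, na=14, nb=4):
    """Stack 14 past speech samples and 4 past glottal samples into the
    18-dimensional inputs; the target is the next speech sample."""
    X, t = [], []
    for n in range(na, len(speech)):
        past_speech = speech[n - na:n][::-1]      # y(n-1) ... y(n-14)
        past_egg = egg[n - nb:n][::-1]            # u(n-1) ... u(n-4)
        X.append(np.concatenate([past_speech, past_egg]))
        t.append(speech[n])
    return np.asarray(X), np.asarray(t)

# First 300 samples for training, the following 300 for testing:
# X_tr, t_tr = build_set(normalize(speech[:300]), normalize(egg[:300]))
# X_te, t_te = build_set(normalize(speech[300:]), normalize(egg[300:]))
```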

Fig. 2 shows the training sets of the vowel 'a' and the egg signal, as well as the corresponding training errors for the WLS (uhatt), FNN (eNNARMAX), AR (eAR), ARX (eARX) and ARMAX (eARMAX) models, respectively.

Figure 2 - Vowel 'a' and the glottal egg signal (1), uhatt (2), eNNARMAX (3), eARX (4), eARMAX (5) and eAR (6) - training set

The speech and glottal signal sets contain 600 samples each and are divided into two equal parts: the training set is composed of the first 300 samples, while the test set consists of the following 300 samples.

For all the models, the errors indicate the opening and closing of the vocal cords. As expected, the training of the FNN gives the lowest error value in comparison to the other models; the input/output mapping function shows that the model does find the minimum training error. Also, the eARMAX shows a large impact of the glottal signal on the model evaluation, which is not the case for the eAR, eARX and uhatt.

After the training, the models are tested. The results, presented in Fig. 3, show that all the test errors are higher than the corresponding training errors. The uhatt is about four times higher, the eARMAX increases about three times, and the eAR and the eARX are doubled. The FNN shows an increase of the error of a little less than four times; however, this value is still significantly lower than the test errors of the other models.

Figure 3 - Vowel 'a' and the glottal egg signal (1), uhatt (2), eNNARMAX (3), eARX (4), eARMAX (5) and eAR (6) - test set

Table 1 summarizes the results of the minimum and maximum training and the test errors for the given models.

Table 1 - Minimum and maximum errors for the training and the test sets

Error      | Training set min | Training set max | Test set min | Test set max
uhatt      | -0.0646          | 0.1285           | -0.2646      | 0.2305
eARMAX     | -0.0614          | 0.0996           | -0.1777      | 0.1213
eAR        | -0.0737          | 0.1251           | -0.1956      | 0.1411
eARX       | -0.0738          | 0.1084           | -0.1893      | 0.1308
eNNARMAX   | -0.0366          | 0.0440           | -0.1161      | 0.0927

Results indicate the following:

The training error of the AR model (na=25), which has about twice as many parameters as common AR models (na=10-16), is similar to that of the ARX model (na=14, nb=4).

The uhatt and the eARMAX (na=14, nb=4, nc=1) are smaller than the errors of the AR and ARX models, which shows the impact of the weights on the LS modelling, as well as the impact of the glottal signal and the output error on the prediction of speech.

The training error of the three-layer FNN with 18 inputs (14 inputs for speech and four inputs for glottal signal samples) is almost half the training error of the other models.

Test errors are 3-4 times higher than the training errors for each model, which is particularly noticeable for uhatt.

The minima and maxima of the test errors for the AR, ARX and ARMAX models differ by ~1%.

The results show that the WLS model has the greatest volatility in testing; the uhatt is approximately two times higher than the errors of other linear models.

After testing, the FNN does not show changes in characteristics, although the test error is slightly higher.

Conclusion

This paper presents the impact of the glottal signal on the prediction of speech, which is based on linear and nonlinear models. AR, ARX, ARMAX models, the WLS algorithm and the FNN are used for the prediction. The training of the models is performed on a vowel 'a', pronounced by a female speaker, during normal phonation.

For the training, the BPA is used for fitting the model parameters. The parameter change is carried out by propagation along the negative of the gradient to find a minimum of the error function. The LM algorithm, which is used to speed up and ease the calculation of the Hessian matrix, showed a significant advantage over the GD algorithm. The LM combines the minimization along the negative direction of the gradient with the Newton method, which is based on a quadratic model, to speed up the process of finding the minimum of a function.

A comparative analysis of the training and the test errors shows that the high-order AR model as well as the WLS algorithm give higher errors when compared with the ARX and ARMAX models, for which the glottal signal influences the prediction. The training errors show that the impact of the glottal signal is higher in the phase of the open glottis than in the phase of the closed glottis. The LP models also show robustness. The results indicate that the high-order AR model can be an adequate substitute for the ARX model if the glottal signal is not available for the prediction. The WLS model improves the prediction by including the weight parameter, while the ARMAX model shows a significant reduction of the training errors because of the glottal signal and the output error, which were used for the training. The results also show the minimum error for the FNN model. The FNN with one hidden layer and tanh activation functions for all neurons showed that its input-output mapping gives a model that predicts the speech signal much more precisely than the linear models.

According to the results, if the glottal signal is available for the model training, the FNN should be used whenever possible, due to the precision of the estimates, although the sensitivity of the model is increased and the training takes longer. However, if this is not the case, the high-order AR model can be a replacement for the ARX and ARMAX models. The results of the WLS training show that, although the training gives satisfying results, the testing shows higher errors, so models based on the WLS should not be used for this purpose.

References

Akaike, H., 1969, Fitting Autoregressive Models for Prediction, Ann. Inst. Stat. Math.

Arsenijevic, D., 2001, Analiza neuronskih modela vokala srpskog jezika, Magistarski rad, Elektrotehnicki fakultet, Beograd.

Arsenijevic, D., Milosavljevic, M., 2000, O jednoj meri rastojanja govornih signala zasnovanoj na neuronskim modelima, Zbornik radova DOGS, Novi Sad.

Azimi-Sadjadi, M.R., Liou, R., 1992, Fast Learning Process of Multilayer Neural Networks Using Recursive Least Squares Method, IEEE Transaction on Signal Processing, Vol. 40, No. 2, pp.446-450.

Burrows, T.L., 1996, Speech Processing with Linear and Neural Network Models, PhD Thesis, Queens' College, Cambridge University, England.

Campi, M.C., 1994, Exponentially Weighted Least Squares Identification of Time-Varying Systems with White Disturbances, IEEE Transactions on Signal Processing, Vol. 42, No. 11, pp.2906-2914.

Childers, D.G., Principe, J.C., Thing, Z.T., 1995, Adaptive WRLS-VFF for Speech Analysis, IEEE Transactions on Speech and Audio Processing, pp.209-213.

Degottex, G., 2010, Glottal source and vocal-tract separation. Estimation of glottal parameters, voice transformation and synthesis using glottal model. PhD thesis, Universite Paris, France.

De Oliviera Dias, S., 2012, Estimation of the glottal pulse from speech or singing voice, Master's Thesis, School of Engineering of University of Porto.

Fahlman, S.E., 1988, Fast-learning variation on back-propagation: An empirical study, pp.38-51, Proceedings of the 1988 Connectionist Models Summer School, San Mateo, Pittsburgh, USA.

Fant, G., 1960, Acoustic Theory of Speech Production, Mouton, The Hague.

Guo, L., 1996, Self-Convergence of Weighted Least-Squares with Applications to the Stochastic Adaptive Control, IEEE Transaction on Automatic Control, Vol. 41, No. 1, pp. 79-89.

Gutierrez-Osuna, R., 2011, Introduction to speech processing, CSE@TAMU, Available at: http://research.cs.tamu.edu/prism/lectures/sp/l8.pdf

Hansen, L.K., Rasmusen, C.E., 1994, Pruning from adaptive regularization, Neural Computation vol. 6, no. 6, pp.1223-1232.

Hassibi, B., Stork, D.G., 1993, Second order derivatives for network pruning: optimal brain surgeon. In S.J. Hanson, J.D. Cowan, C.L. Giles (Eds.) Advances in Neural Information Processing Systems, Volume 5, pp.164-171.

Haykin, S., 1994, Neural networks: A comprehensive foundation, New York: Macmillan.


Jing, X., 2012, Robust adaptive learning of feed forward neural networks via LMI optimizations, Neural Networks 31, pp.33-45.

Kashyap, R.L., 1980, Inconsistency of the AIC Rule for Estimating the Order of AR Models, IEEE Transaction on Automatic Control. AC-25, pp.996-998.

Klatt, D., Klatt, L., 1990, Analysis, synthesis, and perception of voice quality variations among female and male talkers, Journal of the Acoustical Society of America, 87, pp.820-857.

Kovacevic, B., Milosavljevic, M., Veinovic, M., Markovic, M., 2000, Robusna digi-talna obrada govornog signala. Akademska misao, Beograd.

Larsen, J., 1993, Design of Neural Networks, Ph.D. Thesis, Electronic Institute, DTH, Lyngby.

Ljung, L., 1987, System Identification: Theory for the User, Prentice Hall Inc.

Ljung, L., Soderstrom, T., 1983, Theory and Practice of Recursive Identification, Cambridge, MA: MIT Press, p.36.

Le Cun, Y., Denker, J.S., Solla, S.A., 1989, Optimal Brain Damage, Advances in Neural Information Processing Systems 2, pp.598-605.

Levenberg, K., 1944, A Method for the Solution of Certain Problems in Least Squares, Quart. Appl. Math. Vol. 2, pp.164-168.

Marquardt, D., 1963, An Algorithm for Least-Squares Estimation of Nonlinear Parameters, SIAM J. Appl. Math., Vol. 11, pp.431-441.

Milicevic, M.R., Zupac, Z.G., 2012, Objektivni pristup odredivanju tezina kriterijuma, Vojnotehnicki glasnik/Military technical courier, Vol. 60, No. 1, pp.39-56.

Narendra, K.S., Parthasaranthy, K., 1990, Identification and control of dynamical systems using neural networks, IEEE Transactions on Neural Networks, Vol. 1, No. 1, pp.4-27.

Pamucar, S.D., Borovic, D.B., 2012, Optimizing models for production and inventory control using genetic algorithm, Vojnotehnicki glasnik/Military technical courier, Vol. 60, No. 1, pp.14-38.

Riecke, L., Esposito, F., Bonte, M., Formisano, E., 2009, Hearing illusory sound in noise: the timing of sensory-perceptual transformations in auditory cortex, Neuron 64, pp.550-561.

Sainath, T.N., Kingsbury, B., Ramabhadran, B., Fousek, P., Novak, P., Mohamed, A., 2011, Making deep belief networks effective for large vocabulary continuous speech recognition, In Automatic Speech Recognition and Understanding, pp.30-35, 2011 IEEE Workshop, 11-15 December 2011, Waikoloa, HI.

Shahin, A.J., Pitt, M.A., 2012, Alpha activity marking word boundaries mediates speech segmentation, European Journal of Neuroscience, 36, pp.3740-3748.

Silva, L., Marques de Sa, J., Alexandre, L.A., 2008, Data classification with multilayer perceptrons using a generalized error function, Neural Networks 21, pp.1302-1310.

Svarer, C., 1995, Neural networks for signal processing, Technical University of Denmark.

Wall, H.S., 1948, Analytic Theory of Continued Fractions, New York: Chelsea.

Wu, W., Wang, J., Cheng, M., Li., Z., 2011, Convergence analysis of online gradient method for BP neural networks, Neural Networks 24, pp.91-98.


THE IMPACT OF THE GLOTTAL SIGNAL ON THE PREDICTION OF SPEECH (summary in Russian)

FIELD: Telecommunications
ARTICLE TYPE: Original Scientific Paper
ARTICLE LANGUAGE: English

Summary:

This paper considers several linear and nonlinear techniques for speech processing based on the AR, ARX and ARMAX models and the WLS and FNN algorithms. The influence of the glottal signal on modelling is presented in detail. The GD, BPA and LM approximations are used for training and optimization. A comparative experimental analysis of the five considered models, based on the prediction of a speech signal, is carried out. The training and testing results, obtained through the errors produced during the training of the applied models, are presented.

Key words: linear models, prediction, glottal signal, neural network, speech.

THE IMPACT OF THE GLOTTAL SIGNAL ON THE PREDICTION OF SPEECH (summary in Serbian)

FIELD: Telecommunications
ARTICLE TYPE: Original Scientific Paper
ARTICLE LANGUAGE: English

Summary:

This paper presents several linear and nonlinear techniques for speech processing based on the AR, ARX and ARMAX models, the WLS algorithm and the FNN. The influence of the glottal signal is described in detail. The GD, BPA and LM approximations are used for training and optimization. A comparative experimental analysis of the five considered models, based on the prediction of a speech signal, is carried out. The training and testing results are presented through the errors obtained in the learning and training phases for each of the models.

Introduction

When speech is produced, the air from the lungs enters the throat via the trachea and excites the vocal cords, which change its flow; the newly formed signal passes through the glottal and vocal tracts, where the shapes of the oral and nasal cavities, the tongue and the teeth form the speech signal. If the vocal cords are apart, the air passes between them and a noise-like, low-power signal arises; if they are together, the thrust from the lungs makes them vibrate quasi-periodically, forming a strong signal, i.e. a vowel.

The best-known technique for speech processing is linear prediction (LP), which uses a source-filter arrangement to model the system: the excitation is assumed to be located at the glottis, while a linear filter is used to model the frequency characteristics of the vocal tract. The AR, ARX and ARMAX models are also used; their parameters are estimated from the samples of speech (AR), the glottal signal (X) and the error influence (MA). Although it is usually assumed that the response data have the same variance, if this assumption does not hold, the Weighted Least Squares (WLS) technique is used, in which the estimated error is corrected by weighting factors.

When the input-output dynamics of the system is nonlinear, i.e. when the system contains nonlinear components, nonlinear models such as the multilayer perceptron (MLP) are used; they enable modelling through a training procedure based on adjusting the synaptic weights, which are organized in layers and mutually connected. The MLP is a feed-forward neural network (FNN), which means that the mapping is performed in the direction from the input to the output. The network parameters are adjusted by the back-propagation of the error (BPA) following the gradient descent (GD) principle. The Levenberg-Marquardt (LM) algorithm is used to speed up this procedure; it reduces the number of operations in the adjustment of the network parameters by directly estimating the Hessian matrix. The training and test errors for all the models are used to compare the obtained results.

Linear and nonlinear parametric models

Speech signal analysis and synthesis are often carried out together. The analytical process determines the characteristics of the signal source, the glottis and the vocal tract. The synthesis yields signals that can be used for speech or speaker recognition, or for simulating or removing accompanying, unwanted effects on the synthesized signal. Signal analysis involves either the analysis of the phonetic characteristics or the analysis of the spoken content, but the level of the estimation error is high, and the assessment methodology involves a wide range of models with a large degree of freedom. In signal analysis there is always the problem of not knowing the source of the excitation signal, the glottal wave and the transfer function of the vocal tract. In signal synthesis, the excitation signal at the input of the synthesis filter can be divided into an impulse generator and a noise generator, or an excitation signal obtained by the LPC analysis of the speech signal can be used. This technique is used to ensure a high quality of speech, under the assumption that a speech sample is a linear combination of consecutive previous samples. A linear combination of n previous samples is formed, and the optimization is performed by minimizing the prediction error. A good LP model can be simple and still give satisfactory results, thus having an advantage over complex, nonlinear models. The most commonly used LP model for the prediction of the speech signal is the AR model. If the glottal signal is also available in the processing of the speech signal, an ARX model can be formed; the generalization of this model also includes the error propagation, so the ARMAX model is applied.


In addition to the classical LS model, the Weighted Least Squares (WLS) algorithm is used, in which the weighting factors improve the prediction error. In this way, the weights correct the error variance, which improves the estimation of the model parameters. The WLS is an efficient method that is well suited to small data sets. In this paper, the WLS algorithm solves the problems of convergence and uniformity.

The paper describes a noninvasive method for recording the signal from the glottis, known as electroglottography (EGG). The method is based on examining the vibrations of the vocal cords by measuring the impedance through the subject's neck. The electrodes are placed externally, on the larynx. When the vocal cords are closed, the current from the electrodes can pass through them and the impedance is low, while for open vocal cords the impedance is higher. A change in the impedance indicates a change in the characteristics of the glottis.

Direct observation of the glottal behaviour is difficult, which has led to various computational procedures that estimate the glottal excitation from the measured speech signal. One of the best-known models, the Strube model, is presented in the text. However, in the estimation of the considered models, the glottal signal itself was available, so this relation is given only as an example. In the paper, the glottal signal is used as the X part of the estimated ARX and ARMAX models, as well as for training the FNN.

Nonlinear systems can be modelled by a dynamic, nonlinear, parametric transfer function. According to the literature, an FNN with one hidden layer and sigmoidal transfer functions can generate solutions to complex problems such as classification, pattern recognition and the like, provided that the choice of the weights, dimensions and training rules is adequate. The problem of training a neural network can be seen as an optimization problem, in which the weights must be differentiable. The error is calculated for each weight and each layer separately, and their values are then changed by back-propagation. The LM algorithm is used to minimize the prediction error. In essence, the LM algorithm is a numerical solution of the problem of minimizing a nonlinear function over a vector of parameters. The algorithm uses a damping factor that brings the LM closer to the Gauss-Newton (GN) algorithm for a large error step, and closer to the GD for smaller error values. The value of the Hessian matrix is computed iteratively, as is the value of its inverse.

The parameter adjustment is carried out in five steps: propagation of the input signal towards the output, generation of the output signal based on the network structure, computation of the weight matrices, determination of the state of each node separately, and backward adjustment of the weight vectors. After the input-output mapping of the network is finished, pruning can be used, a technique that discards the surplus parameters of the model.

Results

In the experiments, 600 samples of the female phoneme 'a' were used for training the AR, ARX and ARMAX models, the WLS and the FNN. The model orders were: AR (na=25), ARX (na=14, nb=4), ARMAX (na=14, nb=4, nc=1). The high order of the AR model was applied to check whether there is a need for introducing the glottal signal into linear modelling. For the nonlinear model, the same orders were used as for the linear models, and the number of input data corresponded to the number of inputs of the linear models. The training and test errors are presented; they indicate that the AR and ARX models give similar results, as do the WLS and ARMAX models, while the error of the FNN is considerably smaller than the other errors, which is particularly noticeable on the test set.

Conclusion

The paper presents the impact of the glottal signal on the prediction of speech based on linear and nonlinear models. The AR, ARX and ARMAX models, the WLS algorithm and the FNN are used in the prediction. The models were trained on the vowel 'a' pronounced by a woman during normal phonation. For the training, the BPA was used to adjust the model parameters. The parameter change was carried out by propagation along the direction of the negative gradient, in order to minimize the error function. The LM algorithm, which was used to speed up and ease the computation of the Hessian matrix, showed significant advantages over the GD algorithm. The LM combines the minimization along the direction of the negative gradient with the Newton method.

A comparative analysis based on the training and test errors shows that the AR model with a large number of parameters and the WLS algorithm, which are based exclusively on speech, give a higher error when compared with the ARX and ARMAX models, in which the glottal signal influences the prediction. The training errors show that the impact of the glottal signal is greater in the open-glottis phase. The ARX models and the WLS improve the prediction and considerably reduce the error. The results also indicate higher accuracy, i.e. the minimum error, for the FNN. The FNN with one hidden layer and tanh activation functions for all neurons shows that its input-output mapping can predict the speech signal more precisely than all the other models.

Based on all of the above, it can be concluded that, if the glottal signal is available, the FNN should be used whenever possible, because of the precision of the estimates, even though the sensitivity of the model is increased and the training takes longer. If this is not the case, however, high-order AR models can be a replacement for the ARX or ARMAX models. The WLS training shows a small training error; during testing, however, the error grows considerably, so models based on the WLS should not be used for this purpose.

Key words: linear models, prediction, glottal signal, feed-forward neural network, speech.

Paper received on: 25. 06. 2014.
Manuscript corrections submitted on: 15. 08. 2014.
Paper accepted for publishing on: 17. 08. 2014.
