Научная статья на тему 'Statistical model of time series'

Statistical model of time series Текст научной статьи по специальности «Медицинские технологии»

CC BY
79
18
i Надоели баннеры? Вы всегда можете отключить рекламу.
Ключевые слова
ФАКТОРНЫЙ АНАЛИЗ / КЛАСТЕРНЫЙ АНАЛИЗ / МЕТОД МАКСИМАЛЬНОГО ПРАВДОПОДОБИЯ / ПУПИЛЛОГРАММА

Аннотация научной статьи по медицинским технологиям, автор научной работы — Garbaruk V.V., Fomenko V.N., Kupriyanov A.S.

Railway obligatory medical inspections are aimed at enhancing safety of train traffic as well as maintaining staff health and employability. In this work a method is suggested of rapid identification of drug dependent people by means of pupillograms (i. e., curves representing the time dependence of human pupil area as reaction to light stimulus). These curves are processed using factor analysis, data points being treated as variables. Correlation between data points appears as presence in each of them factors common for the whole process. Clustering experimental data is done using test of statistical significance. Two criterion variables are introduced: one of them depends on common factors whereas the other is related to individual factors. A class of pupillograms obtained for survey subjects having some pathologies has been revealed as a result of analysis.

i Надоели баннеры? Вы всегда можете отключить рекламу.
iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.
i Надоели баннеры? Вы всегда можете отключить рекламу.

Текст научной работы на тему «Statistical model of time series»

Statistical Model of Time Series

Garbaruk V. V., Fomenko V. N., Emperor Alexander I Petersburg State Transport University, St. Petersburg, Russia, vmkaf@pgups.ru

Kupriyanov A. S., OOO «T-Sistems RUS», St. Petersburg, Russia, akuprijob@rambler.ru

Abstract. Railway obligatory medical inspections are aimed at enhancing safety of train traffic as well as maintaining staff health and employability. In this work a method is suggested of rapid identification of drug dependent people by means of pupillograms (i. e., curves representing the time dependence of human pupil area as reaction to light stimulus). These curves are processed using factor analysis, data points being treated as variables. Correlation between data points appears as presence in each of them factors common for the whole process. Clustering experimental data is done using test of statistical significance. Two criterion variables are introduced: one of them depends on common factors whereas the other is related to individual factors. A class of pupillograms obtained for survey subjects having some pathologies has been revealed as a result of analysis.

Keywords: factor analysis, cluster analysis, maximum likelihood method, pupillograms.

Introduction

A is known the human eyes reflect the psychoemotional and physiological state of men. A number of researches were performed earlier which proved the relation of pupil characteristics to the state of tiredness. For instance, correlation between pupil contraction rate and the state of being without sleep for more than 24 hours has been demonstrated in report [1]. Thus medical examination based on pupillograms can be used to assess one's attentiveness and reaction rate as well as his validity. A pupillo-gram examination lasts 5 minutes, is non-invasive and harmless to patients [2]. For this reason the pupillogram examination is fit for institutions which must permanently monitor staff efficiency and find out persons being in drug or alcohol intoxication. In the present paper pupillograms are treated as time series. Their statistical processing allows one revealing facts of a psychotropic substance exposure upon the subject under examination.

An important task in using time series is classifying them into classes according to a specific criterion. Existence of an appreciable stochastic component in the time series disguising the classifying criterion can greatly complicate solution of the above-mentioned problem. The goal of the present paper is to suggest a method which, in a number of cases, enables one to eliminate the influence of the stochastic factor and more clearly distinguish an interesting feature of the time series. The method proposed is based on factor analysis [3-5] which allows extracting common components in data points of a time series, so-called common factors. All data points are expressed through common factors. The set of n random variables fill a domain in n dimensional space which is segment of a linear manifold with dimensionality equal to the number of common factors. Strong correlation between data points ensures that the number of common factors is not large. In such case the stochastic component in the process under investigation can be removed to a large extent and reliable conclusions can be drawn on whether the time series belongs to a specific class.

The present approach to take into account correlation between data points is an alternative to autocorrelation function method [6]. Its advantage is that it does not imply transition into frequency domain and, as a result, loss of clearness. Moreover, a non-stationary process is admissible too. In conventional approach the time-frequency transform would be necessary in that case what causes further difficulties.

Factor analysis as applied to stochastic processes

Let X(t) be the random process to be investigated observed at time moments t, (i = 1, 2, ..., n). Denote the corresponding process values (further on referred to as terms of the time series) as x. = X(t). The quantities mi = Mxi and ai = -JDxi stand for the expected values and rms deviations. Let R be the correlation matrix for random variables xi.

In factor analysis the matrix R is decomposed with respect to its eigenvectors as follows

R = V

i% 0 0 x2

0 ï

0

0

x.

VT

where V denotes a matrix with the eigenvectors of R as its columns, X{ are eigenvalues > X2 > ••• >Xn) • Let us choose the firstp eigenvectors from their complete set so that they make

n

a dominant contribution to the sum ^ Xk = n.

k=1

The following matrix is computed starting from the selected eigenvectors

R = V

(X1 0

0 x2

0 ï

0

0

x

V

(1)

p 7

where V has dimensions n x p and contains the chosen eigenvectors. Further, the eigenvectors of R are found, p ones of them with the greatest eigenvalues are selected, a new matrix R is calculated via Eq. (1) on their basis and so on. The iterations are repeated until convergence with desired accuracy is achieved. As a result the initial correlation matrix takes the following form

R = R + AR.

(2)

The diagonal elements hi = Ru, which are portions of the total variance ai of the x_ variables reproduced by the p main eigenvectors of R , are called communalities. They may essen-

Intellectual Technologies on Transport. 2017. No 3

tially differ from 1. The matrix AR represents residual correlations and variances.

Splitting the correlation matrix into two parts (2) is associated with the following representation of the x. variables

xi = mi + ai Z aikfk + ui > i =1 n :

k=1

(3)

where / are non-correlated normalized random quantities ("common factors"). The coefficients aik ("factor loadings") are

collinear with the eigenvectors of R with a factor of . The quantities ui ("individual factors") are statistically independent from the common factors.

The diagonal elements of AR have the following form

Du-

1 - hi =-- . They might not be small as compared to 1. How-

ai

ever, in accordance with the principles underlying factor analysis, the non-diagonal correlation coefficients of AR should be small because common factors have to take into account the main part of x. correlation:

C°v(xi : xj) r—— P Cov(ui: uj)

■ = -\jh1 j Z aikajk '

aia j

k=1

aia j

If communalities hi are about 1 (i. e., the individual fluctuations are sufficiently small), one can omit the individual corrections in Eq. (3) to a good accuracy. Then one obtains for terms of the time series the following approximate representation

xi ~ mi + ai Z aikfk. k=1

(4)

Eq. (4) means that the vectors (xi, X2,..., xn) are contained by a n dimensional domain (linear manifold) in the n dimensional linear vector space. The reduction of dimensionality of the space segment filled by the stochastic process is obviously related to correlations between data for different time moments. These correlations are described in factor analysis by introducing components common for the different terms of the time series, the common factors.

between the individual factors can lead to correlation coefficients far from zero because the variance of the individual factors are small too if the communalities are near unity. Residual covari-ance is defined as the difference between total covariance and its portion related to the common factors

Cov(u. , Uj ) = Cov( xi, xj ) - Cov j =

p

= Cov( xi: xj) - a a j Z aikajk.

k=1

The matrix C of coefficients entering Eq. (5) equals C =

II II"1

= Cov(u(-, uj ) .

The estimators of the factors fk and uii are obtained as values maximizing likelihood (5) under additional conditions

max ¥( /l,•••, fp: un)

p

xi = mi+Z aikfk +,. =1., n, k=1

where x. are experimental data.

One obtains the following system of equations for u. using Lagrange multipliers:

Z

j=1

pn

■ai ZZ aikaik aiCij k=1l=1

u j = xi- mi.

(6)

The common factor estimators are gotten from the relations

n

fk = Z Cijuj aiaik. (7)

i, j =1

However, it may occur to be rather cumbersome to straightforwardly apply formulae (6) and (7) because this would require to successively invert two large (possibly badly conditioned) matrices in order to compute Cj and then U.. The difficulty is overcome by passage to new variables w. instead of U. in the following manner

= Z Cov(ui: uj) • wj.

(8)

j =1

Point estimation of factors

Let us construct the point estimators of common and individual factors using the maximum likelihood method. To make the matter more explicit we assume that factors are distributed normally. As a result, the terms of the time series obey the same distribution rule. Then, having in mind that common factors are uncorrelated with each other and the individual ones, the likelihood can be written down as follows

¥( /1, — : fp: u1,"-: un) =

= N exp

1 P 2 1

- 2 Z fk - 2 Z Cijuiuj

2 k =1 2 i j =1

• i, j=1

(5)

with N as a normalizing factor. In Eq. (5) we have taken into account correlation between the individual factors. It must be done despite of the fact that the common factors do care for an important part of xi correlations. Really, residual correlation

The wi variables are calculated from the system of equations

Z Cov( xi: xj )wj = xi- mi.

(9)

iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.

j=1

The common factors are expressed through the new variables in the following way

fk = Z wiaiaik. i=1

(10)

It is to be noted that the covariance matrix Cov(u., Uj) may be even singular rather than badly conditioned. It is the case if exactly eigenvectors of the correlation matrix R are chosen as factor loadings (for instance, it is done so in the Principle Component Method). The C matrix in the expression for likelihood (5) does not exist in this case. Never the less formulae (8)-(10) which were used for computation of point estimators are still val-

id. They can be justified by a regularization procedure implying transfer of some amount of covariance from the individual factors to the common ones according to the following relations

Cov( f= Cov( f) + eE; ||Cov(m, , uj )||' = | |Cov(u,-, uj )|| - eE,

where E is identity matrix and e is a small parameter. The new matrices are not degenerate and they allow obtaining formulae (8)-(10). Passage to the limit e^0 is obvious because covariance Cov(ui, uj) is continuous with respect to e.

Clustering of stochastic processes

When operating with time series it is important to know if the current stochastic process represented by experimental data belongs to a process class. Let us assume that the following characteristics of the class are available: expected values and rms deviations (in fact, statistical estimates of them) and factor loadings of the time series terms. Here we solve the class membership problem using the test of significance algorithm and taking into consideration two reasons. The rms deviation of the observed values relative to their approximation through common factors should not be large if the process belongs to the given class and factor analysis performed earlier has shown a considerable contribution of common factors to correlation between terms of the time series. second, the magnitudes of common factor minimizing the rms deviations should not considerably exceed unity because common factors are normalized random variables.

Likelihood (5) prescribes distribution of two following statistics

X2 = X Quu

tj t J '

j

and

x 2 = z f2

k=1

(11)

(12)

Let us dwell on the first statistic. Denote as D the matrix with orthonormal eigenvectors of the matrix C as its columns. since C = Cov-1, the quadratic form (11) is positive definite, because the covariance matrix has positive eigenvalues. Let ui' be variables in terms of which the form (11) is diagonal:

= XKiUi ' i=1

(13)

Here k(- are eigenvalues of C. We have for the ui' variables

n

ui' =Z Djiuj. j=1

2 2

It follows from expression (13) that the xu statistic has a % distribution with n degrees of freedom. The variable (13) can be cast in the form

x2 = XK

i=1

n p

X Dji( xj - mj - a j x ajkfk) ,j=1 k=1

This relation allows treating as the weighted rms deviation of orthonormal linear forms of the terms of a time series

n

X Djjxj from the values predicted by common factors

J=1

np

(mj + aJ X ajkfk). k=1

j=1

As was said above, computation of the coefficients Cy can turn out to be difficult. One can avoid this problem by expressing statistic (11) through variables wi (see Eq. (8))

x2 = X wi Cov(u, uj )wj-

i, j =1

(14)

Formula (14) is also valid if the matrix Cov(u,-,uy) is degenerate. It is derived in that case by using the regularization procedure described in section 2.

Statistic (12) has a x distribution withp degrees of freedom because of independence of common factors.

Let Pu and Pf be confidence levels for both test statistics

with F(2*^ (x) being cumulative %2 distribution function with r

x

degrees of freedom. Then the tolerance domain of test of significance is determined by the system of inequalities

|X2 < K ■

KU ^ ^u '

[X f

< K

(15)

f

where quantiles Ku and Kf obey to equations

?(n).

- x2'(Ku )=pu; Fx(p}( Kf )=Pf.

An application of the outlined method

The method outlined above has been applied to investigation and clustering of the set of pupillograms, which give the dependence of pupillary diameter on time after a light stimulus. The consecutive frames of pupil are produced during 3 seconds at a rate of 50 frames per second. A set of 544 pupillograms of healthy people was investigated, each pupillogram containing readings for 155 time moments. A generic pupillogram is depicted in Fig. 1. The light flash takes place at zero time. One can learn more about pupillography in paper [7].

A factor analysis procedure has revealed three common factors with individual communalities not less than 0.99. Data were analyzed using criteria (15) with confident levels Pu = Pf = 0,95.

In Fig. 2 open circles depict the points with abscissas and

22

ordinates equal to the values of statistics xu and xf, correspondingly. The dashed lines show the critical values of Ku and Kf. These lines confine the tolerance rectangular domain. From the total number of 544 pupillograms 22 ones go beyond the tolerance domain being positioned near its boundary. That is in accordance with the confident levels adopted. 109 pupillograms of healthy people are only shown to make the figure clearer.

A statistical model has been built as described above on the basis of the normal pupillograms, i. e. those of healthy people. This model is used to detect pathologies of drug dependent patients suffering from intoxication and drug withdrawal. In Fig. 2 upright crosses depict points for pupillograms of peoples after drug withdrawal (totally 30 patients). The points shown as

Time, sec Fig. 1. A generic pupillogram

Fig. 2. The critical values of Ku and K/

slanted crosses relate to pupillograms of people with intoxication (20 surveyed persons). It is seen that 19 pupillograms from 20 ones made in condition of intoxication do not meet the tolerance domain. All 30 pupillograms of people after drug with drawal are far beyond the tolerance domain either.

Thus, one can certainly draw the conclusion that the method of identification of the specified pathologies suggested in this work is efficient. Note that both pathologies are well separated. However, creating statistical models for pupillograms with a specific pathology is necessary to completely solve this problem. Unfortunately, relevant experimental data are not available at the moment.

Concluding remarks

The approach to stochastic processes proposed in this work and based on factor analysis has the advantage that it can be used both for stationary and non-stationary time series and allows one avoiding time-frequency transformation. It is in contrast with the

Fourier transform method that information on the process class as a whole is contained in the factor loadings whereas information on a specific realization of the process is related to common factors which are normalized independent random variables. It allows mapping individual features of a process realization into a linear manifold with dimensionality which is equal to the number of common factors and therefore may be much less than dimensionality of the space spanned by experimental data.

One can conclude that the method suggested allows performing discriminant analysis of random processes. The example given in this paper proves its efficiency.

Statistical simulation of stochastic processes with preset characteristics could be another application of the method.

References

1. Dal Santo J. P., Tousman S. A., Shaw D. L., Gerstein J. K., Falzone J. J., Dal Santo N. J. Pupillometry & Trucking Fatigue, 22nd Pupil Colloquium, Atalanta, 1997, pp. 62-68.

2. Velkhover E. S., Ananin V. F. Pupillodiagnostics [Pupil-lodiagnostika], Moscow, UDN, 1991, 212 p.

3. Harman H. H. Modern Factor Analysis, Univ. Chicago Press, 1976, pp. 175-176.

4. Fabrigar L. R., Wegener D. T., MacCallum R. C., Stra-han E. J. Evaluating the Use of Exploratory Factor Analysis in Psychological Research, Psychol. Methods, 1999, no. 4 (3), pp. 272-299.

5. Subbarao C., Subbarao N. V., Chandu S. N. Characterisation of Groundwater Contamination using Factor Analysis. Environ. Geol., 1996, no. 28 (4), pp. 175-180.

6. Percival D. B., Andrew T. W. Spectral Analysis for Physical Applications: Multitaper and Conventional Univariate Techniques, Cambridge Univ. Press, 1993, pp. 190-195.

7. Fomenko V. N., Kypriyanov A. S. Matimatical Models of Pupillary Reaction of Human Eye (Pupillograms) [Matemat-icheskie modeli zrachkovyh reakcij glaza cheloveka (pupillo-

gramm)]. Proc. Petersburg Transp. Univ. [IzvestiiaPeterburgsko-go universitetaputei soobshcheniia], 2010, Vol. 4, pp. 220-230.

8. Dubrov A. M., Mhitarjan V. C., Troshin L. I. Multivariate Statistical Methods [Mnogomernye statisticheskie metody], Moscow, Finansy i Statistika, 2003, 352 p.

9. Zuev D. V. Synthesis of Object-based Neural Network of Image Recognition and its Application for Railway Automation Tasks [Sintez ob"ektnoj nejrosetevoj modeli raspoznavaniya obrazov i ee primenenie v zadachah zheleznodorozhnoj avtoma-tiki], St. Petersburg, 2013, 122 p.

10. Zakharov A. I., Zagaynov A. I., Khodakovsky V. A., Mul-tifractal Analysis: Identifying the Boundaries Application in the Study of Time Series, Intellectual Technologies on Transport, 2015, no. 3, pp. 24-29.

11. Nivorozhkina L. I. Multivariate Statistical Methods in Economics [Mnogomernye statisticheskie metody v ehkono-mike], Moscow, Infra-M, RIOR, 2017, 203 p.

Статистическая модель временных рядов

Гарбарук В. В., Фоменко В. Н., Петербургский государственный университет путей сообщения Императора Александра I, Санкт-Петербург, Россия, vmkaf@pgups.ru

Куприянов А. С., OOO «Т-Систем РУС», Санкт-Петербург, Россия, akuprijob@rambler.ru

Аннотация. Обязательные медицинские осмотры на железнодорожном транспорте проводятся с целью медицинского обеспечения безопасности движения поездов, сохранения здоровья и трудоспособности работников. В работе предложен метод быстрого определения наркозависимых людей по пупил-лограммам (кривым зависимости площади зрачка человека от времени в ответ на стимул). К таким кривым применяется факторный анализ, причем в качестве переменных рассматриваются временные сечения пупиллограмм. Корреляция этих сечений проявляется через наличие в них общих для всего процесса факторов. При кластеризации экспериментальных данных используется критерий значимости, причем вводятся два критерия: один содержит общие, а другой - индивидуальные факторы. В результате анализа четко выделяется класс пупиллограмм, полученных у обследуемых с патологиями.

Ключевые слова: факторный анализ, кластерный анализ, метод максимального правдоподобия, пупиллограмма.

Литература

1. Dal Santo J. P. Pupillometry & trucking fatigue / J. P. Dal Santo, S.A. Tousman, D. L. Shaw et al. // 22nd Pupil Colloquium. -Atalanta, 1997. P. 62-68.

2. Вельховер У С. Пупиллодиагностика / У С. Вельховер, В. Ф. Ананин. - М.: УДН, 1991. 212 с.

3. Harman H. H. Modern Factor Analysis / H. H. Harman. Chicago: Univ. Chicago Press, 1976. P. 175-176.

4. Fabrigar L. R. Evaluating the Use of Exploratory Factor Analysis in Psychological Research / L. R. Fabrigar, D. T. Wegener, R. C. MacCallum, E. J. Strahan // Psychol. Methods. 1999. № 4 (3). P. 272-299.

5. Subbarao C. Characterisation of groundwater contamination using factor analysis / C. Subbarao // Environ. Geol. 1996. № 28(4). P. 175-180.

6. Percival D. B. Spectral Analysis for Physical Applications: Multitaper and Conventional Univariate Techniques / D. B. Percival, T. W. Andrew. - Cambridge: Cambridge Univ. Press, 1993. P. 190-195.

7. Фоменко В. Н. Математические модели зрачковых реакций глаза человека (пупиллограмм) / В. Н. Фоменко, A. С. Куприянов // Изв. ПГУПС. 2010. Вып. 4. С. 220-230.

8. Дубров А. М. Многомерные статистические методы / А. М. Дубров, В. С. Мхитарян, Л. И. Трошин. - М.: Финансы и статистика, 2003. 352 с.

iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.

9. Зуев Д. В. Синтез объектной нейросетевой модели распознавания образов и ее применение в задачах железнодорожной автоматики: дис. ... канд. техн. наук: 05.13.18 / Д. В. Зуев. - СПб., 2013. 122 с.

10. Zakharov A. I. Multifractal analysis: identifying the boundaries application in the study of time series / A. I. Zakharov, A. I. Zagaynov, V.A. Khodakovsky // Интеллектуальные технологии на транспорте. 2015. № 3. С. 24-29.

11. Ниворожкина Л. И. Многомерные статистические методы в экономике / Л. И. Ниворожкина. - M.: Инфра-М; РИОР, 2017. 203 c.

i Надоели баннеры? Вы всегда можете отключить рекламу.