CC BY
Applied Econometrics (Прикладная эконометрика), №4(8) 2007

G. Antille

Descriptive Analysis of Matrix-Valued Time-Series

In this article we present a technique of data analysis applied to three-dimensional tables as, for instance, matrix-valued time-series. The main goal of the method is to describe the evolution of the statistical units with respect to time in a space summarizing the set of matrices. Moreover, our technique points out similar statistical units provided by a classification of their trajectories.

Large data sets are common in most sciences. In particular, factor analysis is widely used to study a set of observations of p variables measured on n statistical units. Nevertheless, few methods [Kroonenberg (1983)], [Escoufier (1985)], [Casin (1995), (1996)] have been developed to analyse repeated observations of n × p matrices. Without loss of generality we will consider repetitions of these matrices over time only.

As a global approach we conduct a three-step analysis of this type of data:

1. Analysing each data matrix to get an idea of the data structure at time t, t ∈ {1,...,T};

2. Constructing and analysing the T × p matrix whose rows contain the means or medians of each data matrix to get an idea of the global evolution of the process under study;

and, finally,

3. Finding a common space to describe the evolution of statistical units and relationships between variables with respect to time.

In the article we have mainly developed the third point with techniques based on Principal Component Analysis (PCA). PCA is also a way to perform the above mentioned first and second steps.

Section 2 is concerned with the construction of the common space onto which the data sets will be projected. The choice of that space is based on an optimality criterion applied to measures of dispersion. In Section 3 we present a descriptive method of analysis of matrix-valued time series, which consists in projecting, with respect to time, statistical units or variables on principal directions of the common space. We call these projections trajectories, and we classify them to exhibit similarities. In Section 4 we apply our method to a 26 × 8 × 26 matrix: 26 years of observations of 8 variables of rates of mortality in 26 Swiss cantons.

We define a data set as a three-dimensional matrix denoted by

$$K = (x_{ijt})_{i=1,\dots,n;\; j=1,\dots,p;\; t=1,\dots,T},$$

where $x_{ijt}$ represents the value of the jth variable for the ith statistical unit at time t.

Another way to define such a data set is given by $K = \{X_t \mid t = 1,\dots,T\}$, where $X_t$ is the $n \times p$ matrix of observations at time t.

The first step in the analysis of our data sets is to perform PCA on the T matrices Xt to explore the structure of each matrix in order to point out important changes in the structure of the data. In this article we do not discuss this type of question.

1. Introduction


The second step consists in applying PCA to $X_u$, the $T \times p$ matrix of the means over statistical units, whose tth row equals $(x_{\cdot jt})_{j=1,\dots,p}$. The goal of this PCA is to summarize and exhibit the global evolution of the T matrices in subspaces generated by the principal axes of $X_u$.
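The data layout and this second step can be sketched as follows. This is only an illustration on synthetic data: the array `K`, the sizes, and the helper `pca` are hypothetical names, and the PCA is computed through a singular value decomposition of the column-centred matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, p = 6, 10, 4                  # hypothetical sizes: time points, units, variables
K = rng.normal(size=(T, n, p))      # synthetic data set; K[t] plays the role of X_t

def pca(M):
    """PCA of M via SVD of the column-centred matrix.

    Returns the principal axes (columns), the coordinates of the rows
    on these axes, and the dispersion carried by each axis.
    """
    Mc = M - M.mean(axis=0)
    U, s, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Vt.T, U * s, s ** 2

# T x p matrix whose t-th row holds the means over statistical units at time t
X_u = K.mean(axis=1)

axes, scores, dispersion = pca(X_u)
share = dispersion / dispersion.sum()   # share of dispersion per principal axis
```

Plotting the first columns of `scores` against t would then display the global evolution of the T matrices.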

In the third step, PCA is conducted on $X = (x_{ij\cdot})$, the $n \times p$ mean matrix over time, to define directions of projection of statistical units or variables in order to study their evolution with respect to t. We remark that if $T = 1$, there is no common space to define, and usual PCA provides exactly what we are looking for. Let us recall that the optimum properties of PCA are direct consequences of the properties of the Rayleigh quotient

$$r_V(v) = \frac{v'Vv}{v'v},$$

where V is the variance-covariance matrix related to X.
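As a small numerical illustration of this optimality property, the sketch below checks on synthetic data that the Rayleigh quotient is maximised by the leading eigenvector of V and that the maximum equals the largest eigenvalue. The function name `rayleigh` and the data are hypothetical, and V is taken as the cross-product of the centred matrix, that is, the variance-covariance matrix up to the factor 1/n, which does not change the maximiser.

```python
import numpy as np

def rayleigh(V, v):
    """Rayleigh quotient v'Vv / v'v."""
    return float(v @ V @ v) / float(v @ v)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))        # synthetic n x p data matrix
Xc = X - X.mean(axis=0)             # column-centred data
V = Xc.T @ Xc                       # cross-product matrix (covariance up to 1/n)

eigvals, eigvecs = np.linalg.eigh(V)   # eigenvalues in ascending order
v1 = eigvecs[:, -1]                    # leading eigenvector
best = rayleigh(V, v1)

# no randomly drawn direction should do better than the leading eigenvector
trials = [rayleigh(V, rng.normal(size=5)) for _ in range(1000)]
```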

Hence the concern of this article is to extend the PCA approach for the cases when T > 1.

In this context the following question arises: how should V be defined when T matrices of size n × p have to be analysed simultaneously, given that the common space will be defined through a criterion based on a ratio similar to $r_V(v)$?

2. General Framework

Given a data set K there exist at least four ways to define the matrix V with respect to X, a matrix constructed by means of the set of matrices $X_t$.

Let us consider $1' = (1 \; 1 \; \dots \; 1)$, the identity matrix $I$, and $X_t^c = (I - 11'/n)X_t$. To simplify the notation let $X_t = X_t^c$ and $V_t = (X_t^c)'X_t^c$.
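The centring operator can be checked numerically. The sketch below, on a synthetic matrix with hypothetical sizes, verifies that multiplying by I - 11'/n subtracts the column means, and then forms the cross-product matrix for one time point.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 7, 3
X_t = rng.normal(size=(n, p))        # synthetic observation matrix at one time t

one = np.ones((n, 1))                # column vector of ones
C = np.eye(n) - one @ one.T / n      # centring operator I - 11'/n

X_c = C @ X_t                        # centred matrix
V_t = X_c.T @ X_c                    # p x p cross-product matrix of the centred data
```

In practice `X_t - X_t.mean(axis=0)` gives the same result without building the n × n operator.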

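1. $X = (X_1' \; X_2' \; \cdots \; X_T')'$, with

$$W_1 = X'X = \sum_t X_t'X_t = \sum_t V_t.$$

2. $X = (X_1 \; X_2 \; \cdots \; X_T)$, with

$$W_2 = X'X = \begin{pmatrix} X_1'X_1 & X_1'X_2 & \cdots & X_1'X_T \\ X_2'X_1 & X_2'X_2 & \cdots & X_2'X_T \\ \vdots & \vdots & \ddots & \vdots \\ X_T'X_1 & X_T'X_2 & \cdots & X_T'X_T \end{pmatrix}.$$

3. $X = \begin{pmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_T \end{pmatrix}$, with

$$W_3 = X'X = \begin{pmatrix} X_1'X_1 & 0 & \cdots & 0 \\ 0 & X_2'X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_T'X_T \end{pmatrix}.$$

4. $X = \sum_t c_t X_t$, with

$$W_4 = X'X = \sum_{k,j} c_k c_j X_k'X_j.$$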

Given V, let $\Gamma = \{v_1, v_2, \dots, v_r\}$ be a set of r orthogonal vectors and let us define $\varphi(\Gamma, V) = \sum_{k=1}^{r} r_V(v_k)$ as the sum of the squared lengths of the projections of the rows of X on $v_k$, $k = 1, 2, \dots, r$. In cases 1 and 4, $v_k \in \mathbb{R}^p$ and $\varphi$ defines a global measure of dispersion of the data set captured by the space generated by the vectors of $\Gamma$.
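A small numerical sketch of the four constructions of X, on synthetic data with hypothetical sizes, makes the dimensions explicit: stacking the matrices or taking a weighted sum yields a p × p cross-product matrix, whereas juxtaposing them side by side or on a block diagonal yields a pT × pT matrix. The weights `c` are set to 1/T purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n, p = 3, 5, 2
Xs = [rng.normal(size=(n, p)) for _ in range(T)]
Xs = [X - X.mean(axis=0) for X in Xs]        # centred X_t
c = np.full(T, 1.0 / T)                      # illustrative weights c_t

X1 = np.vstack(Xs)                           # case 1: (nT) x p, matrices stacked
X2 = np.hstack(Xs)                           # case 2: n x (pT), matrices side by side
X3 = np.zeros((n * T, p * T))                # case 3: block-diagonal arrangement
for t, X in enumerate(Xs):
    X3[t * n:(t + 1) * n, t * p:(t + 1) * p] = X
X4 = sum(ct * X for ct, X in zip(c, Xs))     # case 4: n x p weighted sum

W = {k: X.T @ X for k, X in {1: X1, 2: X2, 3: X3, 4: X4}.items()}
shapes = {k: Wk.shape for k, Wk in W.items()}
```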

At any time t, $\varphi(\Gamma, V_t)$ is the dispersion of the data matrix $X_t$ captured by the space generated by $\Gamma$. As

$$\varphi(\Gamma, W_1) = \sum_t \varphi(\Gamma, V_t),$$

$\frac{1}{T}\varphi(\Gamma, W_1)$ measures the mean dispersion of the data set captured by the space generated by the set $\Gamma$.
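The additivity of the dispersion over time in case 1 can be verified numerically. In the sketch below the function `phi` is a hypothetical helper implementing the sum of Rayleigh quotients, the data are synthetic, and the orthonormal directions are drawn at random through a QR decomposition.

```python
import numpy as np

def phi(Gamma, V):
    """Sum of the Rayleigh quotients of V over the columns of Gamma."""
    return sum(float(v @ V @ v) / float(v @ v) for v in Gamma.T)

rng = np.random.default_rng(4)
T, n, p, r = 4, 8, 5, 2
Xs = [rng.normal(size=(n, p)) for _ in range(T)]
Xs = [X - X.mean(axis=0) for X in Xs]              # centred X_t
Vs = [X.T @ X for X in Xs]                         # V_t
W1 = sum(Vs)                                       # cross-product of the stacked matrix

Gamma, _ = np.linalg.qr(rng.normal(size=(p, r)))   # r orthonormal directions in R^p

total = phi(Gamma, W1)
per_time = [phi(Gamma, V) for V in Vs]
mean_dispersion = total / T                        # mean dispersion captured per time point
```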

Cases 2 and 3 are not relevant to our problem, as the dimension of the projection directions does not match that of the original data. In Case 4 we have

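$$W_4 = \sum_t c_t^2 V_t + \sum_{i \neq j} c_i c_j X_i'X_j,$$

and we can write

$$\varphi(\Gamma, W_4) = \sum_t c_t^2 \varphi(\Gamma, V_t) + \sum_{i \neq j} c_i c_j \varphi(\Gamma, X_i'X_j),$$

which shows that the dispersion captured by $\Gamma$ is composed of a within-year part and a part between every couple of years.

Moreover, if $c_t = \frac{1}{T}$, then

$$\varphi(\Gamma, W_4) = \frac{1}{T^2}\Big(\varphi(\Gamma, W_1) + \sum_{i \neq j} \varphi(\Gamma, X_i'X_j)\Big).$$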
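The within/between decomposition for equal weights can be checked on synthetic data. The sketch assumes weights equal to 1/T, so that the weighted sum is the mean matrix over time; `phi` is a hypothetical helper computing the sum of Rayleigh quotients and all sizes are arbitrary.

```python
import numpy as np

def phi(Gamma, M):
    """Sum over the columns v of Gamma of v'Mv / v'v."""
    return sum(float(v @ M @ v) / float(v @ v) for v in Gamma.T)

rng = np.random.default_rng(5)
T, n, p, r = 4, 8, 5, 2
Xs = [rng.normal(size=(n, p)) for _ in range(T)]
Xs = [X - X.mean(axis=0) for X in Xs]              # centred X_t
Gamma, _ = np.linalg.qr(rng.normal(size=(p, r)))   # r orthonormal directions

X_mean = sum(Xs) / T                               # weighted sum with weights 1/T
W4 = X_mean.T @ X_mean
W1 = sum(X.T @ X for X in Xs)

within = phi(Gamma, W1)                            # within-year part
between = sum(phi(Gamma, Xs[i].T @ Xs[j])          # part between every couple of years
              for i in range(T) for j in range(T) if i != j)

lhs = phi(Gamma, W4)
rhs = (within + between) / T ** 2
```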

As in standard PCA we can consider a dual approach based on the study of the columns instead of the rows of the data matrices, and we define, for $\Delta = \{u_1, u_2, \dots, u_s\}$, a set of orthogonal vectors, $\psi(\Delta, V') = \sum_{k=1}^{s} r_{V'}(u_k)$.

In that framework we should solve the following optimisation problem (OP):

$$\max_{\Gamma} \varphi(\Gamma, V) \quad \text{and} \quad \max_{\Delta} \psi(\Delta, V')$$

to get the directions of projections of the statistical units and the variables on their respective optimal subspace.

In fact, essentially for computational reasons, we consider a slightly different optimisation problem and we solve OP sequentially. We start by solving OP with $\Gamma$ and $\Delta$ each containing one vector. Then we solve OP, with $\Gamma$ and $\Delta$ still containing only one element, under the constraint that the new solution is orthogonal to the previous one, and so on.

It is well known that such solutions are given by the singular value decomposition of the matrix X given above. The eigenvectors obtained generate the common spaces $\Gamma_c$ and $\Delta_c$.
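A sketch of this equivalence on a synthetic matrix: the right and left singular vectors are eigenvectors of X'X and XX' respectively, and the second direction found sequentially, by maximising the Rayleigh quotient on the orthogonal complement of the first, coincides with the second right singular vector up to sign. The names `D` and `G` follow the notation used for the eigenvector matrices in Section 3; sizes and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, r = 12, 5, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                 # centred synthetic matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
D = Vt[:r].T                           # p x r, eigenvectors of X'X
G = U[:, :r]                           # n x r, eigenvectors of XX'

# Sequential step: maximise the Rayleigh quotient of X'X among
# vectors orthogonal to the first direction d1.
V = X.T @ X
d1 = D[:, 0]
P_perp = np.eye(p) - np.outer(d1, d1)  # projector on the orthogonal complement of d1
w, Q = np.linalg.eigh(P_perp @ V @ P_perp)
d2_sequential = Q[:, -1]               # leading eigenvector of the deflated problem
```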

Classical methods such as generalized PCA, generalized canonical analysis, or the STATIS method fit into that general framework [Antille (2001)].

3. Trajectories

In the previous section we proposed a way to obtain a common space to represent the data set.

In the common space $\Gamma_c$, at any time t, the statistical units have their respective positions given by $X_t D$, where D is the $p \times r$ matrix of eigenvectors of $X'X$. Similarly, at any time t, $X_t'G$ gives the coordinates of the variables in $\Delta_c$, where G is the $n \times r$ matrix of eigenvectors of $XX'$, and where r, with $r \le \mathrm{rank}(X)$, is equal to the number of eigenvectors we select to describe the data.


A k-dimensional trajectory of the ith statistical unit with respect to time is defined by $P_i = [p_{i1}; p_{i2}; \dots; p_{iT}]$, where $p_{it} = X_t^{(i)} D^{(k)}$, $X_t^{(i)}$ is the ith row of $X_t$ and $D^{(k)}$ is a $p \times k$ sub-matrix of D; the coordinates of the tth vertex of that trajectory are given by $p_{it}$. As k can be seen as a degree of freedom left to the analyst, there are as many as 2k trajectories; some of them are interesting for graphical purposes, others for clustering statistical units.
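The construction of trajectories can be sketched as follows, assuming that the common space is obtained from the mean matrix over time and that the sub-matrix of D consists of its first k columns. The data are synthetic and the names `K`, `D_k` and `P` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
T, n, p, k = 5, 6, 4, 2
K = rng.normal(size=(T, n, p))             # synthetic data set; K[t] plays the role of X_t
K = K - K.mean(axis=1, keepdims=True)      # centre the columns of each X_t

X_mean = K.mean(axis=0)                    # n x p mean matrix over time
_, _, Vt = np.linalg.svd(X_mean, full_matrices=False)
D_k = Vt[:k].T                             # p x k: first k principal directions

# P[i, t] is the t-th vertex of the trajectory of unit i:
# the i-th row of X_t multiplied by D_k.
P = np.einsum('tnp,pk->ntk', K, D_k)

trajectory_of_unit_0 = P[0]                # T x k array of vertices
```

With k = 1 each trajectory can be plotted against time; with k = 2 it is a polygonal line in the plane of the two chosen axes.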

Among the one-dimensional or two-dimensional trajectories only those corresponding to the largest singular values have a statistical interpretation. The one-dimensional trajectories should be plotted versus time, and graphical comparisons are quite easy (see Figures 4 and 5). The two-dimensional trajectories have to be plotted in the plane generated by two chosen principal axes (see Figure 3).

In order to compare trajectories we propose two distinct points of view, a location and an evolution one.

In the location approach, comparisons are based on distances between vertices of trajectories i and j, defined by

$$d_p(i,j) = \|P_i - P_j\|_p = \Big(\sum_t \|p_{it} - p_{jt}\|_p^p\Big)^{1/p},$$

$\|\cdot\|_p$ being the $L_p$ norm. In this case two trajectories are equal if they match exactly.

In the evolution approach, proximities are based on distances between $e^q_{it}$, the lags of order q for trajectory i, and $e^q_{jt}$, the corresponding lags for j, where

$$e^q_{it} = p_{it} - p_{i,t-q}, \quad t = q+1,\dots,T, \quad 1 \le q \le T-1,$$

with too large values of q being meaningless. In this case two trajectories are similar if they are linked by a translation.
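The two points of view can be contrasted on a toy example. The sketch below assumes that the distance between two trajectories is the L_p norm of all vertex differences taken together; the function names are hypothetical. Two trajectories that differ by a translation are far apart in the location sense but identical in the evolution sense.

```python
import numpy as np

def location_distance(Pi, Pj, p=2):
    """L_p distance between two trajectories given as T x k arrays of vertices."""
    return float((np.abs(Pi - Pj) ** p).sum() ** (1.0 / p))

def lags(Pi, q=1):
    """Lags of order q: differences p_t - p_{t-q} for t = q+1, ..., T."""
    return Pi[q:] - Pi[:-q]

def evolution_distance(Pi, Pj, q=1, p=2):
    """L_p distance between the lags of order q of two trajectories."""
    return location_distance(lags(Pi, q), lags(Pj, q), p)

rng = np.random.default_rng(8)
T, k = 5, 2
P_a = rng.normal(size=(T, k))              # synthetic trajectory
P_b = P_a + np.array([1.0, -2.0])          # same shape, translated

d_location = location_distance(P_a, P_b)   # sqrt(T * (1 + 4)) = 5 for T = 5
d_evolution = evolution_distance(P_a, P_b) # 0: the lags are unaffected by translation
```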

As can easily be seen, graphical presentations of trajectories are often useless, as there are too many overlaps or simply too many trajectories on the plot. Clustering trajectories provides a way to detect similar statistical units with respect to the defined principal axes.

4. Application

As an illustration of our descriptive approach to analysing matrix-valued time-series we study the evolution of 8 causes of mortality in the 26 Swiss cantons over 26 years. As there exist important geographical, economic, and cultural differences between cantons, we expect the analysis of our data set, a 26 × 8 × 26 matrix, to point out these differences. Moreover, the sizes of the populations are also very different, so we consider the percentages of deaths due to infection, tumour, diabetes, heart disease, respiration, accident, suicide, and others for each canton. To construct the common space we choose the mean matrix over time and perform a PCA on that matrix after standardization. In that case the Kaiser criterion implies that only the first three components are of interest; they capture 74.11% of the total dispersion. Eigenvectors, correlations with the axes, and contributions of the variables to the construction of the axes are given in Table 1. As the computation was performed on a standardized matrix, correlations are equal to the coordinates of the variables on the corresponding axis. Figure 1 contains the representation of the variables in the common space of dimension two. As can be seen, the first axis opposes heart disease to infection and other causes of death, and the second axis opposes accident to tumour and diabetes.
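The quantities reported for such an analysis can be computed as in the following sketch. It uses a synthetic 26 × 8 matrix as a stand-in for the mean matrix over time (the mortality data themselves are not reproduced here), applies the Kaiser criterion in its usual form (keep the components whose eigenvalue of the correlation matrix exceeds 1), and derives the correlations of the variables with the axes and their contributions in percent. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 26, 8
# synthetic stand-in for the n x p mean matrix over time
X_mean = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

Z = (X_mean - X_mean.mean(axis=0)) / X_mean.std(axis=0)   # standardized columns
R = Z.T @ Z / n                                           # correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                         # sort in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

keep = eigvals > 1.0                                      # Kaiser criterion
share = eigvals[keep].sum() / eigvals.sum()               # dispersion captured by kept axes

loadings = eigvecs * np.sqrt(eigvals)                     # correlations variable / axis
contrib = 100.0 * eigvecs ** 2                            # contributions (%) per axis
scores = Z @ eigvecs                                      # coordinates of the units
```

Each column of `contrib` sums to 100, which matches the way contributions are reported in Table 1.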

Table 1. Eigenvectors, correlations, and contributions (%) of the variables

                         CP1                      CP2                      CP3
Variable         Eigenv.  Corr.  Contr.   Eigenv.  Corr.  Contr.   Eigenv.  Corr.  Contr.
Infection         -.45    -.81   19.9       .13     .16    1.64      .03     .03     .06
Tumour            -.34    -.61   11.27     -.49    -.60   24.37      .25     .26    5.97
Diabetes           .30     .54    8.72     -.37    -.45   13.57      .44     .47   19.00
Heart disease      .52     .94   26.71      .14     .17    2.05     -.16    -.18    2.72
Respiration       -.26    -.47    6.82     -.25    -.30    6.18     -.19    -.20    3.50
Accident          -.14    -.25    1.85      .70     .84   47.63      .15     .16    2.14
Suicide           -.10    -.17     .93     -.17    -.21    2.98     -.80    -.87   65.24
Others            -.49    -.88   23.8       .13     .15    1.58      .12     .13    1.37

The locations of the cantons (the list of abbreviations is provided in the appendix to the paper) on the first PCA plane provide information on the similarities of causes of death; for instance, on the first axis we observe that GE and VS have the highest rate of mortality due to infection, as pointed out in Figure 1.

The main interest of this descriptive method lies in the possibility of following graphically the evolution of cantons with respect to time and of making comparisons. The weakness of the method is the large amount of information that such projections may contain. Drawing all the trajectories does not make sense as, usually, they will overlap, and the figure will be almost covered by lines. Drawing trajectories for a few cantons that seem similar in the compromise space allows exhibiting differences, as shown in Figure 3 for GE and VS on the first PCA plane. Drawing one-dimensional trajectories, for example first PCA versus time or second PCA versus time, is sometimes more informative (as shown in Figure 4 and Figure 5, again for GE and VS). Differences are obvious.

Figure 1. Plot of the variables on the first PCA plane

Figure 2. Plot of the cantons on the first PCA plane

Figure 3. Trajectories of GE and VS on the first PC plane

These plots provide information to compare cantons or, more generally, to compare statistical units.

Plots of variables versus time allow observing the evolution of causes of death, as shown in Figure 6, where we see the increase of diabetes. The trajectory of the infections changed sharply around 1985, when AIDS started to be counted as an infection.

Figure 4. Trajectories of GE and VS, first PC versus time

Figure 5. Trajectories of GE and VS, second PC versus time

Figure 6. Diabetes and infections, first PC versus time

For general comparison we suggest using a classification of trajectories. Figure 7 presents a classification tree of the cantons based on Euclidean distances between the two-dimensional trajectories of the cantons with respect to the first PCA plane. As can be seen, the structure provided by the classification is close to the latent structure shown in Figure 2. This is explained by the distribution of causes of death, which was almost stable during the observation period, so that the structure reflected over time is similar to the one given by the common space.
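A classification of trajectories can be sketched as below. The linkage rule used for the classification tree is not stated in the text, so the example assumes a naive single-linkage agglomeration purely for illustration; the data are synthetic and the function names are hypothetical.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distances between trajectories; P has shape (n, T, k)."""
    flat = P.reshape(P.shape[0], -1)           # one row of T*k coordinates per unit
    diff = flat[:, None, :] - flat[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def single_linkage(D, n_clusters):
    """Merge the two closest clusters until n_clusters remain (assumed rule)."""
    clusters = [[i] for i in range(D.shape[0])]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy example: two well-separated groups of three trajectories each
rng = np.random.default_rng(10)
base = rng.normal(size=(2, 5, 2))
P = np.concatenate([
    base[0] + 0.01 * rng.normal(size=(3, 5, 2)),
    base[1] + 10.0 + 0.01 * rng.normal(size=(3, 5, 2)),
])
groups = single_linkage(pairwise_distances(P), n_clusters=2)
```

Recording the merge distances at each step would give the heights needed to draw a classification tree.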

Figure 7. Classification tree of cantons based on two-dimensional trajectories (order of the leaves, from left to right: GL, TI, LU, AG, ZH, BE, SO, SH, SZ, AR, SG, UR, OW, NW, AI, TG, ZG, JU, GR, FR, VD, NE, BS, BL, VS, GE)

5. Conclusion

In this article a descriptive method of analysing three-dimensional matrices is presented. Principal component analysis of a matrix, summarizing the data, provides directions of projections of statistical units for construction of their trajectories with respect to time. Plots, clustering methods, and classification trees of trajectories allow comparison of the evolution of the units. For the variables a dual approach can be performed, and comparisons with respect to time are possible.

Appendix

The following table contains the list of abbreviations of the Swiss cantons.

Table 2. Abbreviations of the Swiss cantons

 1. Zurich      ZH     8. Glarus        GL    15. Appenzell R  AR    22. Vaud       VD
 2. Berne       BE     9. Zug           ZG    16. Appenzell I  AI    23. Valais     VS
 3. Lucerne     LU    10. Fribourg      FR    17. St Gallen    SG    24. Neuchâtel  NE
 4. Uri         UR    11. Solothurn     SO    18. Grisons      GR    25. Geneva     GE
 5. Schwytz     SZ    12. Basel-City    BS    19. Aargau       AG    26. Jura       JU
 6. Obwalden    OB    13. Basel-L       BL    20. Thurgau      TG
 7. Nidwalden   NI    14. Schaffhausen  SH    21. Ticino       TI


References

Antille G. Analyse de la composante chronologique dans les tableaux croisés // Revue suisse d'économie et de statistique. 2001.

Casin Ph. L'analyse discriminante des tableaux évolutifs // Revue de Statistique Appliquée. XLIII. 1995.

Casin Ph. L'analyse en composantes principales généralisée // Revue de Statistique Appliquée. XLIV. 1996.

Escoufier Y. Objectifs et procédures de l'analyse conjointe de plusieurs tableaux de données // Statistique et Analyse de données. 1985. № 10.

Kroonenberg P. Three mode principal component analysis. Theory and applications. Leiden, reprint. 1983.
