A method for building a forecasting model with dynamic weights

The paper examines the forecasting task and some of the main problems that occur while solving it. The main existing forecasting methods, which unfortunately do not take these problems into account, are listed together with a short description of each. We propose a new approach to building forecasting methods that addresses some of the mentioned problems. Based on this approach, we construct a new forecasting method, called 'linear regression with dynamic weights', which finds concrete values of weights for the input factors depending on the values of the factors themselves. To test the forecasting ability of the method we used a set of real time series, for which we built forecasting models using the proposed method, its 'ancestor' method (pure linear regression) and the group method of data handling. The analysis of the results shows that the new method produced, on average, a lower forecasting error than linear regression, and for some time series its error was lower than that of the group method of data handling. In the conclusion we suggest some ways to improve the method in the future.

Keywords: time series forecasting, linear regression, Bayesian model averaging, neural networks

УДК 004:519.2

A METHOD FOR BUILDING A FORECASTING MODEL WITH DYNAMIC WEIGHTS

V. Sineglazov, Doctor of Technical Sciences, Professor, Department of Aviation Computer-Integrated Systems, Institute of Aerospace Control Systems, National Aviation University, Komarov Av., 1, Kyiv, Ukraine, 03680, E-mail: svm@nau.edu.ua

O. Chumachenko, Candidate of Technical Sciences, Associate Professor*, E-mail: lobach21@mail.ru

V. Gorbatiuk*, E-mail: vladislav.horbatiuk@gmail.com

*Department of Technical Cybernetics, National Technical University of Ukraine «Kyiv Polytechnic Institute», Peremogy Av., 37, Kyiv, Ukraine, 03056

1. Introduction

Forecasting has always been one of the most interesting and important problems of mankind. It is also one of the hardest problems, since to solve it we need to deal with the following issues:

a) it is impossible to take into account all the factors that influence the process we are trying to forecast; moreover, their influence can change over time - the factor which was not important today can play a major role tomorrow;

b) there are always many (sometimes infinitely many) plausible models that fit the training data well - we have to decide which model or set of models to use, and that is usually a very error-prone decision;

c) it is often hard (if not impossible) to find the optimal complexity of the model.

In this paper we introduce a method that tries to deal with the first two issues, i.e. it flexibly determines the set of models to use for the given inputs and takes into account the volatile significance/influence of the factors.

2. Problem statement

Let us have a sequence of $N$ data points $x = \{x_1, \dots, x_N\}$ measured at successive time points $\{t_1, \dots, t_N\}$, $t_i - t_{i-1} = T = \mathrm{const}$, $i = 2, \dots, N$. Then the problem of forecasting (Fig. 1) considered in this paper can be stated as follows: using the data we have (Fig. 1, a), build a model of the forecasted process that takes $n$ successive data points $x_{i-n+1}, \dots, x_i$ as input and outputs the forecast for the value $x_{i+k}$ at some future time point $t_{i+k}$ (Fig. 1, b). This model can be represented mathematically as $y = F(x_1, \dots, x_n)$, where $F$ is some unknown function. One important note is that this model can be defined implicitly or even work as a "black box" - we give it an input and receive the desired output, which serves as the forecast.

Fig. 1. Graphic representation of the forecasting problem statement: a — known values; b — future values

3. Review of existing forecasting methods

The most well-known forecasting method is probably linear regression [1]. It builds the following linear model:

$$y = F(x_1, \dots, x_n) = \sum_{i=1}^{n} w_i x_i + w_0, \qquad (1)$$

where $w_1, \dots, w_n$ are the importance weights of the input variables $x_1, \dots, x_n$ respectively; $w_0$ is the bias term, which can be omitted. The weights are usually found by minimizing the mean squared error (MSE) of the model on the training data:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{j=1}^{m} \left( \mathbf{w}^{T} \mathbf{x}_j + w_0 - y_j \right)^2, \qquad (2)$$

where $\mathbf{x}_j = [x_{j1}, \dots, x_{jn}]$ is the $j$-th training case and $y_j$ is the corresponding known output value.
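As a minimal illustration (not taken from the paper), the weights in (1)-(2) can be found with an ordinary least-squares solver; the function and argument names below are ours:

```python
# A minimal sketch of fitting model (1) by least squares, as in (2).
import numpy as np

def fit_linear_regression(X, y, use_bias=True):
    """X: (m, n) matrix of training cases, y: (m,) vector of known outputs."""
    if use_bias:
        X = np.hstack([X, np.ones((X.shape[0], 1))])  # extra column for the bias w0
    w, *_ = np.linalg.lstsq(X, y, rcond=None)          # minimizes the MSE of (2)
    return w                                           # [w1, ..., wn, (w0)]
```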

Even though linear regression remains one of the most widely used methods due to its simplicity, it has one natural limitation following from its definition: it cannot model complex nonlinear dependencies. To overcome this limitation, many nonlinear forecasting methods have been developed. Let us mention the most widely used ones.

1. Group method of data handling (GMDH) [2]. The GMDH is a family of forecasting algorithms based on a recursive selection of the best models and the subsequent construction of more complex models from the previously selected ones. The forecasting accuracy is improved by increasing the complexity of the models. The selection criterion is based on model performance on the test set, while the models' parameters are determined from the training set. The simplest models, also called base functions, usually have the following form:

$$F(x_1, \dots, x_n) = a_0 + \sum_{i=1}^{n} a_i x_i + \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij} x_i x_j + \dots \qquad (3)$$

However, any kind of base functions can be used, including harmonic series, exponential series etc.

The GMDH-like algorithms have proven to be really effective on real-life problems mainly because of their use of an external criterion (i.e. models are selected using data that wasn’t used for their training).
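As a rough, hedged illustration of this external-criterion idea (not the authors' implementation), one GMDH layer can be sketched as follows, with quadratic base functions as in (3); all names and the `keep` parameter are ours:

```python
# Candidate models are fitted on the training set but ranked by their error on a
# separate validation ("external") set; the best ones feed the next, more complex layer.
import numpy as np
from itertools import combinations

def quadratic_features(X, i, j):
    xi, xj = X[:, i], X[:, j]
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])

def gmdh_layer(X_train, y_train, X_val, y_val, keep=4):
    candidates = []
    for i, j in combinations(range(X_train.shape[1]), 2):
        A = quadratic_features(X_train, i, j)
        w, *_ = np.linalg.lstsq(A, y_train, rcond=None)          # parameters from training data
        val_err = np.mean((quadratic_features(X_val, i, j) @ w - y_val) ** 2)
        candidates.append((val_err, (i, j), w))                  # external selection criterion
    candidates.sort(key=lambda c: c[0])
    return candidates[:keep]                                     # best models survive selection
```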

2. Artificial neural networks (ANN) [3]. An ANN is a system of connected and interacting artificial neurons - mathematical models of biological neural cells. An ANN is not programmed in the usual sense of the word: it is trained. During training, the neural network is able to detect complex relationships between the input and output data and to generalize them. The ability of neural networks to forecast comes directly from their ability to generalize and find the hidden relationships between input and output data. After training, the network is able to predict the future value of a certain sequence on the basis of several previous values and/or some current factors.

Mainly two architectures are used for the forecasting task: the feed-forward neural network [4] (Fig. 2) and the recurrent neural network [5] (Fig. 3). While a feed-forward ANN essentially corresponds to a (possibly very complex) static function, a recurrent ANN adds some dynamics, i.e. it has a finite dynamic response to time series input data.

The main advantage of an ANN over other methods of forecasting is that the network can equally well model practically any functional relationship, whereas most other methods are best suited for modelling some concrete type of functions (obviously, the method of polynomial smoothing is best suited for processes with a polynomial regular component, the method of Fourier series smoothing is best suited for processes with a periodic regular component etc.). Another important advantage of neural networks is the ability to learn.
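Purely as an illustration of how such a network could be applied to embedded time-series data (this is not the paper's setup), a feed-forward ANN can be fitted with an off-the-shelf implementation; the architecture and settings below are arbitrary assumptions:

```python
# A hedged sketch using scikit-learn's MLPRegressor as the feed-forward ANN;
# the hidden layer size and other settings are illustrative, not the paper's.
from sklearn.neural_network import MLPRegressor

def fit_ffnn(X_train, y_train):
    net = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh", max_iter=5000)
    net.fit(X_train, y_train)        # "training" detects the input-output relationships
    return net                       # net.predict(X_new) then gives the forecasts
```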

Fig. 2. Feedforward neural network architecture

Fig. 3. Recurrent neural network architecture

3. Wavelet-based time series forecasting [6]. Many time series exhibit non-stationarity in their statistics. While the series may contain dominant periodic signals, these signals can vary in both amplitude and frequency over long periods of time. Ideally, one would like to separate the shorter period oscillations from the longer. Wavelet analysis attempts to solve these problems by decomposing the time series into time/frequency space simultaneously. One gets information on both the amplitude of any "periodic" signals within the series and how this amplitude varies with time.

The wavelet-based forecasting suggests the use of a discrete wavelet transform [7] to obtain the corresponding wavelet coefficients and the subsequent prediction of the future values using these coefficients as inputs.

One step of discrete wavelet transform produces so-called detail coefficients and approximation coefficients given by:

yappr[n] =t x[k]g[2n_k], (4)

k=-M

ydetail [n] =£ x[k]h[2n _ kL (5)

k=-M

where g[2n_ k] and h[2n_ k] is an impulse response of the low-pass filter and high-pass filter respectively. Usually, the

t

approximation coefficients get decomposed further multiple times (Fig. 4).

Fig. 4. Graphic representation of wavelet decomposition
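For illustration (not part of the paper), one step of the transform in (4)-(5) can be written directly as filtering followed by downsampling; the Haar filter pair used here is just an assumed example:

```python
# One discrete wavelet transform step following (4)-(5): convolve with the
# low-pass filter g and high-pass filter h, then keep every second sample (index 2n).
import numpy as np

def dwt_step(x, g=(0.70710678, 0.70710678), h=(0.70710678, -0.70710678)):
    y_appr = np.convolve(x, g)[::2]    # approximation coefficients, eq. (4)
    y_detail = np.convolve(x, h)[::2]  # detail coefficients, eq. (5)
    return y_appr, y_detail
```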


4. Various combinations of multiple methods. For instance, a combination of GMDH and ANN was suggested in [8]: instead of using predefined base functions small feedforward neural networks can be used, thus eliminating the issue with selecting the most appropriate type of base functions.

Despite the variety of existing forecasting methods, most of them can be generalized using the following equation:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}}\, E\left[F(\mathbf{w}, \mathbf{x}), X, \mathbf{y}\right], \qquad (6)$$

where $E$ is some error function that is minimized; $F(\mathbf{w}, \mathbf{x})$ is the function that represents a forecasting model (linear or nonlinear); $X$ is the matrix of training cases; $\mathbf{y}$ is the vector of known output values for the training cases. It is clear that such an approach ignores issues (a) and (b) given in the introduction: it uses a single model and assumes that the input variables have constant influence.

4. Overview of the suggested forecasting method

The main idea of the method is to 'dynamically' find a set of weights for the given inputs rather than use a single 'static' set of weights; in other words, the inputs are used both for finding the appropriate weights and for predicting the output using these weights.

We suggest naming the method 'linear regression with dynamic weights' (LRDW).

The method's inputs are the matrix of training cases $X \in \mathbb{R}^{m \times n}$ and the vector of known output values $\mathbf{y} \in \mathbb{R}^{m \times 1}$, where $m$ is the number of training cases and $n$ is the number of input variables.

The preprocessing stage needed to obtain these matrices from the raw time series $x = \{x_i\}, i = 1, \dots, N$ is left outside the scope of this method for the sake of simplicity (we suggest normalizing the time series values to the range $[-1; 1]$ and then using an embedding technique [8] with an appropriate embedding dimension and horizon of prediction to obtain these matrices); a possible preprocessing routine is sketched below.
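A minimal sketch of such a preprocessing step, under the assumptions just stated (min-max scaling to [-1; 1], time-delay embedding with dimension n and horizon h); the function and its defaults are illustrative, not prescribed by the paper:

```python
# Build the matrix of training cases X and the output vector y from a raw series.
import numpy as np

def embed_series(series, n=5, h=2):
    s = np.asarray(series, dtype=float)
    s = 2 * (s - s.min()) / (s.max() - s.min()) - 1      # scale to [-1, 1]
    X, y = [], []
    for i in range(len(s) - n - h + 1):
        X.append(s[i:i + n])                             # n successive values as inputs
        y.append(s[i + n + h - 1])                       # value h steps after the window
    return np.array(X), np.array(y)
```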

The method's parameters are the numbers $K \in \{1, 2, \dots, m\}$ and $\gamma \in (0; \infty)$, which will be described later.

Main steps of the method are:

1. Subtract the row mean from each row of the matrix $X$ (i.e. make the set of its rows zero-mean):

$$\tilde{x}_{ij} = x_{ij} - \bar{x}_j, \quad i = 1, \dots, m, \; j = 1, \dots, n, \qquad (7)$$

where $\bar{\mathbf{x}} = [\bar{x}_1, \dots, \bar{x}_n]$ is the row mean vector of the matrix $X$.

2. Find the initial 'static' weights vector $\mathbf{w}^{(in)} = [w_1^{(in)}, \dots, w_n^{(in)}]$ using standard linear regression with the error function $E = \sum_{i=1}^{m}\left(\sum_{j=1}^{n} w_j^{(in)} \tilde{x}_{ij} - y_i\right)^2$. It is important to omit a bias term - in practice, models without a bias (given that the training input vectors are zero-mean) usually have a better prediction error on the whole data set.
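A minimal sketch of steps 1-2, assuming ordinary least squares without a bias term on the centered matrix (names are ours):

```python
# Center the rows of X by subtracting the mean row (7) and fit the static weights.
import numpy as np

def static_weights(X, y):
    x_mean = X.mean(axis=0)                              # mean row of X
    X_tilde = X - x_mean                                 # eq. (7): zero-mean set of rows
    w_in, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)   # least squares, no bias term
    return w_in, x_mean, X_tilde
```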

3. To perform the next step we should introduce new error functions, one for each training case:

$$E_i = \alpha \left( \sum_{j=1}^{n} w_{ij} \tilde{x}_{ij} - y_i \right)^2 + \beta \sum_{j=1}^{n} \left( w_{ij} - w_j^{(in)} \right)^2, \qquad (8)$$

where $\alpha$ and $\beta$ are some constants, $\alpha, \beta \in (0; \infty)$; $\mathbf{w}_i = [w_{i1}, \dots, w_{in}]$ are the new, 'dynamic' weights vectors, one for each training case. We need to make the squared prediction error for the $i$-th training case small by choosing the appropriate weights $\mathbf{w}_i$, and to keep these weights close to the static ones $\mathbf{w}^{(in)}$, in order to minimize the particular error function $E_i$. The tradeoff between how much to reduce the error and how close the weights $\mathbf{w}_i$ should be to the initial ones is controlled by the parameters $\alpha$ and $\beta$: if we set $\alpha > \beta$, we want to improve the error more than to keep the weights, and vice versa. To reduce the number of the method's parameters we can divide all error functions by $\beta$ and let $\gamma = \alpha / \beta$; now we can see the meaning of the second parameter $\gamma$: choosing $\gamma > 1$ is equivalent to choosing $\alpha > \beta$, and $\gamma < 1 \Leftrightarrow \alpha < \beta$. When the input values lie in the range $[-1; 1]$, a suitable choice of $\gamma$ is somewhere between 0.1 and 0.3.

4. Find the optimal set of weights $\mathbf{w}_i^*$ for each error function $E_i$ by solving the following linear system, obtained as a result of finding the partial derivatives with respect to the corresponding weights and equating them to 0:

$$A_i \mathbf{w}_i^* = \mathbf{b}_i, \qquad (9)$$

where

$$A_i = \begin{bmatrix} \gamma \tilde{x}_{i1}^2 + 1 & \gamma \tilde{x}_{i1}\tilde{x}_{i2} & \cdots & \gamma \tilde{x}_{i1}\tilde{x}_{in} \\ \gamma \tilde{x}_{i2}\tilde{x}_{i1} & \gamma \tilde{x}_{i2}^2 + 1 & \cdots & \gamma \tilde{x}_{i2}\tilde{x}_{in} \\ \vdots & \vdots & \ddots & \vdots \\ \gamma \tilde{x}_{in}\tilde{x}_{i1} & \gamma \tilde{x}_{in}\tilde{x}_{i2} & \cdots & \gamma \tilde{x}_{in}^2 + 1 \end{bmatrix}, \quad \mathbf{b}_i = \begin{bmatrix} \gamma y_i \tilde{x}_{i1} + w_1^{(in)} \\ \vdots \\ \gamma y_i \tilde{x}_{in} + w_n^{(in)} \end{bmatrix}, \quad \mathbf{w}_i^* = \begin{bmatrix} w_{i1}^* \\ \vdots \\ w_{in}^* \end{bmatrix}. \qquad (10)$$

Thus, the set of weights $\mathbf{w}_i^*$ that minimizes the error function $E_i$ can be found as $\mathbf{w}_i^* = A_i^{-1} \mathbf{b}_i$.
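A hedged sketch of steps 3-4 under the notation above, solving (9) directly for each training case; the function and variable names are ours:

```python
# For each training case, form A_i and b_i as in (10) and solve the system (9).
import numpy as np

def dynamic_weights(X_tilde, y, w_in, gamma=0.2):
    m, n = X_tilde.shape
    W_dyn = np.empty((m, n))
    for i in range(m):
        xi = X_tilde[i]
        A_i = gamma * np.outer(xi, xi) + np.eye(n)       # eq. (10), left-hand matrix
        b_i = gamma * y[i] * xi + w_in                   # eq. (10), right-hand side
        W_dyn[i] = np.linalg.solve(A_i, b_i)             # w*_i = A_i^{-1} b_i
    return W_dyn
```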

5. Remember the matrix of discrete derivatives for each training case:

$$V = \begin{bmatrix} x_{12} - x_{11} & \cdots & x_{1n} - x_{1,n-1} \\ \vdots & \ddots & \vdots \\ x_{m2} - x_{m1} & \cdots & x_{mn} - x_{m,n-1} \end{bmatrix}. \qquad (11)$$

6. The forecast for a new input vector $\mathbf{x} = [x_1, \dots, x_n]^T$ is formed as follows:

a. Find the vector of derivatives $\mathbf{v} = [x_2 - x_1, \dots, x_n - x_{n-1}]$.

b. Find the $K$ nearest neighbors of $\mathbf{v}$ in the matrix $V$, and remember:

- their indices in sorted order, from nearest to furthest: $idx = [idx_1, \dots, idx_K]$;

- the distances to the neighbors $\mathbf{d} = [d_1, \dots, d_K]$, $d_1 < d_2 < \dots < d_K$;

- the total distance $D = \sum_{i=1}^{K} d_i$.

c. Find the $K$ stored weights vectors for the corresponding training cases:

$$W = \begin{bmatrix} w_{idx_1,1}^* & \cdots & w_{idx_K,1}^* \\ \vdots & \ddots & \vdots \\ w_{idx_1,n}^* & \cdots & w_{idx_K,n}^* \end{bmatrix} = \begin{bmatrix} \mathbf{w}_{idx_1}^* & \cdots & \mathbf{w}_{idx_K}^* \end{bmatrix}. \qquad (12)$$

d. Find the weights vector that will actually be used for the prediction. To do this, we should average the found weights vectors depending on the distance from $\mathbf{v}$ to the vector of discrete derivatives of the corresponding training case. The averaging weights $\boldsymbol{\omega} = [\omega_1, \dots, \omega_K]^T$ are calculated from the distances $\mathbf{d}$ and the total distance $D$, i.e. the weights vector for the nearest neighbor gets the biggest weight.

e. Finally, the forecast is calculated as

$$y = (W \boldsymbol{\omega})^T \mathbf{x}. \qquad (13)$$

To sum up, the method finds a separate weights vector for each training case and then calculates the forecast for new inputs by finding the weights for the $K$ nearest training cases (nearest in the sense of the Euclidean distance between the vectors of derivatives), weighting them based on the distance to produce a single set of weights, and then applying these weights to the inputs. It is obvious that with this approach the weights, i.e. the importance of the input variables, will be different for different input vectors. Also, since each set of weights defines a corresponding forecasting model, we are not using a constant set of models - instead, we find the most appropriate set depending on the input. The parameter $K$ plays a 'smoothing' role: the bigger the $K$, the more weights vectors are averaged and the closer the average is to the weights of a linear regression.

When searching for the nearest neighbors, the vectors of derivatives are used instead of the original vectors because for the time series forecasting problem the dynamics (i.e. how the values change over time) is usually much more important and representative than the exact values of the forecasted process; for other problems, where the inputs are not successive points of some time series, the original vectors should be used.

The method is somewhat similar to locally linear regression (LLR) [9] and Bayesian model averaging (BMA) [10]: it finds some kind of a local model for each training case, similar to LLR, and averages multiple models for the given inputs, just like BMA. However, LLR loses global information (the 'static' weights of a linear regression) while building local regressions, and as a result these local models can overfit badly. And in contrast to BMA, where the set of averaged models is constant and the models' outputs are averaged, the proposed method selects the models to average depending on the inputs and averages the models themselves, not their outputs (there is no difference in the linear case, but in general these two averaging methods are not equivalent).
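A minimal sketch of the whole forecasting stage (steps a-e) is given below. The inverse-distance averaging weights and the centering of the new input are our assumptions, since the text only requires that the nearest neighbor's weights vector gets the largest averaging weight:

```python
# Forecast for a new input vector using the stored dynamic weights W_dyn.
import numpy as np

def lrdw_forecast(x_new, x_mean, X_tilde, W_dyn, K=1, eps=1e-12):
    x_c = np.asarray(x_new, dtype=float) - x_mean        # same centering as training rows (assumed)
    v = np.diff(x_c)                                     # step 6a: vector of derivatives
    V = np.diff(X_tilde, axis=1)                         # eq. (11): derivatives of training cases
    dist = np.linalg.norm(V - v, axis=1)                 # step 6b: Euclidean distances
    idx = np.argsort(dist)[:K]                           # K nearest neighbors
    omega = 1.0 / (dist[idx] + eps)                      # assumed averaging weights (inverse distance)
    omega /= omega.sum()                                 # normalize so the weights sum to 1
    w = W_dyn[idx].T @ omega                             # step 6d: averaged weights vector
    return float(w @ x_c)                                # eq. (13): the forecast
```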

5. Testing performance of the proposed method

To test the performance of the proposed method, a set of 11 publicly available time series ([11, 12]) was used. Linear regression and GMDH were used for comparison. All methods shared the following parameters:

- time series embedding dimension (equal to the number of input variables) n = 5;

- horizon of prediction h = 2 (predicting the value 2 time steps ahead);

- ratio of training set size to full data set size r = 0.5 (half of the cases were used for model training).

A bias term was omitted for both the LRDW and the linear regression; the default parameter values of the specific GMDH implementation [13] were used; the method's parameters were set to $\gamma = 0.2$ and $K = 1$.

The normalized squared error (NSE), given by the formula

$$E = \frac{\sum_{i} \left( F(x_{i1}, \dots, x_{in}) - y_i \right)^2}{\sum_{i} y_i^2},$$

was used as the model performance indicator. It was calculated on the full data set. The obtained results are given in Table 1.
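A small helper matching this formula (names are ours):

```python
# Normalized squared error of a vector of forecasts against the known values.
import numpy as np

def nse(y_pred, y_true):
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.sum((y_pred - y_true) ** 2) / np.sum(y_true ** 2)
```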

Table 1

NSE of tested models on the full data set

Name of time series | Linear regression | LRDW | GMDH
Australian electricity production | 0.017662 | 0.019721 | 0.012685
CATS benchmark [14] | 0.002894 | 0.002696 | 0.002901
Dollar to euro exchange rate | 0.063802 | 0.05511 | 0.062086
Dollar to pound exchange rate | 0.055874 | 0.050277 | 0.058154
Consumer price index (CPI) | 5.50E-05 | 0.007696 | 2.22E-05
Spanish electric energy demand | 0.019363 | 0.024104 | 0.017655
Spanish mean interest rates | 0.055512 | 0.048002 | 0.053009
Spanish stock exchange index | 0.002652 | 0.005721 | 0.002495
Sunspots per month | 0.5811 | 0.45099 | 0.17865
US aviation shipments | 0.20121 | 0.15927 | 0.13734
Winter NAO index | 1.0566 | 0.98757 | 1.0009
Total error | 2.0567 | 1.8112 | 1.5259

Short analysis of the obtained results:

- the LRDW has better NSE on most but not all time series - so we need to carefully choose its parameters, especially $\gamma$;

- the average improvement in error is about 12 % relative to the NSE of a linear regression (and the biggest improvement is ≈ 22.3 % for the 'Sunspots per month' time series);

- in general, GMDH performs better than the LRDW - however, the approach we used to obtain LRDW from a linear regression can be easily applied to other forecasting methods, including GMDH - and it can possibly boost their performance as well;

- there are several time series for which LRDW performed even better than GMDH.

A graphical example of the LRDW model producing better forecasts than the linear regression is given in Fig. 5 ('US aviation shipments' time series).

As you can see, the forecast of the LRDW method is very similar to the one obtained by linear regression, but for some cases the proposed method gives much more accurate predictions (the training cases were selected randomly).

Fig. 5. Forecasts obtained by two different methods: solid line — original time series, dotted line — LRDW forecast, line with markers — linear regression forecast

6. Conclusion

The proposed method was tested on real data, and its performance (measured using the NSE criterion) is usually better than the performance of the method it 'originated' from - linear regression. Hence, we believe that applying the same approach to other methods, including nonlinear ones like GMDH or neural networks, can improve their performance as well.

There are also possible improvements to the approach itself:

• instead of finding dynamic weights for each training case it is possible to find them for some clusters of training cases to improve the method’s runtime efficiency;

• suitable choices for the method's parameters could possibly be determined from the training data - for example, the value of the $\gamma$ parameter could somehow depend on the ratio between the total magnitude of the static weights (i.e. the sum of their values) and the magnitude of the error for a given training case (when using these static weights);

• instead of finding nearest neighbors and averaging the corresponding dynamic weights we can build a model to predict the weights values from the inputs values using any suitable forecasting method.

References

1. Cook, R. D. Influential Observations in Linear Regression [Text] / R. D. Cook // Journal of the American Statistical Association. - 1979. - № 74. - P. 169-174.

2. Stepashko, V. S. GMDH Algorithms as Basis of Modeling Process Automation after Experimental Data [Text] / V. S. Stepashko // Sov. J. of Automation and Information Sciences. - 1988. - № 21 (4). - P. 43-53.

3. Rosenblatt, F. The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain [Text] / F. Rosenblatt // Psychological Review. - 1958. - № 65 (6). - P. 386-408.

4. Auer, P. A learning rule for very simple universal approximators consisting of a single layer of perceptrons [Text] / P. Auer, B. Harald, M. Wolfgang // Neural Networks. - 2008. - № 21 (5). - P. 786-795.

5. Elman, J. L. Finding Structure in Time [Text] / J. L. Elman // Cognitive Science. - 1990. - № 14 (2). - P. 179-211.

6. Benaouda, D. Wavelet-based nonlinear multi-scale decomposition model for electricity load forecasting [Text] / D. Benaouda, F. Murtagh, J. L. Starck, O. Renaud // Neurocomputing. - 2006. - № 70. - P. 139-154.

7. Akansu, A. N. Wavelet Transforms in Signal Processing: A Review of Emerging Applications [Text] / A. N. Akansu, W. A. Serdijn, I. W. Selesnick // Physical Communication, Elsevier. - 2010. - № 3 (1). - P. 1-18.

8. Sineglazov, V. An algorithm for solving the problem of forecasting [Text] / V. Sineglazov, E. Chumachenko, V. Gorbatiuk // Aviation. - 2013. - № 17 (1). - P. 9-13.

9. Cleveland, W. S. Robust Locally Weighted Regression and Smoothing Scatterplots [Text] / W. S. Cleveland // Journal of the American Statistical Association. - 1979. - № 74 (368). - P. 829-836.

10. Hoeting, J. A. Bayesian Model Averaging: A Tutorial [Text] / J. A. Hoeting, D. Madigan, A. E. Raftery, C. T. Volinsky // Statistical Science. - 1999. - № 14 (4). - P. 382-401.

11. U.S. General Aviation Aircraft Shipments and Sales [Electronic resource] / Barr Group Aerospace & AeroWeb. - Available at: http://www.bga-aeroweb.com/database/Data3/US-General-Aviation-Aircraft-Sales-and-Shipments.xls. - 2014.

12. Data Sets for Time-Series Analysis [Electronic resource] / Evolutionary and Neural Computation for Time Series Prediction Minisite. - Available at: http://tracer.uc3m.es/tws/TimeSeriesWeb/repo.html. - 2005.

13. Jekabsons, G. GMDH-type Polynomial Neural Networks for Matlab [Electronic resource] / Gints Jekabsons. Regression software and datasets. - Available at: http://www.cs.rtu.lv/jekabsons/. - 2013.

14. Lendasse, A. Time Series Prediction Competition: The CATS Benchmark [Text] / A. Lendasse, E. Oja, O. Simula, M. Verleysen // International Joint Conference on Neural Networks, Budapest (Hungary), IEEE. - 2004. - P. 1615-1620.
