
Journal of Siberian Federal University. Engineering & Technologies, 2018, 11(8), 964-973

UDC 519.24

Piecewise Polynomial Models for Aggregation and Regression Analysis in Remote Sensing of the Earth Problems

Olga A. Popova*

Siberian Federal University 79 Svobodny, Krasnoyarsk, 660041, Russia

Received 27.12.2017, received in revised form 19.08.2018, accepted 21.10.2018

We discuss procedures of data aggregation as a preprocessing stage for subsequent regression modeling. An important feature of the study is a demonstration of how to represent the aggregated data. It is proposed to use piecewise polynomial models, including spline aggregation functions. We show that the proposed approach to data aggregation can be interpreted as a frequency distribution; the density function concept is used to study its properties. Applying data aggregation models as input and output variables, we propose a new linear regression model on probability density function values (Distributions Regression). To calculate the data aggregation and the regression model we employ numerical probabilistic analysis (NPA). To demonstrate the degree of correspondence of the proposed methods to reality, we developed a theoretical framework and considered numerical examples.

Keywords: numerical probabilistic analysis, data aggregation, regression modeling, piecewise polynomial model, density function.

Citation: Popova O.A. Piecewise polynomial models for aggregation and regression analysis in remote sensing of the Earth problems, J. Sib. Fed. Univ. Eng. technol., 2018, 11(8), 964-973. DOI: 10.17516/1999-494X-0118.

Piecewise Polynomial Models for Aggregation and Regression Analysis in Remote Sensing of the Earth Problems

O.A. Popova

Siberian Federal University, 79 Svobodny, Krasnoyarsk, 660041, Russia

New approaches are proposed for the study and analysis of dependencies in empirical data. The issues of aggregation and various kinds of mathematical models of aggregated data are discussed. For large volumes of data, it is proposed to use aggregation procedures based on piecewise polynomial models. The issues of improving the accuracy of constructing piecewise polynomial models in the form of polynomial splines are considered, as are new approaches to the reconstruction of functional dependencies based on spline aggregation.

© Siberian Federal University. All rights reserved

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). Corresponding author E-mail address: BDobronets@yandex.ru

Keywords: numerical probabilistic analysis, data aggregation, regression analysis, piecewise polynomial models, density functions.

1. Introduction

Linear regression analysis is often used to discover dependencies in empirical data. Properties of the empirical data, including the type and level of uncertainty, significantly affect the results of modeling. It is well known that the concept of random uncertainty addressed by probability theory plays a fundamental role in the study of data uncertainty.

This work continues the research begun in [1]. Linear regression for interval-valued and histogram-valued variables is a new mathematical branch and presents a new approach to processing uncertain data [2-5].

This problem becomes more complicated when large amounts of data are processed. In this case it is useful to look at the empirical data in an aggregated form. Aggregation is a popular method of converting data. For example, the use of histograms allows one to reduce the dimension of a data set and the level of uncertainty and to significantly increase the efficiency of numerical calculations. It is important to note that histograms are examples of symbolic data used in symbolic data analysis [3]. Symbolic data analysis and data mining use histograms to study a variety of processes and to model the variability of quantitative characteristics. Histogram data models and histogram regression models based on symbolic analysis are a new and important direction for knowledge discovery in databases.

In our study we consider a new approach to regression modeling using input data aggregation. To perform efficient aggregation we employ piecewise polynomial aggregation functions, including piecewise linear and piecewise constant functions. The histogram is a good example of a piecewise constant function and is employed throughout our study.

To examine the structure of data aggregation we use probability density functions (PDF). The concept of mathematical aggregation functions is applied to regression modeling. To illustrate this we consider the spline aggregation function in more detail. This approach allows us to employ density function models as input and output data. Furthermore, data uncertainty is studied to identify the relationship between input and output characteristics when the input probability density functions are unknown. Thus, in order to describe a specific PDF, we consider its spline interpolation.

In this work we propose a new linear regression model named PDF-valued variable regression, or Distributions Regression for short. If a spline aggregation model is used, it is named a PDF-spline valued variable regression model. The following statements justify PDF-spline models: the application of spline procedures allows aggregation of big data, reduces the level of uncertainty, and significantly increases the efficiency of numerical calculations; moreover, such splines provide considerably accurate representations of arbitrary distributions.

To demonstrate the degree of relevance of the proposed methods to reality, we developed a theoretical study and provide numerical examples to illustrate it. We also offer a concluding discussion of the applicability of this approach to uncertainty treatment and big data processing. A comparison of NPA and the Monte Carlo method showed good agreement of the results. At the same time, numerical experiments demonstrate that PDF arithmetic is more than a hundred times faster than the Monte Carlo method [6]. As a result, the NPA approach can be successfully applied to solving computational and engineering problems.

The structure of the remaining sections is as follows. Section 2 reviews data aggregation models. Section 3 discusses numerical arithmetic for probability density functions. In Section 4 we review spline interpolation, and in Section 5 we study spline aggregation. The application of the regression approach to spline-aggregated time series is discussed in Section 6; Section 7 concludes the paper.

2. Data aggregation

In this section we look at data aggregation as a preprocessing method for subsequent numerical modeling.

The essence of the aggregation procedure is a set of methods for reducing the dimension of the original empirical data, discovering knowledge, and reducing data uncertainty. Data aggregation plays a central role in the process of extracting useful information from a large volume of data: it reduces the initial data set to a smaller data collection. Aggregation can be considered a process of converting data with a high degree of detail into a more generalized representation. Examples of such procedures are simple summation and calculation of the average, median, mode, and range of maximum or minimum values.

The aggregation procedure has its advantages and disadvantages. On the positive side, we note that detailed data are often very volatile due to the impact of various random factors, which makes it difficult to discover general trends and data patterns. In many cases it is useful first to look at numerical big data in an aggregated form such as a sum or an average.

It is important to bear in mind that aggregation procedures such as averaging, exclusion of extreme values (outliers), and smoothing can lead to a loss of important information. There are various methods of data aggregation, so the choice of an aggregation method is a complex problem: wrongly selected numerical methods may introduce additional uncertainty that was not present in the original problem.

Data aggregation can use various mathematical models. NPA outlines the following key models: histograms, frequency polygons, and splines.

Histograms. A random variable whose probability density function is represented by a piecewise constant function is called a histogram variable. Any histogram $P$ is defined by a grid $\{x_i,\ i = 0, \ldots, n\}$ and a set of values $p_i,\ i = 1, \ldots, n$, such that the histogram takes the constant value $p_i$ on the interval $[x_{i-1}, x_i]$.
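As an illustrative sketch (not from the paper; the normal test sample and all names are our own choices), a histogram density of this kind can be built and evaluated with NumPy:

```python
import numpy as np

# A histogram as a piecewise constant probability density function.
# The grid x[0..n] and values p[1..n] follow the definition above;
# numpy.histogram with density=True normalizes so the total area is 1.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=10_000)

p, x = np.histogram(sample, bins=20, density=True)  # p: heights, x: grid

def histogram_pdf(t, x=x, p=p):
    """Evaluate the piecewise constant density at point t."""
    if t < x[0] or t >= x[-1]:
        return 0.0
    i = np.searchsorted(x, t, side="right") - 1
    return p[i]

# The area under the histogram equals 1 (it is a density).
area = np.sum(p * np.diff(x))
print(abs(area - 1.0) < 1e-9)
```

Evaluating the density is then a constant-time bin lookup, which is what makes histogram arithmetic cheap compared with sample-level computations.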

Histograms are widely used for the processing and analysis of remote sensing data. For example, in [?] the problem of studying natural processes on the basis of space and ground monitoring data is considered. In addition to histograms, we discuss piecewise linear functions (polygons) and splines.

The piecewise linear function (frequency polygon). Piecewise linear functions can be considered a tool for approximating the density function of a random variable. A piecewise linear function is composed of straight-line segments. The frequency polygon (FP) is a continuous density estimator based on the histogram: in one dimension, it is the linear interpolant of the midpoints of an equally spaced histogram. These are the simplest splines. Although such functions are relatively simple, they have good approximating properties and can serve as tools for approximating probability density functions.
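A minimal sketch of the frequency polygon, assuming an equally spaced histogram (the sample, bin count, and names are illustrative):

```python
import numpy as np

# Frequency polygon: linear interpolation of the midpoints of an
# equally spaced histogram.
rng = np.random.default_rng(1)
sample = rng.normal(size=10_000)

p, x = np.histogram(sample, bins=30, density=True)
mids = 0.5 * (x[:-1] + x[1:])          # bin midpoints

def frequency_polygon(t):
    """Continuous piecewise linear density estimate."""
    return np.interp(t, mids, p, left=0.0, right=0.0)

# At a bin midpoint the polygon reproduces the histogram height.
print(frequency_polygon(mids[10]) == p[10])
```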

Spline. A spline is a sufficiently smooth piecewise-defined polynomial function possessing a high degree of smoothness at the places where the polynomial pieces connect (known as knots). We consider the probability density of a random variable approximated by a spline.

To adapt the classical regression model to these kinds of data aggregation, considered as PDF-valued variables, we must use the relevant numerical arithmetic. To this end, we consider in more detail the numerical operations on probability density functions developed within the framework of numerical probabilistic analysis.

3. Numerical arithmetic for Probability Density Function

Let $(x_1, x_2)$ be a system of two continuous random variables with joint probability density function $p(x_1, x_2)$. The densities resulting from arithmetic operations on random variables following those distributions are given in [6]. For example, the probability density function $p_{x+y}$ of the sum of the two random variables $x_1 + x_2$ is

$$p_{x+y}(x) = \int_{-\infty}^{\infty} p(x - v, v)\,dv = \int_{-\infty}^{\infty} p(v, x - v)\,dv. \qquad (1)$$

To compute the probability density function $p_{x/y}$ resulting from dividing the two random variables $x_1, x_2$ we use

$$p_{x/y}(x) = \int_{0}^{\infty} v\,p(xv, v)\,dv - \int_{-\infty}^{0} v\,p(xv, v)\,dv. \qquad (2)$$

The probability density function $p_{xy}$ resulting from multiplying the two random variables is computed accordingly by

$$p_{xy}(x) = \int_{0}^{\infty} \frac{1}{v}\,p(x/v, v)\,dv - \int_{-\infty}^{0} \frac{1}{v}\,p(x/v, v)\,dv. \qquad (3)$$

Commutativity and associativity of addition and multiplication follow directly from these expressions. Let us consider the question of the existence of inverse elements for the sum and multiplication operations. Note that the equation $a + x = 0$ has a solution that can be represented as a joint probability density function of $(a, x)$, where $p_a(x)$ is the probability density function of the random variable $a$.

Thereby $p_{-a}(x, y)$ gives the inverse element of the sum operation for $a$. Similarly, we can easily construct the inverse element for multiplication. Let $f(x)$ be the probability density function of a random variable $x$. Then

$$f_a(x) = \frac{1}{a} f\!\left(\frac{x}{a}\right) \qquad (4)$$

is the probability density function of the random variable $ax$, where $a$ is a real parameter.

We consider the problem of calculating the derivative of $ax$ with respect to the parameter $a$. It follows directly from the definition of the derivative that

$$\frac{d}{da}\,ax = \lim_{h \to 0} \frac{(a + h)x - ax}{h} = x.$$

Consider the case of computing the derivative of $f_a(x)$ with respect to the parameter $a$:

$$\frac{d}{da} f_a(x) = \frac{d}{da}\left(\frac{1}{a} f\!\left(\frac{x}{a}\right)\right) = -\frac{1}{a^2} f\!\left(\frac{x}{a}\right) - \frac{x}{a^3} f'\!\left(\frac{x}{a}\right). \qquad (5)$$
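Formula (5) can be sanity-checked numerically; the following sketch (our own, with a standard normal density chosen for $f$) compares the analytic derivative with a central finite difference:

```python
import math

# Numerical check of formula (5): d/da [ (1/a) f(x/a) ] for a standard
# normal density f, compared with a central finite difference.
def f(t):                       # standard normal density
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def fprime(t):                  # its derivative f'(t) = -t f(t)
    return -t * f(t)

def f_a(x, a):                  # density of a*X, a > 0
    return f(x / a) / a

def dfa_da_analytic(x, a):      # right-hand side of (5)
    return -f(x / a) / a**2 - x * fprime(x / a) / a**3

x, a, h = 0.7, 1.3, 1e-6
numeric = (f_a(x, a + h) - f_a(x, a - h)) / (2.0 * h)
print(abs(numeric - dfa_da_analytic(x, a)) < 1e-6)
```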

Next, we calculate the derivative with respect to $a$ of the probability density function $f_{ax+y}$ of the sum $ax + y$:

$$\frac{d}{da} f_{ax+y}(z) = \frac{d}{da} \int_{-\infty}^{\infty} \frac{1}{a} f_1\!\left(\frac{v}{a}\right) f_2(z - v)\,dv = \int_{-\infty}^{\infty} \left( -\frac{1}{a^2} f_1\!\left(\frac{v}{a}\right) - \frac{v}{a^3} f_1'\!\left(\frac{v}{a}\right) \right) f_2(z - v)\,dv. \qquad (6)$$

Let $p_1$ be the probability density function of $x$ and $p_2$ the probability density function of $y$. If the random variables $(x, y)$ are independent, we can calculate the joint probability density function as the product $p(x, y) = p_1(x) p_2(y)$.

Let $p_i$ be represented by piecewise polynomial functions on the meshes $\omega_i = \{z_l^{(i)},\ l = 0, \ldots, n_i\}$, where $p_i$ is a polynomial $s^{(i)}$ on each interval $[z_k^{(i)}, z_{k+1}^{(i)}]$. Consequently $p(x_1, x_2)$ is the polynomial $s^{(1)}(x_1) s^{(2)}(x_2)$ on $[z_k^{(1)}, z_{k+1}^{(1)}] \times [z_l^{(2)}, z_{l+1}^{(2)}]$. As an example, we consider building the probability density function $p_{x+y}$ of the sum of two variables $x + y$.

Let it be necessary to calculate the value of the probability density function $p_{x+y}(x)$ of the sum of the two variables at a certain point $x$. The integral (1) is calculated numerically. We define the indexes $(k, l) \in I_x$ of the cells $[z_k^{(1)}, z_{k+1}^{(1)}] \times [z_l^{(2)}, z_{l+1}^{(2)}]$ that intersect the line $x_1 + x_2 = x$. On each such cell, the calculation of the integral (1) reduces to calculating the integral

$$\mathrm{Int}_{(k,l)} = \int_{z_{k^*}^{(1)}}^{z_{k^*+1}^{(1)}} s_k^{(1)}(v)\, s_l^{(2)}(x - v)\,dv,$$

where $z_{k^*}^{(1)}, z_{k^*+1}^{(1)}$ are the coordinates of the projection onto the $x_1$ axis of the segment

$$[z_k^{(1)}, z_{k+1}^{(1)}] \times [z_l^{(2)}, z_{l+1}^{(2)}] \cap \{x_1 + x_2 = x\}.$$

Since the integrand is a product of polynomials, we can calculate $\mathrm{Int}_{(k,l)}$ either analytically or by using numerical integration procedures that are exact on polynomials.

Finally, we obtain

$$p_{x+y}(x) = \sum_{(k,l) \in I_x} \mathrm{Int}_{(k,l)}.$$

To construct a piecewise polynomial approximation of $p_{x+y}$, we construct a mesh $\omega = \{x_0, x_1, \ldots, x_n\}$ in the support domain and compute $f_i = p_{x+y}(x_i)$. Next, using the values $f_i$ on the mesh $\omega$, we construct, for example, a cubic spline $s$. In this case the estimate

$$\| p_{x+y}^{(\nu)} - s^{(\nu)} \| \le K h^{4-\nu} \| p_{x+y}^{(4)} \|, \quad \nu = 1, 2, 3$$

holds. The normalization value is then computed as

$$\mathrm{norm} = \int s(x)\,dx,$$

and if $\mathrm{norm} \ne 1$, we finally set $s(x) := s(x)/\mathrm{norm}$.
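The construction above can be illustrated in the simplest piecewise constant case. The sketch below (our own; uniform densities are chosen for concreteness) approximates integral (1) by a discrete convolution and applies the final normalization step:

```python
import numpy as np

# A sketch of formula (1) for independent variables with piecewise
# constant (histogram) densities: p_{x+y}(x) = integral p1(v) p2(x-v) dv.
# The convolution of two Uniform(0,1) densities is the triangular
# density on [0, 2]. Grid step and names are illustrative.
h = 0.001
v = np.arange(0.0, 1.0, h)                # common grid for p1, p2
p1 = np.ones_like(v)                      # Uniform(0, 1) density
p2 = np.ones_like(v)

# Discrete convolution approximates the integral; multiply by h.
p_sum = np.convolve(p1, p2) * h
x = np.arange(len(p_sum)) * h             # support starts at 0 + 0 = 0

# Renormalize, as in the text, so that the result integrates to 1.
p_sum /= np.sum(p_sum) * h

# The triangular density peaks near x = 1 with value close to 1.
print(abs(x[np.argmax(p_sum)] - 1.0) < 0.01)
```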

4. Spline interpolation

Prior to considering spline aggregation, we study mathematical models applicable to the representation of splines and discuss the interpolation questions associated with them.

Let $\omega = \{x_0 < x_1 < x_2 < \ldots < x_n\}$ be a mesh with the interpolation conditions

$$s(x_i) = f(x_i), \quad i = 0, \ldots, n,$$

and boundary conditions

$$s''(x_0) = 0, \quad s''(x_n) = 0.$$

The cubic spline on the mesh $\{x_i\}$ with step $h = \max_i (x_{i+1} - x_i)$, $i = 0, \ldots, n$, satisfies the estimate

$$\| f^{(\nu)} - s^{(\nu)} \| \le h^{4-\nu} \| f^{(4)} \|, \quad \nu = 0, 1, 2. \qquad (7)$$

The task of spline construction reduces to solving a system of linear algebraic equations with a tridiagonal matrix [8]:

$$\lambda_j m_{j-1} + 2 m_j + \mu_j m_{j+1} = d_j, \qquad (8)$$

$$2 m_0 + m_1 = 3 (f_1 - f_0)/h_1 - h_1 z_0''/2,$$

$$2 m_N + m_{N-1} = 3 (f_N - f_{N-1})/h_N + h_N z_N''/2,$$

$$d_j = 3 \lambda_j (f_j - f_{j-1})/h_j + 3 \mu_j (f_{j+1} - f_j)/h_{j+1}, \quad j = 1, \ldots, N - 1,$$

where $m_i = s'(x_i)$. As a result, the cubic spline on the intervals $[x_{j-1}, x_j]$, $j = 1, \ldots, N$, has the following representation [8]:

$$s(x) = m_{j-1} (x_j - x)^2 (x - x_{j-1})/h_j^2 - m_j (x - x_{j-1})^2 (x_j - x)/h_j^2 + f_{j-1} (x_j - x)^2 (2(x - x_{j-1}) + h_j)/h_j^3 + f_j (x - x_{j-1})^2 (2(x_j - x) + h_j)/h_j^3. \qquad (9)$$
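A direct NumPy sketch of this construction: we assemble system (8) with natural boundary conditions ($z_0'' = z_N'' = 0$) and the standard weights $\lambda_j = h_{j+1}/(h_j + h_{j+1})$, $\mu_j = h_j/(h_j + h_{j+1})$ (a conventional choice, not spelled out in the text), then evaluate by formula (9):

```python
import numpy as np

# Solve system (8) for the slopes m_j = s'(x_j), natural boundary
# conditions, then evaluate the spline by representation (9).
def cubic_spline_slopes(x, f):
    """Build and solve the tridiagonal system (8)."""
    N = len(x) - 1
    h = np.diff(x)                       # h[j-1] = h_j = x_j - x_{j-1}
    A = np.zeros((N + 1, N + 1))
    d = np.zeros(N + 1)
    A[0, 0], A[0, 1] = 2.0, 1.0
    d[0] = 3.0 * (f[1] - f[0]) / h[0]
    A[N, N], A[N, N - 1] = 2.0, 1.0
    d[N] = 3.0 * (f[N] - f[N - 1]) / h[N - 1]
    for j in range(1, N):
        lam = h[j] / (h[j - 1] + h[j])      # lambda_j
        mu = h[j - 1] / (h[j - 1] + h[j])   # mu_j
        A[j, j - 1], A[j, j], A[j, j + 1] = lam, 2.0, mu
        d[j] = 3.0 * lam * (f[j] - f[j - 1]) / h[j - 1] \
             + 3.0 * mu * (f[j + 1] - f[j]) / h[j]
    return np.linalg.solve(A, d)

def spline_eval(t, x, f, m):
    """Evaluate s(t) by formula (9) on the interval containing t."""
    j = min(max(np.searchsorted(x, t), 1), len(x) - 1)
    h = x[j] - x[j - 1]
    a, b = x[j] - t, t - x[j - 1]
    return (m[j - 1] * a**2 * b / h**2 - m[j] * b**2 * a / h**2
            + f[j - 1] * a**2 * (2 * b + h) / h**3
            + f[j] * b**2 * (2 * a + h) / h**3)

x = np.linspace(0.0, np.pi, 21)
f = np.sin(x)                        # f'' vanishes at both endpoints
m = cubic_spline_slopes(x, f)
err = max(abs(spline_eval(t, x, f, m) - np.sin(t))
          for t in np.linspace(0.0, np.pi, 1001))
print(err < 1e-4)
```

The test function $\sin x$ on $[0, \pi]$ has vanishing second derivatives at both endpoints, so the natural boundary conditions do not degrade the fourth-order accuracy of estimate (7).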

5. Spline aggregation

Let us consider the spline approach to building a regression model with distribution-valued variables. This approach is useful for the following reasons. Underlying it is the notion of the spline: a mathematical object that is easy to describe and on which mathematical procedures and operations are easy to compute, while the essence of the data frequency distribution is preserved.

Since a spline is a piecewise polynomial function, it can be regarded as a data aggregation function. An aggregation function performs numerical calculations on a data set and returns spline values. Splines are useful for data uncertainty analysis because they adequately represent the distributions of random variables.

Despite its simplicity, the spline covers all possible ranges of probability density function estimation. The simple and flexible spline structure greatly simplifies its use in numerical calculations, and it has a clear visual image, which is useful for analytical conclusions. It is important to note that the construction of regression models with aggregated inputs requires the use of appropriate numerical procedures. To this end, we use numerical probabilistic analysis to compute arithmetic operations on the aggregated data and apply it to regression modeling.

Let us assume that we have a sample $H = \{\xi_1, \xi_2, \ldots, \xi_N\}$ of a random variable $\xi$ with probability density function $f(x)$ and support $[a, b]$. Next, we consider the use of Richardson extrapolation to improve the accuracy of the kernel estimator.

The basic kernel estimator may be written compactly as [7]

$$\hat f_h(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x - \xi_i}{h}\right) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - \xi_i),$$

where $K_h(t) = K(t/h)/h$. Note that

$$K_h(x, \xi_i) = \frac{1}{h} K\!\left(\frac{x - \xi_i}{h}\right),$$

where $\xi$ is a random variable with probability density function $f(x)$. Then

$$E[\hat f_h(x)] = E[K_h(x, \xi)]$$

and

$$\sigma_N^2 = \mathrm{Var}[\hat f_h(x)] = \frac{1}{N} \mathrm{Var}[K_h(x, \xi)].$$
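A minimal sketch of the basic kernel estimator (the Gaussian kernel is our own choice; the paper does not fix one):

```python
import numpy as np

# Kernel density estimator f_h with a Gaussian kernel K.
rng = np.random.default_rng(2)
xi = rng.normal(size=5_000)          # sample from f = standard normal

def kde(x, sample, h):
    """f_h(x) = (1/(N h)) * sum_i K((x - xi_i) / h)."""
    t = (x[:, None] - sample[None, :]) / h
    K = np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (len(sample) * h)

x = np.linspace(-5.0, 5.0, 501)
fh = kde(x, xi, h=0.3)

# The estimate is nonnegative and integrates to (approximately) one.
print(abs(np.sum(fh) * (x[1] - x[0]) - 1.0) < 0.01)
```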

The value of the mathematical expectation can be written as

$$E[K_h(x, \xi)] = \frac{1}{h} \int_{-\infty}^{\infty} K\!\left(\frac{x - t}{h}\right) f(t)\,dt = \int_{-\infty}^{\infty} K(\eta) f(x - h\eta)\,d\eta.$$

Notice that

$$f(x - h\eta) = f(x) - h f'(x)\eta + \frac{h^2}{2} f''(x)\eta^2 - \frac{h^3}{6} f^{(3)}(x)\eta^3 + O(h^4),$$

so

$$E[K_h(x, \xi)] = f(x)\int_{-\infty}^{\infty} K(\eta)\,d\eta - h f'(x)\int_{-\infty}^{\infty} \eta K(\eta)\,d\eta + \frac{h^2}{2} f''(x)\int_{-\infty}^{\infty} \eta^2 K(\eta)\,d\eta - \frac{h^3}{6} f^{(3)}(x)\int_{-\infty}^{\infty} \eta^3 K(\eta)\,d\eta + O(h^4).$$

Suppose that the kernel $K$ satisfies the requirements

$$\int_{-\infty}^{\infty} K(\eta)\,d\eta = 1, \quad \int_{-\infty}^{\infty} \eta K(\eta)\,d\eta = 0$$

and

$$\int_{-\infty}^{\infty} \eta^3 K(\eta)\,d\eta = 0.$$

Denote

$$\int_{-\infty}^{\infty} \eta^2 K(\eta)\,d\eta = \sigma^2.$$

Then

$$E[K_h(x, \xi)] = f(x) + \sigma^2 h^2 f''(x)/2 + O(h^4)$$

and

$$E[\hat f_h(x) - f(x)] = \sigma^2 h^2 f''(x)/2 + O(h^4).$$

Define $f_h(x)$ as

$$f_h(x) = E[\hat f_h(x)] = f(x) + \sigma^2 h^2 f''(x)/2 + O(h^4) \qquad (10)$$

and $f_{2h}(x)$ as

$$f_{2h}(x) = E[\hat f_{2h}(x)] = f(x) + 4 \sigma^2 h^2 f''(x)/2 + O(h^4). \qquad (11)$$

We now apply Richardson extrapolation to $f_h(x)$ and $f_{2h}(x)$ [9]: multiplying (11) by $1/4$ and subtracting the result from (10) excludes the term $\sigma^2 h^2 f''(x)/2$, so from (10) and (11) we get

$$f(x) = \frac{4}{3} f_h(x) - \frac{1}{3} f_{2h}(x) + O(h^4).$$

Thus we have constructed the approximation

$$\tilde f_h(x) = \frac{4}{3} \hat f_h(x) - \frac{1}{3} \hat f_{2h}(x) \qquad (12)$$

to the function $f(x)$ with accuracy $O(h^4)$.
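The extrapolated estimate (12) can be sketched as follows (sample size, bandwidth, and the Gaussian kernel are illustrative choices of ours):

```python
import numpy as np

# Extrapolated estimate (12): combine kernel estimates with bandwidths
# h and 2h to cancel the O(h^2) bias term of the basic estimator.
rng = np.random.default_rng(3)
xi = rng.normal(size=20_000)

def kde(x, sample, h):
    t = (x[:, None] - sample[None, :]) / h
    K = np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (len(sample) * h)

x = np.linspace(-3.0, 3.0, 301)
h = 0.4
f_tilde = (4.0 / 3.0) * kde(x, xi, h) - (1.0 / 3.0) * kde(x, xi, 2 * h)

# Compare against the true standard normal density: the extrapolated
# estimate has a smaller maximum error than the plain one at this h.
true = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)
err_plain = np.max(np.abs(kde(x, xi, h) - true))
err_extra = np.max(np.abs(f_tilde - true))
print(err_extra < err_plain)
```

With a deliberately large bandwidth the bias term dominates, so the cancellation in (12) is clearly visible; for very small bandwidths the statistical error would dominate instead.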

Thus, successively taking $x_i \in \omega$, we obtain the values $\tilde f_h(x_i) = f(x_i) + O(h^4)$ and the system of linear algebraic equations (8) for constructing a cubic spline. To improve the accuracy of reconstructing the probability density at a point we use this combination of kernel estimates with the parameters $h$ and $2h$.

As evidence we refer to Schweizer, who states that "distributions are the numbers of the future" [10]. Thus, instead of simplifying them, it seems better to propose methods which deal with distributions directly. In order to do this, one has to determine how to represent the observed distributions.


In our study we propose to represent them by a piecewise polynomial aggregation function, since it offers a good tradeoff between simplicity and accuracy.

6. Distributions Regression

Consider a linear model

$$Y = a_0 + \sum_{i=1}^{n} a_i X_i + \varepsilon,$$

where $X_i$, $i = 1, \ldots, n$, are independent predictor variables, $Y$ is a dependent variable, and $\varepsilon$ is an error. From the observed values $Y_j, X_{ij}$, after aggregation, the densities of $Y, X_i$ are represented by splines $S_Y$, $S_{X_i}$.

We seek the unknown parameters $a_i$, $i = 0, 1, \ldots, n$, by minimizing the functional

$$\Phi(a_0, a_1, \ldots, a_n) = \left\| S_Y - \left(a_0 + \sum_{i=1}^{n} a_i S_{X_i}\right) \right\|^2 \to \min.$$

By virtue of the independence of the $X_i$, numerical operations on density functions can be used to calculate the functional $\Phi(a_0, a_1, \ldots, a_n)$. Its minimization can be carried out by the method of steepest descent.

Numerical example. Let us consider the model problem

$$Y = a_0 + \sum_{i=1}^{n} a_i X_i + \varepsilon.$$

For the numerical realization, $X_1, X_2$ were generated as random variables with an Irwin-Hall distribution ($n = 3$), shifted by 1 and 2, respectively, and $\varepsilon$ has the probability density function $(|2x| - 1)^2 (2|2x| + 1)$ with support $[-0.5, 0.5]$. The variable $Y$ was constructed as $Y = X_1 + X_2 + \varepsilon$.

The minimization of the functional $\Phi(a_0, a_1, a_2)$ was carried out by the method of steepest descent. For $a_0 = -0.089$, $a_1 = 1.031$, $a_2 = 1.029$, the value of $\Phi(a_0, a_1, a_2)$ did not exceed $0.3 \cdot 10^{-3}$.
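The example above can be mimicked with a Monte Carlo stand-in for the NPA density arithmetic (our own sketch: sample-based histograms replace the spline-aggregated densities, and a uniform error replaces the bump density of $\varepsilon$). It confirms that the functional $\Phi$ is minimized near the true coefficients:

```python
import numpy as np

# Monte Carlo sketch of distributions regression: histograms on a fixed
# grid stand in for the aggregated densities, and the functional Phi is
# the discretized squared L2 distance between densities.
rng = np.random.default_rng(4)
N = 200_000

# X1, X2: Irwin-Hall sums of n = 3 uniforms, shifted by 1 and 2.
x1 = rng.uniform(size=(N, 3)).sum(axis=1) + 1.0
x2 = rng.uniform(size=(N, 3)).sum(axis=1) + 2.0
eps = rng.uniform(-0.5, 0.5, size=N)      # uniform stand-in for eps
y = x1 + x2 + eps

grid = np.linspace(0.0, 12.0, 241)

def density(sample):
    p, _ = np.histogram(sample, bins=grid, density=True)
    return p

def phi(a0, a1, a2):
    """Discretized functional Phi(a0, a1, a2)."""
    model = a0 + a1 * x1 + a2 * x2 + eps
    d = density(y) - density(model)
    return np.sum(d**2) * (grid[1] - grid[0])

# Phi vanishes at the true coefficients and grows away from them.
print(phi(0.0, 1.0, 1.0) < phi(0.0, 0.5, 1.5))
```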

Thus, a numerical example showed the possibility of using distributions regression.

7. Conclusion

Data transformation during the aggregation phase is an important direction in data analysis. Well-chosen data models at the aggregation stage allow one to determine the form of the input variables and to select the appropriate procedures and arithmetic for later modeling.

The use of regression modeling on the basis of piecewise polynomial models opens up new possibilities in forecasting problems of hydrology and remote sensing of the Earth, and in estimating the reliability of critical equipment.

References

[1] Dobronets B.S., Popova O.A. The numerical probabilistic approach to the processing and presentation of remote monitoring data. Journal of Siberian Federal University. Engineering & Technologies, 2016, 9(7), 960-971.

[2] Добронец Б.С., Попова О.А. Численный вероятностный анализ неопределенных данных. Красноярск, Сибирский федеральный университет, Институт космических и информационных технологий. 2014. 168 с. [Dobronets B.S., Popova O.A. Numerical probabilistic analysis of uncertain data. Krasnoyarsk, Siberian Federal University, Institute of Space and Information Technologies. 2014. 168 p. (in Russian)]

[3] Billard L., Diday E. Symbolic Data Analysis: Conceptual Statistics and Data Mining. Chichester, John Wiley & Sons, Ltd., 2006. 321 p.

[4] Dias S., Brito P. Linear Regression Model with Histogram-Valued Variables. Published online in Wiley Online Library (wileyonlinelibrary.com), 2015. DOI: 10.1002/sam.11260

[5] Koenker R. Quantile regression. Cambridge university press, 2005. 349 p.

[6] Dobronets B.S., Krantsevich A.M., Krantsevich N.M. Software implementation of numerical operations on random variables. Journal of Siberian Federal University. Mathematics & Physics, 2013, 6(2), 168-173.

[7] Scott D.W. Multivariate density estimation: theory, practice, and visualization. New York, John Wiley & Sons, 2015. 380 p.

[8] Ahlberg J.H., Nilson E.N., Walsh J.L. The theory of splines and their applications. New York, Academic Press, 1967. 284 p.

[9] Dobronets B.S., Popova O.A. Improving the accuracy of the probability density function estimation. Journal of Siberian Federal University. Mathematics and Physics, 2017, 10(1), 16-21.

[10] Schweizer B. Distributions Are the Numbers of the Future. Proceedings of The Mathematics of Fuzzy Systems Meeting, eds. A. di Nola and A. Ventre, Naples, University of Naples, 1984, 137-149.
