
2018, Т. 160, кн. 2, С. 410-418

УЧЕНЫЕ ЗАПИСКИ КАЗАНСКОГО УНИВЕРСИТЕТА. СЕРИЯ ФИЗИКО-МАТЕМАТИЧЕСКИЕ НАУКИ

ISSN 2541-7746 (Print) ISSN 2500-2198 (Online)

UDK 519.226.3

d-POSTERIOR APPROACH IN REGRESSION

A.A. Zaikin

Kazan Federal University, Kazan, 420008 Russia

Abstract

In this paper, we apply the d-posterior approach to regression. Regression predictions are a sequence of similarly made decisions, so the d-risk can be helpful for assessing the quality of such decisions. We introduce a method for applying the d-posterior approach in regression models. The method is based on the posterior predictive distribution of the dependent variable given a novel input of predictors. In order to make the d-risk of the prediction rule meaningful, we also consider adding a probability distribution of the novel input to the model.

The method is applied to two simple regression models. Firstly, linear regression with Gaussian white noise is considered. For the quadratic loss function, estimates with uniformly minimal d-risk are constructed. It turns out that the parameter estimate in this model coincides with the Bayesian estimate, while the prediction rule is slightly different. Secondly, regression for a binary dependent variable is investigated; in this case, the d-posterior approach is applied to the logit regression model. Since an estimate with uniformly minimal d-risk does not exist for the 0-1 loss function, we suggest a classification rule which minimizes the maximum of the two d-risks. The resulting decision rules for both models are compared with the usual Bayesian decisions and with decisions based on the maximum likelihood principle.

Keywords: Bayesian inference, regression, d-risk

Introduction

When solving the regression problem, the first step is always "training" the model using a finite sample. Only then are the estimates of the model parameters used to predict the dependent variable value for every new set of predictors. The prediction itself is a process of making a variety of similar decisions, which gives a reason to apply the d-posterior approach [1, 2] to control potential risks of such decisions. Since d-risk can be interpreted as expected loss for a particular prediction value, the d-posterior approach is a natural alternative to the maximum likelihood and Bayesian principles in solving this problem.

The d-posterior approach has not been applied to regression problems before. Admittedly, there is a regression technique [3] which controls the false discovery rate (FDR), a measure of decision quality related to the d-risk. However, the statistical model used in [3] is not the same as in the classical regression problem statement. Here, we discuss two classical regression models: linear regression with Gaussian white noise and a quadratic loss function, and logit linear regression with the zero-one loss function.

1. Regression

All vectors will be treated as column vectors in this paper. We will also denote transposition by a superscript $T$, so that $X^T Y$ is the scalar product of two real $m$-vectors $X \in \mathbb{R}^m$ and $Y \in \mathbb{R}^m$.

Let us fix positive integers $m$ and $n$, such that $n > m$, and suppose that we have a full-rank real matrix
$$X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,m} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,m} \end{pmatrix}.$$
This matrix defines the predictors for the regression. Let $X_i$ be the $i$-th row of the matrix $X$.

We assume that for every vector $X \in \mathbb{R}^m$ there exists some density $p(y\,|\,X, \theta)$ with respect to some measure, where $\theta$ is an unknown parameter from the parameter space $\Theta$. In this paper, we consider only the case of linear regression, where $\Theta = \mathbb{R}^m$ and $p(y\,|\,X, \theta) = f(y\,|\,X^T\theta)$ for some density $f$. Hereafter, any kind of probability density will be denoted by $p$; it will be clear from the context which density is meant.

The d-posterior approach is a branch of Bayesian statistics. Therefore, we need to define the prior distribution of the parameter $\theta$. We assume that $\theta$ is a sample value of a continuous random variable $\vartheta$ with the known density $p(\theta)$.

Let us suppose that we deal with independent observations $Y = (y_1, \ldots, y_n)^T$, which follow the distribution defined by $p(y_i\,|\,X_i, \theta)$. The likelihood function is $p(Y\,|\,X, \theta) = \prod_{i=1}^{n} p(y_i\,|\,X_i, \theta)$. The posterior density is
$$p(\theta\,|\,Y, X) = p(Y\,|\,X, \theta)\, p(\theta) \Big/ \int_\Theta p(Y\,|\,X, \theta)\, p(\theta)\, d\theta. \tag{1}$$

The first problem with regression is to find some estimate $\hat\theta$ of $\theta$, which can be used later to estimate the density $p(y^*\,|\,X^*, \hat\theta)$ of the predicted variable $y^*$ for a novel predictor input $X^*$. The usual maximum likelihood estimate (MLE) maximizes the likelihood $p(Y\,|\,X, \theta)$ with respect to $\theta$. Another option is to use the maximum a posteriori (MAP) estimate, which maximizes the posterior density instead (the same as maximizing $p(Y\,|\,X, \theta)\,p(\theta)$). Both options are very popular (MAP estimates are used in numerous regularization techniques, such as ridge regression, LASSO, etc.).

The less popular option is the Bayesian estimate. First, one needs to specify a loss function $L(\theta_1, \theta_2)$. The prior risk of an estimate $\hat\theta$ is then $R_{\hat\theta} = \mathbb{E}\, L(\vartheta, \hat\theta)$. The Bayesian estimate minimizes this risk with respect to $\hat\theta$. The corresponding estimate can be obtained more easily by minimizing the posterior risk
$$R(d\,|\,Y, X) = \mathbb{E}\left[L(\vartheta, d) \,|\, Y, X\right] = \int_\Theta L(\theta, d)\, p(\theta\,|\,Y, X)\, d\theta$$
with respect to $d$. In the case of the quadratic loss function, the Bayesian estimate is the posterior mean. Bayesian estimates are not widely used, because they usually require intensive computation. The d-risk of the estimate $\hat\theta$ is
$$R_{\hat\theta}(d) = \mathbb{E}\left[L(\vartheta, \hat\theta) \,\big|\, \hat\theta = d\right].$$

Since the d-risk is a function, there are different possible ways to define an estimate that "minimizes" its d-risk. For some statistical models, this minimization is trivial: there exists an estimate $\theta^*$ such that
$$R_{\theta^*}(d) \le R_{\hat\theta}(d) \tag{2}$$
for any $d$ and any estimate $\hat\theta$. The estimate $\theta^*$, which satisfies (2), is called an estimate with uniformly minimal d-risk (we will designate it as a U-estimate). There is a way to find this estimate [1]: one needs to minimize $R(d\,|\,Y, X)$ with respect to its random arguments (in the case of the above regression statement, it is $Y$). If for every $Y$ there exists at least one $d$ for which $R(d\,|\,Y, X)$ is minimized, then this $d$ (or any of them, if there are multiple solutions) is a U-estimate. As was said earlier, no such estimates have been used in the regression literature so far.

However, parameter estimation is not the best application for the d-posterior approach. The point is that the d-risk can be interpreted as the average loss for a particular decision value over a succession of experiments, whereas the parameter $\theta$ is estimated only once. Therefore, U-estimates can be viewed only as somewhat regularized MLEs, just like the Bayesian or MAP estimates. On the other hand, predictions from a single training sample can be made infinitely many times. This gives the opportunity to use the d-risk for assessing the quality of prediction rules. However, in order to make the most sense of the definition of d-risk, we need to make the predictors random. Indeed, in the case of a constant predictor vector $X^*$, the d-risk is the "average loss for a particular decision value among a succession of experiments with predictors equal to $X^*$". If $X^*$ is a random vector, then the d-risk is the "average loss for a particular decision value among a succession of experiments". Hence, we will present various possibilities for the distribution of $X^*$. Note that we formally do not need to specify the distribution of the predictors in the matrix $X$.

For a new set of predictors $X^*$, the predictive posterior distribution of the dependent variable $y^*$ is
$$p(y^*\,|\,X^*, Y, X) = \int_\Theta p(y^*\,|\,\theta, X^*)\, p(\theta\,|\,X, Y)\, d\theta. \tag{3}$$

The right-hand side of (3) is obtained using the fact that the distribution of $y^*$, given $X^*$ and $\theta$, does not depend on the training sample $Y$ and $X$. The same can be said about the distribution of $Y$, which does not depend on $X^*$ given $X$ and $\theta$. The posterior predictive of $y^*$ can be used in the same manner as the usual posterior distribution to construct Bayesian rules and rules which minimize the d-risk. In this case, $y^*$ plays the role of the "parameter", for which we can specify the loss function and the posterior risk.
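As a concrete illustration of (3) (our own sketch, not from the paper; the function names `posterior_predictive_density` and `gaussian_cond_density` and the placeholder posterior draws are purely illustrative), the integral is typically approximated by averaging the conditional density $p(y^*\,|\,\theta, X^*)$ over posterior draws of $\theta$:

```python
import numpy as np

def posterior_predictive_density(y_star, x_star, theta_samples, cond_density):
    """Monte Carlo approximation of formula (3):
    p(y* | X*, Y, X) ~= (1/M) * sum_j p(y* | theta_j, X*),
    where theta_j are draws from the posterior p(theta | Y, X)."""
    values = np.array([cond_density(y_star, x_star, theta) for theta in theta_samples])
    return float(values.mean())

# Example conditional density for a linear-Gaussian model: p(y | X, theta) = N(X^T theta, sigma^2)
sigma = 1.0
def gaussian_cond_density(y, x, theta):
    mean = x @ theta
    return np.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
theta_samples = rng.normal(size=(1000, 2))   # placeholder posterior draws of theta
x_star = np.array([1.0, 0.5])                # novel predictor input X*
print(posterior_predictive_density(0.3, x_star, theta_samples, gaussian_cond_density))
```

In a real application the draws of $\theta$ would come from a sampler or, as in the next section, the posterior predictive would be available in closed form.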

2. Linear regression with Gaussian noise

This section focuses on the following model:
$$y_i = X_i^T\theta + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n, \tag{4}$$

where the $\varepsilon_i$ are mutually independent. We assume the variance $\sigma^2$ of the white noise to be known; we will discuss this matter later. The likelihood function of this model is

$$p(Y\,|\,X, \theta) = \varphi(Y\,|\,X\theta, \sigma^2 I_n),$$

where $\varphi(\,\cdot\,|\,\mu, A)$ is the density function of the multivariate normal distribution with mean vector $\mu$ and covariance matrix $A$, and $I_d$ is the $d \times d$ identity matrix. The corresponding cumulative distribution function will be denoted by $\Phi(\,\cdot\,|\,\mu, A)$. For the sake of simplicity, the PDF and CDF of the standard univariate normal distribution (with mean zero and variance equal to one) will be denoted by $\varphi(x)$ and $\Phi(x)$, respectively.

For this model, the MLE of $\theta$ is well known:
$$\hat\theta_{ML} = (X^T X)^{-1} X^T Y.$$

If we try to apply the maximum likelihood principle to prediction, we obtain the usual prediction scheme, which involves the MLE of $\theta$. Indeed, in order to maximize $p(y^*, Y\,|\,\theta, X, X^*)$ with respect to the unknown parameters, we need to maximize it with respect to $\theta$ and $y^*$. Since the mode of a normal distribution is its mean, the MLE of $y^*$ is $X^{*T}\hat\theta_{ML}$. Note that here we do not need to estimate $\sigma^2$ for prediction. For the Bayesian analysis, we need to specify the prior:

$$p(\theta) = \varphi(\theta\,|\,0, \tau^2 I_m).$$

The parameter $\tau$ is known. The posterior density of $\theta$ is
$$p(\theta\,|\,Y, X) = \varphi\!\left(\theta \,\Big|\, \frac{S X^T Y}{\sigma^2},\; S\right), \qquad S = \left(\frac{I_m}{\tau^2} + \frac{X^T X}{\sigma^2}\right)^{-1}.$$

This expression can be derived from the convolution theorem. The Bayesian estimate of $\theta$ for the quadratic loss function $L(\theta, d) = \|\theta - d\|^2$ is the posterior mean:
$$\hat\theta_B = \frac{S X^T Y}{\sigma^2}.$$
Note that this expression depends on $\sigma$ and $\tau$.
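As a quick numerical illustration (our own sketch, not part of the paper; the simulated data and names are assumptions), the posterior quantities above can be computed directly from $X$, $Y$ and the known hyperparameters $\sigma$ and $\tau$:

```python
import numpy as np

def gaussian_posterior(X, Y, sigma, tau):
    """Posterior of theta in model (4) with the prior N(0, tau^2 I_m):
    S = (I_m / tau^2 + X^T X / sigma^2)^(-1),  theta_B = S X^T Y / sigma^2."""
    m = X.shape[1]
    S = np.linalg.inv(np.eye(m) / tau**2 + X.T @ X / sigma**2)
    theta_B = S @ X.T @ Y / sigma**2
    return theta_B, S

# Simulated data, only to exercise the formulas
rng = np.random.default_rng(1)
m, n, sigma, tau = 3, 50, 1.0, 2.0
theta_true = rng.normal(scale=tau, size=m)
X = rng.normal(size=(n, m))
Y = X @ theta_true + rng.normal(scale=sigma, size=n)

theta_B, S = gaussian_posterior(X, Y, sigma, tau)
print("posterior mean:", theta_B)
```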

In the same Bayesian setting, the posterior predictive for a new set of the predictor variables $X^*$ is
$$p(y^*\,|\,X^*, Y, X) = \varphi\!\left(y^* \,\Big|\, \frac{X^{*T} S X^T Y}{\sigma^2},\; X^{*T} S X^* + \sigma^2\right). \tag{5}$$

This can be derived from the direct representation of the model (4), because a linear transformation of a normal distribution is again normal. If the posterior predictive is treated as a posterior distribution, then for the quadratic loss function $L(y^*, d) = (y^* - d)^2$ the Bayesian estimate for $y^*$ is $X^{*T}\hat\theta_B$.

Now, let us consider the d-posterior approach to the estimation of $\theta$. For the model studied in this section, we can find the U-estimate. Indeed, the posterior risk can be expressed as
$$R(d\,|\,Y, X) = \mathbb{E}\left[\|\vartheta - d\|^2 \,\big|\, Y, X\right] = \mathbb{E}\left[\|\vartheta - \hat\theta_B\|^2 \,\big|\, Y, X\right] + \|\hat\theta_B - d\|^2 = \operatorname{trace}(S) + \|\hat\theta_B - d\|^2.$$
We used the fact that $\mathbb{E}\left[\vartheta \,|\, Y, X\right] = \hat\theta_B$. Since the only term which depends on $Y$ is the term with $\hat\theta_B$, the minimization with respect to $Y$ is straightforward, and the U-estimate of $\theta$ is equal to the Bayesian estimate $\hat\theta_B$.

Now, we want to find the U-estimate for $y^*$. As was said in the previous section, in order to make the use of the d-risk meaningful, we need to consider $X^*$ random. Surprisingly, we do not need to specify its distribution, provided that it does not depend on $\theta$. In that case, the posterior predictive distribution does not depend on the distribution of $X^*$, and it is given by formula (5). On a side note, if we consider $X$ to be random with a distribution which does not depend on $\theta$, then the posterior predictive does not change either.

In order to find the U-estimate of $y^*$ for the quadratic loss function, we need to minimize
$$R(d\,|\,X^*, Y, X) = \mathbb{E}\left[(y^* - d)^2 \,\big|\, X^*, Y, X\right]$$
with respect to $X^*$ and $Y$. Unfortunately, it seems that this optimization problem has no invertible solution. However, if we consider only $Y$ to be random, the U-estimate of $y^*$ is equal to the Bayesian estimate. Considering only $X^*$ to be random gives a completely different result. Indeed, the posterior risk of the decision $d$ can be expressed as
$$R(d\,|\,X^*, Y, X) = \mathbb{E}\left[\left(y^* - X^{*T}\hat\theta_B\right)^2 \,\Big|\, X^*, Y, X\right] + \left(X^{*T}\hat\theta_B - d\right)^2 = X^{*T} S X^* + \sigma^2 + \left(X^{*T}\hat\theta_B - d\right)^2.$$
Differentiating with respect to $X^*$ yields
$$R'(d\,|\,X^*, Y, X) = 2 S X^* + 2\hat\theta_B\left(X^{*T}\hat\theta_B - d\right).$$
Now, we need to solve the equation $R'(d\,|\,X^*, Y, X) = 0$. By performing some transformations, we can get a solution for $d$:
$$2 S X^* + 2\hat\theta_B\left(X^{*T}\hat\theta_B - d\right) = 0,$$
$$\frac{\hat\theta_B^T S X^*}{\hat\theta_B^T \hat\theta_B} + X^{*T}\hat\theta_B - d = 0,$$
$$d = \frac{\hat\theta_B^T S X^*}{\hat\theta_B^T \hat\theta_B} + X^{*T}\hat\theta_B = X^{*T}\frac{S \hat\theta_B}{\hat\theta_B^T \hat\theta_B} + X^{*T}\hat\theta_B.$$
The right-hand side of the last expression is obtained by transposing the first term (the matrix $S$ is symmetric). This yields the linear dependence of the U-estimate for $y^*$ on $X^*$: $\hat y^* = X^{*T}\hat\theta_U$, where
$$\hat\theta_U = \hat\theta_B + \frac{S \hat\theta_B}{\hat\theta_B^T \hat\theta_B}.$$
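The following self-contained Python sketch (again ours, with simulated data; not the authors' code) recomputes $\hat\theta_B$ and $S$ as above, forms $\hat\theta_U$ by the formula just derived, and compares the two predictions at a novel input:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, sigma, tau = 3, 50, 1.0, 2.0
X = rng.normal(size=(n, m))
Y = X @ rng.normal(scale=tau, size=m) + rng.normal(scale=sigma, size=n)

S = np.linalg.inv(np.eye(m) / tau**2 + X.T @ X / sigma**2)
theta_B = S @ X.T @ Y / sigma**2                        # Bayesian estimate
theta_U = theta_B + S @ theta_B / (theta_B @ theta_B)   # U-estimate used for prediction

x_star = rng.normal(size=m)                             # novel predictor input X*
print("Bayesian prediction:   ", x_star @ theta_B)
print("U-estimate prediction: ", x_star @ theta_U)
```

The two rules differ only by the additive correction $S\hat\theta_B / (\hat\theta_B^T\hat\theta_B)$, which becomes negligible when the posterior covariance $S$ is small, i.e., for large samples.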

The fact that $Y$ is fixed does not significantly affect the interpretation. For fixed $Y$, the d-risk is the "average loss for a particular decision value among a succession of experiments, given the training set $Y$". One can argue that this interpretation is even more natural and useful than the interpretation of the d-risk for random $Y$.

The problematic part here is the fact that the hyperparameters $\sigma$ and $\tau$ need to be known. In the case of unknown hyperparameters, the ML-II procedure (maximum likelihood estimation of the hyperparameters) is popular. Another approach is to specify prior distributions for $\sigma$ and $\tau$ and use the marginal likelihood in all derivations. This approach is useful because marginal likelihoods obtained in that way are usually not very sensitive to changes in the parameters of the second-level prior distributions. Unfortunately, none of these techniques yields results which can be expressed by explicit formulas. It is also very difficult to assess their impact on the posterior distribution and the U-estimates.


3. Logit linear regression

In this section, we assume that the dependent variables take only two values (0 and 1) and follow the logit linear regression model. The density of a single observation is
$$P(y_i = 1\,|\,\theta, X_i) = \sigma\!\left(X_i^T\theta\right),$$
where the sigmoid function is given by
$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

The likelihood function in this case is given by
$$p(Y\,|\,\theta, X) = \prod_{i=1}^{n} \sigma\!\left((2y_i - 1)\,X_i^T\theta\right).$$

The prior of $\theta$ is assumed to be Gaussian: $p(\theta) = \varphi(\theta\,|\,0, \tau^2 I_m)$. Here, we also consider $X^*$ to be random, with a distribution that does not depend on $\theta$. One such possibility is to consider $X^*$ Gaussian with mean vector $0$ and covariance matrix $A$.

Unfortunately, the integrals and optimization problems arising here are analytically intractable. Binary linear regression is usually fitted by numerical methods, regardless of the estimation method (MLE, MAP, or Bayesian estimates).
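As a small aside (our own sketch, not from the paper; function names and the simulated data are illustrative), this is what such a numerical fit looks like for the MAP estimate under the Gaussian prior: the negative log-posterior of the logit model is minimized with a generic optimizer.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_posterior(theta, X, y, tau):
    """Minus log of p(Y | theta, X) p(theta) for the logit model with the N(0, tau^2 I) prior."""
    eta = X @ theta
    # log sigma((2y - 1) * eta), written in a numerically safe form
    log_lik = -np.sum(np.logaddexp(0.0, -(2 * y - 1) * eta))
    log_prior = -0.5 * np.sum(theta**2) / tau**2
    return -(log_lik + log_prior)

# Simulated data, only to exercise the optimizer
rng = np.random.default_rng(3)
n, m, tau = 100, 2, 2.0
X = rng.normal(size=(n, m))
y = (rng.random(n) < sigmoid(X @ rng.normal(scale=tau, size=m))).astype(int)

map_fit = minimize(neg_log_posterior, x0=np.zeros(m), args=(X, y, tau))
print("MAP estimate:", map_fit.x)
```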

The posterior density of $\theta$ is given by (1). The posterior predictive of a new set of predictors $X^*$ is given by (3). Now, let us suppose that the loss function is given by
$$L(y^*, d) = \begin{cases} 1, & y^* \ne d, \\ 0, & y^* = d. \end{cases}$$

The Bayesian prediction rule for $L$ predicts the value $k$ that has the maximum posterior predictive probability $P(y^* = k\,|\,X^*, Y, X)$.

The next step is to predict the value of $y^*$ given the new input $X^*$. For this purpose, we define a classification rule $\varphi = \varphi(X^*, Y, X)$, so that the values of $\varphi$ correspond to predicted values of $y^*$. The d-risks of the classification rule $\varphi$ for the 0-1 loss function are given by
$$R_\varphi(0) = P(y^* = 1\,|\,\varphi = 0), \qquad R_\varphi(1) = P(y^* = 0\,|\,\varphi = 1).$$
These formulas can be expressed as

$$R_\varphi(k) = \int_\Theta \int_{\mathbb{R}^m} P(y^* = 1 - k\,|\,\theta, X^*)\, p(\theta, X^*\,|\,\varphi = k)\, dX^*\, d\theta,$$
where
$$p(\theta, X^*\,|\,\varphi = k) = P(\varphi = k\,|\,\theta, X^*)\, p(\theta)\, p(X^*) \Big/ \int_\Theta \int_{\mathbb{R}^m} P(\varphi = k\,|\,\theta, X^*)\, p(\theta)\, p(X^*)\, dX^*\, d\theta.$$
This expression for the d-risk is convenient, because we usually know the distribution $P(\varphi = k\,|\,\theta, X^*)$.

In the case of finite decision spaces, U-estimates do not exist in most cases [1], so we need to use a different definition of the optimal decision rule. Binary classification can be perceived as a problem of comparing two hypotheses. There exists [4] a rule $\varphi^*$ such that, for a given level $\beta_0$, $0 < \beta_0 < 1$, we have $R_{\varphi^*}(0) \le \beta_0$, and $R_{\varphi^*}(1)$ is minimal among all the rules $\varphi$ that satisfy $R_\varphi(0) \le \beta_0$. The most important thing here is that such a rule (in the setting of the prediction problem of the current section) has the following form:
$$\varphi^* = \begin{cases} 1, & T > C, \\ 0, & T \le C, \end{cases}$$
for some constant $C$, where $T = P(y^* = 1\,|\,X^*, Y, X)$. Note that for the Bayesian prediction rule $C = 0.5$.

Of course, one can fix $\beta_0$ and numerically find $C$ for which $R_{\varphi^*}(0) \le \beta_0$. This is appropriate when one decision is more important than the other. However, researchers usually do not distinguish between the values of the dependent variable, and the most natural way to choose $C$ is to require $R_{\varphi^*}(0) = R_{\varphi^*}(1)$. This is possible when the d-risks are small enough or, equivalently, when $n$ is large; the latter is due to the fact that d-risks tend to 0 as $n \to \infty$, see [5].
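A minimal sketch of this choice of $C$ (ours; `estimate_d_risks` is a hypothetical helper that returns Monte Carlo estimates of $R_{\varphi^*}(0)$ and $R_{\varphi^*}(1)$ for a given threshold, e.g., obtained by a simulation like the one at the end of this section): since raising $C$ makes the rule output 1 less often, $R_{\varphi^*}(0)$ grows while $R_{\varphi^*}(1)$ shrinks, so their difference is monotone in $C$ and a simple bisection applies, provided the estimates are stable enough.

```python
def equalizing_threshold(estimate_d_risks, lo=0.0, hi=1.0, tol=1e-3):
    """Find C with R(0) = R(1) by bisection on the monotone difference R(0) - R(1).
    `estimate_d_risks(C)` must return a pair (R0, R1) of d-risk estimates."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        r0, r1 = estimate_d_risks(mid)
        if r0 > r1:      # R(0) too large -> lower the threshold
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0
```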

Note that we can calculate the d-risks of $\varphi$ for a fixed $X^*$ using
$$R_\varphi(k\,|\,X^*) = P(y^* = 1 - k\,|\,\varphi = k, X^*), \quad k = 0, 1. \tag{6}$$

In order to get $R_\varphi(k)$, one needs to calculate the expectation of the last expression with respect to $X^*$.

Fig. 1. Sample d-risks (6) of $\varphi^*$ for $C = 0.4$ and $C = 0.5$. The solid line is $R_{\varphi^*}(0)$ and the dashed line is $R_{\varphi^*}(1)$

The expression (6) is very interesting as it shows which values of $X^*$ are the most risky for decision making. In order to see that, we set up a numerical example. We consider a simple regression model of the form
$$P(y_i = 1\,|\,\theta, X_i) = \sigma(\theta_0 + x_i\theta_1),$$
where $x_i$ is a scalar value. Let $(\theta_0, \theta_1)^T \sim N(0, 4 I_2)$, $x_i, x^* \sim N(0, 9)$, $i = 1, \ldots, n$, $n = 25$. In a numerical Monte Carlo experiment, we calculated the frequencies of errors and the conditional d-risks (6). The results for $\varphi^*$ with $C = 0.4$ and $C = 0.5$ are shown in Fig. 1. Due to the symmetry of the distributions of all variables, the d-risk is expected to have a symmetric form as well. The values of $x^*$ with the largest d-risk are those closest to zero. This is also expected, because for such $x^*$ the variable $y^*$ takes the values 0 and 1 with probability close to 0.5.
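The following Python sketch reproduces an experiment of this kind (our reconstruction under stated assumptions, not the authors' code): the two-dimensional posterior of $(\theta_0, \theta_1)$ is approximated on a grid, $T = P(y^* = 1\,|\,x^*, Y, X)$ is computed from it, and the conditional d-risks (6) are estimated as error frequencies within bins of $x^*$; the grid range, the bin width, and the number of replications are our choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_rep, C = 25, 2000, 0.4          # training size, Monte Carlo replications, threshold
prior_sd, x_sd = 2.0, 3.0            # (theta_0, theta_1)^T ~ N(0, 4 I_2), x_i, x* ~ N(0, 9)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Grid approximation of the posterior over theta = (theta_0, theta_1)
grid = np.linspace(-8.0, 8.0, 81)
T0, T1 = np.meshgrid(grid, grid, indexing="ij")
log_prior = -0.5 * (T0**2 + T1**2) / prior_sd**2

def predictive_prob(x_train, y_train, x_star):
    """T = P(y* = 1 | x*, training data), computed via the grid posterior."""
    eta = T0[..., None] + T1[..., None] * x_train            # shape (81, 81, n)
    log_lik = np.sum(np.where(y_train == 1,
                              np.log(sigmoid(eta)),
                              np.log(sigmoid(-eta))), axis=-1)
    log_post = log_prior + log_lik
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    return float(np.sum(w * sigmoid(T0 + T1 * x_star)))

x_all, y_all, phi_all = [], [], []
for _ in range(n_rep):
    theta = rng.normal(scale=prior_sd, size=2)               # parameter drawn from the prior
    x = rng.normal(scale=x_sd, size=n)
    y = (rng.random(n) < sigmoid(theta[0] + theta[1] * x)).astype(int)
    x_star = rng.normal(scale=x_sd)                           # novel predictor input
    y_star = int(rng.random() < sigmoid(theta[0] + theta[1] * x_star))
    phi = int(predictive_prob(x, y, x_star) > C)              # the rule phi* with threshold C
    x_all.append(x_star); y_all.append(y_star); phi_all.append(phi)

x_all, y_all, phi_all = map(np.array, (x_all, y_all, phi_all))

# Conditional d-risks (6), estimated as error frequencies within bins of x*
for lo, hi in zip(np.arange(-6, 6), np.arange(-5, 7)):
    sel = (x_all >= lo) & (x_all < hi)
    for k in (0, 1):
        cell = sel & (phi_all == k)
        if cell.any():
            risk = float(np.mean(y_all[cell] == 1 - k))
            print(f"x* in [{lo:+d},{hi:+d}), decision {k}: d-risk ~ {risk:.2f}")
```

Plotting these estimates against the bin centres gives curves of the kind shown in Fig. 1.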

Conclusions

We studied ways to construct prediction rules which minimize the d-risk. For linear regression with Gaussian noise, we constructed the U-estimate of the regression parameter and the U-estimate for the prediction of $y^*$ at a new observation $X^*$, with random $X^*$ and fixed $Y$. For linear logit regression, we suggested a prediction rule which minimizes the maximum of the two d-risks.

Acknowledgements. This work was funded by the subsidy allocated to Kazan Federal University for the state assignment in the sphere of scientific activities (project no. 1.7629.2017/8.9) (for Gaussian regression). The study was also supported by the Russian Foundation for Basic Research and the Republic of Tatarstan according to the research project no. 17-41-160620 (for logit regression).

The work is performed according to the Russian Government Program of Competitive Growth of Kazan Federal University.

References

1. Volodin I.N., Simushkin S.V. On d-posteriori approach to the problem of statistical inference. Proc. 3rd Int. Vilnius Conf. on Probability Theory and Mathematical Statistics, 1981, vol. 1, pp. 100-101.

2. Volodin I.N., Simushkin S.V. Statistical inference with minimal d-risk. J. Sov. Math., 1988, vol. 42, no. 1, pp. 1464-1472. doi: 10.1007/BF01098858.

3. Scott J.G., Kelly R.C., Smith M.A., Zhou P., Kass R.E. False discovery rate regression: An application to neural synchrony detection in primary visual cortex. J. Am. Stat. Assoc., 2015, vol. 110, no. 510, pp. 459-471. doi: 10.1080/01621459.2014.990973.

4. Simushkin S.V. Optimal d-guarantee procedures for distinguishing two hypothesis. VINITI Acad. Sci. USSR, 1981, no. 55, pp. 47-81. (In Russian)

5. Volodin I.N., Novikov A.A. Asymptotics of the necessary sample size in testing parametric hypotheses: d-posterior approach. Math. Methods Stat., 1998, vol. 7, no. 1, pp. 111-121.

Received October 12, 2017

Zaikin Artyom Alexandrovich, Assistant of the Department of Mathematical Statistics, Kazan Federal University

ul. Kremlevskaya, 18, Kazan, 420008 Russia E-mail: [email protected]

УДК 519.226.3

d-Апостериорный подход в регрессии

А.А. Заикин

Казанский (Приволжский) федеральный университет, г. Казань, 420008, Россия

Аннотация

В статье представлена попытка применить d-апостериорный подход в регрессии. Так как регрессионные прогнозы являются по сути последовательностью схожих решений, это даёт возможность использования d-риска как меры качества прогнозирования. В работе изучаются различные подходы к применению d-апостериорного подхода для прогноза в регрессионных моделях. Предлагается подход, основанный на апостериорном прогностическом распределении зависимой переменной в зависимости от значений переменных-предикторов. Для того чтобы интерпретация d-риска правила прогноза имела смысл, предлагается добавить в вероятностную модель распределение предикторов.

Эта методика была применена на двух простых регрессионных моделях. Сначала изучается линейная регрессия с гауссовским белым шумом. Для этой модели и для квадратической функции потерь были построены оценки с равномерно минимальным d-риском. Оказалось, что оценка параметра совпадает с байесовской оценкой, а прогноз несколько отличается. Далее рассматривается логистическая регрессия для бинарной зависимой переменной. Для функции потерь 1-0 не существует правила прогноза, равномерно минимизирующего d-риск, поэтому предлагается правило, которое минимизирует максимум двух d-рисков. Полученные для обеих моделей правила сравниваются с известными решающими функциями, построенными согласно байесовскому принципу и принципу максимального правдоподобия.

Ключевые слова: байесовская статистика, регрессия, d-риск

Поступила в редакцию 12.10.17

Заикин Артём Александрович, ассистент кафедры математической статистики Казанский (Приволжский) федеральный университет

ул. Кремлевская, д. 18, г. Казань, 420008, Россия E-mail: [email protected]

For citation: Zaikin A.A. d-Posterior approach in regression. Uchenye Zapiski Kazanskogo Universiteta. Seriya Fiziko-Matematicheskie Nauki, 2018, vol. 160, no. 2, pp. 410-418.

Для цитирования: Zaikin A.A. d-Posterior approach in regression // Учен. зап. Казан. ун-та. Сер. Физ.-матем. науки. - 2018. - Т. 160, кн. 2. - С. 410-418.
