UDC 519.23

ZERO-INFLATED BOOSTED ENSEMBLE FOR SMALL COUNT PROBLEM

© Borisov A.E., Tuv E.V.

Intel Corporation, Nizhny Novgorod, Russia

e-mail: alexander.borisov@intel.com, tuv.eugene@intel.com

Abstract. The article introduces a new approach for modeling "small count data", where the distribution of the response variable is assumed to follow the zero-inflated Poisson (ZIP) model. A ZIP model based on a boosted ensemble is introduced; it combines and extends the ZIP tree and gradient boosting tree (GBT) methods. Our algorithm, called ZIP-GBT, is first introduced from a theoretical perspective in the framework of Friedman's gradient boosting machine. Then it is compared empirically, on two real data sets and two artificial data sets, with the single-tree approach (ZIP tree). It is shown that ZIP-GBT outperforms the ZIP tree in most cases, both in terms of cross-validated ZIP likelihood and prediction of the ZIP distribution parameters.

Introduction

The analysis of count data is of primary interest in many areas, including public health, epidemiology, sociology, psychology, engineering, and agriculture. The Poisson distribution is typically assumed to model the distribution of rare-event counts. The Poisson regression model is commonly used to explain the relationship between the count (non-negative integer) response and the input variables (predictors). However, it is often the case that the outcome of interest contains an excess number of zeros which cannot be explained correctly by the standard Poisson model.

Lambert [1] proposed a mixture of a distribution with a point mass at zero and a Poisson distribution, called zero-inflated Poisson (ZIP) regression, to handle zero-inflated count data on the number of defects in a manufacturing process. After Lambert [1] introduced the ZIP model, many extended or modified ZIP models were elaborated. For example, Wang [18] proposed Markov zero-inflated Poisson regression (MZIP), Li et al. [3] introduced multivariate ZIP models, Lee and Jin [2] proposed a tree-based approach for Poisson regression, Chiogna and Gaetan [11] used a semi-parametric ZIP in animal abundance studies, Hsu [13] proposed a weighted ZIP, and Famoye and Singh [12] used the zero-inflated generalized Poisson (ZIGP) regression model for over-dispersed count data. ZIP regression is applied not only in manufacturing, but also in many other areas such as public health, epidemiology, sociology, psychology, engineering, and agriculture ([17], [16], [14], [10], [19]).

In data mining, tree-based models are among the most popular and common methods for approximating target functions: a function is learned by splitting the data set into subsets based on a test on an input attribute value. This process is repeated on each derived subset in a recursive manner and is represented by a tree model. Each terminal node is assigned a response value. A popular method for tree-based regression and classification is CART (Classification and Regression Trees) [9, 4]. In 2006, Lee and Jin [2] introduced the ZIP-tree model. They modified the CART splitting criterion by using the zero-inflated Poisson (ZIP) likelihood error function instead of the residual sum of squares. Each terminal node of a ZIP tree is assigned its own ZIP distribution parameters (the zero-inflation probability $p$ and the Poisson distribution parameter $\lambda$).

Further development of the idea of using trees for ZIP regression leads to using a tree ensemble instead of a single tree. Ensemble methods are very popular in the literature and widely used in practice, especially parallel ensembles (Random Forest, or RF, see [7, 8]) and boosted tree ensembles (like AdaBoost or GBT [20, 21]). Tree ensembles are shown to have smaller prediction error (bias) than a single tree; parallel ensembles (RF) also offer more stability (smaller variance).

In this paper, we propose a boosted ensemble approach similar to GBT that fits the ZIP distribution parameters $p, \lambda$ using two tree ensembles. The algorithm minimizes the negative ZIP log-likelihood loss function by a gradient descent method similar to the one proposed by Friedman for multi-class logistic regression (MCLRT) [21]. Our algorithm uses the log-link function for $\lambda$ and the logit-link function for $p$, as proposed in [1] for standard ZIP regression.

1. Previous work: ZIP regression and ZIP tree

Lambert [1] used the ZIP distribution for the response variable $y$, where the Poisson distribution parameters depend on the values of the input variables:

$$
y_i \sim \begin{cases} 0, & \text{with probability } p_i, \\ \mathrm{Poisson}(\lambda_i), & \text{with probability } 1 - p_i, \end{cases} \qquad i = 1 \dots n,
$$

where $n$ is the number of samples. This model implies that

$$
P(y_i = 0) = p_i + (1 - p_i)e^{-\lambda_i}, \qquad P(y_i = k) = (1 - p_i)\,\frac{e^{-\lambda_i}\lambda_i^{k}}{k!}, \quad k = 1, 2, \dots
$$

The parameters $\lambda_i, p_i$ are obtained from linear combinations of the inputs via log- and logit-link functions:

$$
\log(\lambda_i) = x_i^{T}\beta, \qquad \mathrm{logit}(p_i) = \log\frac{p_i}{1 - p_i} = x_i^{T}\gamma.
$$

Here $x_i$ is the input feature vector (for simplicity of notation we always assume that a "dummy" variable $x_{i0} = 1$ is added as the first input variable to take the intercept term into account), and $\beta, \gamma$ are vectors of coefficients (the same for each data point) to be fitted. The ZIP model is usually fitted using maximum likelihood estimation.

The log-likelihood can be maximized using the Newton-Raphson method, but usually the EM algorithm is used because it is more robust and computationally simpler. Lambert [1] studies the behavior of the algorithm on AT&T Bell Labs soldering data. Another article [6] applies the same ZIP regression model to DMFT (decayed, missing and filled teeth) data, using a piecewise constant model for the $p$ parameter. Both articles consider mixture models (the most popular being a mixture of Poisson and negative binomial distributions), but both claim that such models are more difficult to fit and usually provide worse predictions than ZIP models.
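To make the ZIP model concrete, here is a minimal Python sketch (ours, not from [1]) that draws samples from a ZIP distribution and evaluates the negative ZIP log-likelihood for given parameter arrays; the function names are illustrative only.

```python
import numpy as np
from scipy.special import gammaln

def zip_sample(p, lam, size, seed=None):
    """Draw from a zero-inflated Poisson: 0 with probability p, Poisson(lam) otherwise."""
    rng = np.random.default_rng(seed)
    structural_zero = rng.random(size) < p
    return np.where(structural_zero, 0, rng.poisson(lam, size))

def zip_neg_loglik(y, p, lam):
    """Negative ZIP log-likelihood; p and lam may be scalars or per-sample arrays."""
    y = np.asarray(y)
    p = np.broadcast_to(p, y.shape)
    lam = np.broadcast_to(lam, y.shape)
    zero = (y == 0)
    ll_zero = np.log(p[zero] + (1.0 - p[zero]) * np.exp(-lam[zero]))
    ll_pos = (np.log(1.0 - p[~zero]) - lam[~zero]
              + y[~zero] * np.log(lam[~zero]) - gammaln(y[~zero] + 1.0))
    return -(ll_zero.sum() + ll_pos.sum())
```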

In 2006, Lee and Jin [2] used the ZIP likelihood as a new splitting criterion for a decision tree. They modified the CART (classification and regression tree) algorithm by using the negative ZIP likelihood as the impurity measure in a node. The negative ZIP likelihood of the data in node $T$ can be expressed as

$$
L_{ZIP}(T) = L_{ZIP}(p, \lambda, y) = -n_0\log\left(p + (1 - p)e^{-\lambda}\right) - (n - n_0)\left(\log(1 - p) - \lambda\right) - \sum_{x_i \in T,\, y_i > 0} y_i\log\lambda + \sum_{x_i \in T,\, y_i > 0}\log(y_i!),
$$

where $p, \lambda$ are estimates of the ZIP distribution parameters in node $T$, $n$ is the number of samples in $T$, and $n_0$ is the number of zero-count samples in $T$.

The new splitting criterion is based on the difference between the ZIP likelihood in the parent node and the ZIP likelihoods in the left and right child nodes. The expression for the split weight can be written as $\phi(s, T) = L_{ZIP}(T) - L_{ZIP}(T_L) - L_{ZIP}(T_R)$, where $T$ is the parent node, $T_L, T_R$ are the left and right children of $T$, and $s$ is the split in node $T$. The same best-split search strategy can be applied as in CART.

The parameter $\lambda$ for a tree node is estimated using the zero-truncated Poisson distribution:

$$
\frac{\lambda}{1 - e^{-\lambda}} = \bar y = \mathrm{mean}(y_i \mid y_i > 0,\ x_i \in T).
$$

After the $\lambda$ parameter is obtained, $p$ can be estimated from the known proportion of zero-count samples in the node:

$$
p = \frac{n_0/n - e^{-\lambda}}{1 - e^{-\lambda}},
$$

where $n_0$ is the number of zero-count samples and $n$ is the total number of samples in the node.
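The node-level estimates and the split weight above can be sketched in a few lines of Python (our illustration; Lee and Jin [2] do not publish code, and the numerical root-finding step for $\lambda$ is our own choice):

```python
import numpy as np
from scipy.optimize import brentq

def node_zip_params(y):
    """Estimate (p, lambda) in a node; assumes the node holds both zero and positive counts."""
    y = np.asarray(y)
    n, n0 = len(y), int(np.sum(y == 0))
    ybar = y[y > 0].mean()
    # Zero-truncated Poisson mean equation: lambda / (1 - exp(-lambda)) = ybar.
    if ybar <= 1.0 + 1e-6:
        lam = 1e-6
    else:
        lam = brentq(lambda l: l / (1.0 - np.exp(-l)) - ybar, 1e-6, 10.0 * ybar)
    # Zero fraction: n0/n = p + (1 - p) * exp(-lambda).
    p = (n0 / n - np.exp(-lam)) / (1.0 - np.exp(-lam))
    return float(np.clip(p, 0.0, 1.0)), lam

def node_neg_zip_loglik(y):
    """Negative ZIP likelihood of a node, with the constant sum of log(y_i!) dropped."""
    p, lam = node_zip_params(y)
    y = np.asarray(y)
    n, n0 = len(y), int(np.sum(y == 0))
    return (-n0 * np.log(p + (1.0 - p) * np.exp(-lam))
            - (n - n0) * (np.log(1.0 - p) - lam)
            - np.sum(y[y > 0] * np.log(lam)))

def split_weight(y_parent, y_left, y_right):
    """phi(s, T) = L_ZIP(T) - L_ZIP(T_L) - L_ZIP(T_R); larger values mean better splits."""
    return (node_neg_zip_loglik(y_parent)
            - node_neg_zip_loglik(y_left)
            - node_neg_zip_loglik(y_right))
```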

2. Boosting framework and ZIP boosted ensemble

Trees usually provide robust models for complex target functions (not limited by linearity assumptions) and are not sensitive to noise and outliers. They also allow working with mixed-type data (both numeric and categorical predictors) and handle missing values in a natural way. CART trees are also very fast to fit (they do not require the complex matrix operations of the MLE problem). That is why trees are widely used in real-life applications where data sets are mixed-type, large in the number of samples and predictors, and noisy (in both the input variables and the response).

However, a single tree often has low predictive power (especially if the underlying target model is complex and multivariate) and is not stable under small fluctuations in the data. So different authors proposed using ensembles of trees for regression and classification problems (L. Breiman introduced parallel ensembles, or Random Forests [7, 8], and J.H. Friedman introduced boosted ensembles [20, 21]). Ensembles have much higher predictive accuracy and generalization ability while keeping all the advantages of a single tree. As a result, ensemble methods have become more and more popular as an "off-the-shelf" approach and often provide results as good as the best state-of-the-art methods.

Random Forest, a parallel ensemble, is a set of trees, each built on a different (random) subsample of the training data. In each node, when searching for the best split, only a small subset of the input variables is selected at random. The prediction from a set of trees is obtained by averaging the predictions over the trees in regression, or by voting in classification. Gradient boosting, in its general form, constructs an additive regression (or logistic regression) model by sequentially fitting a simple parameterized function (a base learner that can be a tree or any other model) to the current "pseudo-residuals" at each iteration. The pseudo-residuals are the gradient of the loss functional being minimized, taken with respect to the model values at each training data point and evaluated at the current step.

Let us describe the gradient boosting framework more formally. Suppose we have a training sample $\{y_i, x_i\}_{i=1\dots n}$, $x_i = (x_{i1}, \dots, x_{im}) \in X$, $y_i \in Y$, where $n$ is the number of samples and $m$ is the number of input variables. Our goal is to find a function $F^*(x) : X \to Y$ that minimizes the expected value of a specified loss function $L(y, F(x))$ over the joint distribution of the $x, y$ values:

$$
F^*(x) = \arg\min_{F(x)} E_{x,y}\, L(y, F(x)).
$$

Here the expectation usually cannot be computed directly, as the joint distribution of $x, y$ is not known. So in practice it is replaced with the empirical risk, i.e.

$$
F^*(x) = \arg\min_{F(x)} \sum_{i=1}^{n} L(y_i, F(x_i)).
$$

Boosting uses an additive model to approximate $F^*(x)$:

$$
F^*(x) = \sum_{m=1}^{M} h(x, a_m),
$$

where $h(x, a)$ is some simple function (the "base learner") with parameter vector $a$. The base learner parameters $a_m,\ m = 1 \dots M$ are fitted in a forward stepwise manner. Starting from some initial approximation $F_0(x)$, the procedure iterates:

$$
a_m = \arg\min_{a} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + h(x_i, a)), \qquad (1)
$$

$$
F_m(x) = F_{m-1}(x) + h(x, a_m).
$$

Gradient boosting solves optimization problem (1) using the stepwise steepest descent method. The function $h(x, a)$ is fitted by least squares,

$$
a_m = \arg\min_{a} \sum_{i=1}^{n} \left(\tilde y_{im} - h(x_i, a)\right)^2,
$$

to the current "pseudo-residuals" or "pseudo-responses":

$$
\tilde y_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}. \qquad (2)
$$

Gradient tree boosting is the specialization of this approach to the case where the base learner is a CART regression tree.

Algorithm 1: Gradient tree boosting

1. $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$
2. For $m = 1$ to $M$ do:
3. $\tilde y_{im} = -\left[\dfrac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}, \quad i = 1, \dots, n$
4. $\{R_{lm}\}_{l=1\dots L} = L$-terminal-node tree fitted to $\{\tilde y_{im}, x_i\}_{i=1}^{n}$
5. $\gamma_{lm} = \arg\min_{\gamma} \sum_{x_i \in R_{lm}} L(y_i, F_{m-1}(x_i) + \gamma)$
6. $F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_{lm} I(x \in R_{lm})$
7. End for

Here $\gamma_{lm}$ is the response (mean) in node $R_{lm}$. The parameter $\nu$ is the "shrinkage rate" or "regularization parameter" that controls the learning rate of the algorithm. Smaller shrinkage values (0.01-0.1) have proven to reduce over-fitting, thus allowing models with better generalization ability to be built. Usually only a random part of the samples (about 60%) is used to learn the tree at step 4 (bootstrapping). This speeds up model building and also reduces over-fitting.
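Algorithm 1 can be sketched compactly in Python; this is our own illustration (not the authors' implementation), using scikit-learn's DecisionTreeRegressor as the base learner and a hypothetical `loss` object that supplies the initial constant, the pseudo-responses of step 3, and the per-leaf response of step 5.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbt(X, y, loss, n_iter=100, depth=3, shrinkage=0.1, subsample=0.6, seed=0):
    """Generic gradient tree boosting (Algorithm 1).
    `loss` is assumed to provide: init(y) -> F0 constant,
    grad(y, F) -> pseudo-responses, leaf_value(y, F) -> gamma for a leaf."""
    rng = np.random.default_rng(seed)
    n = len(y)
    F = np.full(n, loss.init(y), dtype=float)                            # step 1
    trees, leaf_values = [], []
    for m in range(n_iter):                                              # step 2
        idx = rng.choice(n, size=int(subsample * n), replace=False)
        resid = loss.grad(y[idx], F[idx])                                # step 3
        tree = DecisionTreeRegressor(max_depth=depth).fit(X[idx], resid) # step 4
        leaf = tree.apply(X)                                             # terminal node per sample
        gamma = {l: loss.leaf_value(y[leaf == l], F[leaf == l])          # step 5
                 for l in np.unique(leaf)}
        F += shrinkage * np.array([gamma[l] for l in leaf])              # step 6
        trees.append(tree)
        leaf_values.append(gamma)
    return trees, leaf_values
```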

A particularly interesting case of Algorithm 1 is two-class logistic regression (it also has a multi-class generalization, which we omit). It is derived from the gradient tree boosting framework using a CART tree as the base learner and the negative binomial log-likelihood as the loss function.

Assume that the response is binary, $y \in \{-1, 1\}$, and the loss function is the negative binomial log-likelihood $L(y, F) = \log(1 + \exp(-2yF))$, where $F$ is the two-class logistic transform

$$
F(x) = \frac{1}{2}\log\frac{\Pr(y = 1 \mid x)}{\Pr(y = -1 \mid x)},
$$

so each tree approximates the log-odds of the class-1 probability. The pseudo-response derived from formula (2), or step 3 of Algorithm 1, is

$$
\tilde y_{im} = 2y_i \big/ \left(1 + \exp(2 y_i F_{m-1}(x_i))\right).
$$

The optimization problem at step 5 cannot be solved in closed form, so a single Newton-Raphson step approximation is used:

$$
\gamma_{lm} = \sum_{x_i \in R_{lm}} \tilde y_{im} \Big/ \sum_{x_i \in R_{lm}} |\tilde y_{im}|\,(2 - |\tilde y_{im}|). \qquad (3)
$$

To increase the robustness of the GBT algorithm, influence trimming can be applied when selecting samples for building the subsequent tree. Suppose we want to estimate the response in a terminal node at step 5 of Algorithm 1 via the equation

$$
\sum_{x_i \in R_{lm}} \partial L(y_i, F_{m-1}(x_i) + \gamma)/\partial\gamma = 0, \qquad (4)
$$

whose single Newton-Raphson step solution has the sum of the second derivatives over the node in its denominator. The influence of the $i$-th sample on the solution can be gauged by the second derivative of the loss function, i.e.

$$
w_i = w(x_i) = \partial^2 L(y_i, F_{m-1}(x_i) + \gamma)/\partial\gamma^2\big|_{\gamma=0} = \partial^2 L(y_i, f)/\partial f^2\big|_{f = F_{m-1}(x_i)} = |\tilde y_{im}|\,(2 - |\tilde y_{im}|).
$$

When building the subsequent tree, we omit all observations with $w_i < w_{(l(\alpha))}$, where $l(\alpha)$ is the solution to $\sum_{i=1}^{l(\alpha)} w_{(i)} = \alpha \sum_{i=1}^{n} w_i$ (here the weights $w_{(i)}$ are the $w_i$'s sorted in ascending order), and $\alpha$ is usually chosen in the $[0.05, 0.2]$ range.

Influence trimming not only speeds up tree construction, but also improves the robustness of the Newton-Raphson step for equation (4) by preventing small denominator values in a tree node, because the denominator is proportional to the sum of the sample influences in the node.
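For the two-class logistic case, the pseudo-response, the Newton-step leaf value (3), and the influence-trimming rule can be written as follows (our sketch; the function names are illustrative):

```python
import numpy as np

def logistic_pseudo_response(y, F):
    """Negative gradient of L = log(1 + exp(-2yF)) for y in {-1, +1}."""
    return 2.0 * y / (1.0 + np.exp(2.0 * y * F))

def logistic_leaf_value(pseudo):
    """Single Newton-Raphson step for the leaf response, formula (3)."""
    denom = np.sum(np.abs(pseudo) * (2.0 - np.abs(pseudo)))
    return np.sum(pseudo) / max(denom, 1e-12)

def influence_trim_mask(w, alpha=0.1):
    """Drop samples with influence below w_{(l(alpha))}, where the l(alpha) smallest
    influences account for an alpha fraction of the total influence."""
    order = np.argsort(w)                      # influences in ascending order
    cum = np.cumsum(w[order])
    l_alpha = np.searchsorted(cum, alpha * w.sum())
    threshold = w[order[min(l_alpha, len(w) - 1)]]
    return w >= threshold
```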

Now we are ready to derive our own algorithm for the ZIP regression problem, using the negative ZIP likelihood as the loss function and the CART model as the base learner. We use two ensembles of trees to approximate the transformed ZIP distribution parameters $(p, \lambda)$. We use the same transformations (link functions) as Lambert [1], i.e. the logit-link for $p$ and the log-link for $\lambda$:

$$
\mu = \log\left(p/(1 - p)\right), \qquad p = e^{\mu}/(1 + e^{\mu}),
$$

$$
\nu = \log(\lambda), \qquad \lambda = e^{\nu}.
$$

So the first ensemble fits a model for $\mu(x)$, and the second one for $\nu(x)$. The initial value for $\nu$ is estimated from the zero-truncated Poisson distribution of the response:

$$
\frac{\lambda_0}{1 - e^{-\lambda_0}} = \bar y = \mathrm{mean}(y_i \mid y_i > 0), \qquad \nu_0 = \log(\lambda_0),
$$

and then $\mu_0$ as

$$
p_0 = \frac{n_0/n - e^{-\lambda_0}}{1 - e^{-\lambda_0}}, \qquad \mu_0 = \mathrm{logit}(p_0),
$$

where $n_0$ is the number of zero-class samples ($y_i = 0$). The loss function to be minimized takes the form

$$
L(y, p, \lambda) = \sum_{i=1}^{n} L(y_i, p_i, \lambda_i) = -\sum_{y_i = 0}\log\left(p_i + (1 - p_i)e^{-\lambda_i}\right) - \sum_{y_i > 0}\left(\log(1 - p_i) - \lambda_i\right) - \sum_{y_i > 0} y_i\log\lambda_i + \sum_{y_i > 0}\log(y_i!),
$$

where we denoted $p_i = p(x_i)$, $\lambda_i = \lambda(x_i)$ to simplify notation. The last term does not depend on the model and can be dropped. In terms of $\mu$ and $\nu$,

$$
L(y, p, \lambda) = L(y, \mu, \nu) = \sum_{i=1}^{n} L(y_i, \mu_i, \nu_i) = -\sum_{y_i = 0}\left(\log\left(e^{\mu_i} + \exp(-e^{\nu_i})\right) - \log(1 + e^{\mu_i})\right) + \sum_{y_i > 0}\left(\log(1 + e^{\mu_i}) + e^{\nu_i}\right) - \sum_{y_i > 0} y_i\nu_i,
$$

where $\mu_i = \mu(x_i)$, $\nu_i = \nu(x_i)$.
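The loss in terms of $\mu$ and $\nu$ translates directly into code; the sketch below (ours) uses `np.logaddexp` for the $\log(e^{\mu} + e^{-e^{\nu}})$ and $\log(1 + e^{\mu})$ terms to avoid overflow.

```python
import numpy as np

def zip_loss_mu_nu(y, mu, nu):
    """Negative ZIP log-likelihood in terms of mu = logit(p) and nu = log(lambda),
    with the constant sum of log(y_i!) dropped."""
    y, mu, nu = np.asarray(y), np.asarray(mu), np.asarray(nu)
    lam = np.exp(nu)
    zero = (y == 0)
    # y_i = 0 terms: -(log(e^mu + e^(-lambda)) - log(1 + e^mu))
    term_zero = -(np.logaddexp(mu[zero], -lam[zero]) - np.logaddexp(0.0, mu[zero]))
    # y_i > 0 terms: log(1 + e^mu) + lambda - y * nu
    term_pos = np.logaddexp(0.0, mu[~zero]) + lam[~zero] - y[~zero] * nu[~zero]
    return term_zero.sum() + term_pos.sum()
```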

Pseudo-responses are calculated as follows. The pseudo-response for the $p$-ensemble is

$$
\tilde y^{\mu}_{im} = -\left[\frac{\partial L(y_i, \mu_i, \nu_{m-1}(x_i))}{\partial \mu_i}\right]_{\mu_i = \mu_{m-1}(x_i)} = -\left[\frac{\partial L(y_i, p_i, \lambda_{m-1}(x_i))}{\partial p_i}\cdot\frac{dp(\mu_i)}{d\mu_i}\right]_{\mu_i = \mu_{m-1}(x_i)} =
\begin{cases}
\dfrac{p_i(1 - p_i)\left(1 - e^{-\lambda_i}\right)}{p_i + (1 - p_i)e^{-\lambda_i}}, & y_i = 0,\\[2mm]
-\,p_i, & y_i > 0,
\end{cases}
$$

where $p_i = p_{m-1}(x_i)$, $\lambda_i = \lambda_{m-1}(x_i)$. Here the pseudo-response is expressed in terms of $p_i, \lambda_i$ to simplify notation.

The pseudo-response for the $\lambda$-ensemble is derived in the same way (note that $dp/d\mu = e^{\mu}/(1 + e^{\mu})^2 = p(1 - p)$ and $d\lambda/d\nu = e^{\nu} = \lambda$):

$$
\tilde y^{\nu}_{im} = -\left[\frac{\partial L(y_i, \mu_{m-1}(x_i), \nu_i)}{\partial \nu_i}\right]_{\nu_i = \nu_{m-1}(x_i)} =
\begin{cases}
-\dfrac{\lambda_i e^{-\lambda_i}}{p_i/(1 - p_i) + e^{-\lambda_i}}, & y_i = 0,\\[2mm]
y_i - \lambda_i, & y_i > 0.
\end{cases}
$$
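Both pseudo-responses are cheap to compute once $p_{m-1}(x_i)$ and $\lambda_{m-1}(x_i)$ are available; a direct transcription of the formulas above into Python (our sketch) is:

```python
import numpy as np

def zip_pseudo_responses(y, mu, nu):
    """Negative gradients of the ZIP loss w.r.t. mu and nu
    (pseudo-responses for the p-ensemble and the lambda-ensemble)."""
    y = np.asarray(y)
    p = 1.0 / (1.0 + np.exp(-np.asarray(mu)))      # p = e^mu / (1 + e^mu)
    lam = np.exp(np.asarray(nu))
    s = p + (1.0 - p) * np.exp(-lam)               # P(y = 0 | p, lambda)
    zero = (y == 0)
    r_mu = np.where(zero,
                    p * (1.0 - p) * (1.0 - np.exp(-lam)) / s,
                    -p)
    r_nu = np.where(zero,
                    -lam * np.exp(-lam) * (1.0 - p) / s,
                    y - lam)
    return r_mu, r_nu
```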

The node response optimization problem at step 5 of Algorithm 1 is then solved via a single Newton-Raphson step, as in Friedman's two-class logistic regression. Unfortunately, in our case the Hessian (second derivative) can sometimes be negative, although such occasions are rare and possibly indicate over-fitting or "self-contradictory" data, i.e. a case where data points with similar $x$ values have very different $(p, \lambda)$ values. A negative Hessian means that the loss is not locally convex and thus cannot be approximated well by a second-order polynomial. In such cases we use one step of steepest descent instead of the Newton-Raphson step. The second derivatives for the $p$-tree (which are the summands in the denominator of formula (4)) are

$$
\frac{\partial^2 L(y_i, \mu_{m-1}(x_i) + \gamma, \nu_{m-1}(x_i))}{\partial\gamma^2}\bigg|_{\gamma=0} = \frac{\partial^2 L(y_i, \mu_i, \nu_{m-1}(x_i))}{\partial\mu_i^2}\bigg|_{\mu_i = \mu_{m-1}(x_i)} =
\begin{cases}
p_i(1 - p_i)\left(1 - \dfrac{e^{-\lambda_i}}{\left(p_i + (1 - p_i)e^{-\lambda_i}\right)^2}\right), & y_i = 0,\\[2mm]
p_i(1 - p_i), & y_i > 0.
\end{cases}
$$

Similarly for the $\lambda$-tree:

$$
\frac{\partial^2 L(y_i, \mu_{m-1}(x_i), \nu_{m-1}(x_i) + \gamma)}{\partial\gamma^2}\bigg|_{\gamma=0} = \frac{\partial^2 L(y_i, \mu_{m-1}(x_i), \nu_i)}{\partial\nu_i^2}\bigg|_{\nu_i = \nu_{m-1}(x_i)} =
\begin{cases}
\lambda_i(1 - p_i)e^{-\lambda_i}\,\dfrac{(1 - \lambda_i)p_i + (1 - p_i)e^{-\lambda_i}}{\left(p_i + (1 - p_i)e^{-\lambda_i}\right)^2}, & y_i = 0,\\[2mm]
\lambda_i, & y_i > 0.
\end{cases}
$$

Formula (4) for the "optimal" response in a $p$-tree terminal node then looks like ($n(R_{jm})$ is the count of training samples in node $R_{jm}$, and $w^{\mu}_{im}$ denotes the second derivative above):

$$
\gamma^{1}_{jm} =
\begin{cases}
\sum_{x_i \in R_{jm}} \tilde y^{\mu}_{im} \Big/ \sum_{x_i \in R_{jm}} w^{\mu}_{im}, & \sum_{x_i \in R_{jm}} w^{\mu}_{im} > \varepsilon = 10^{-6},\\
\sum_{x_i \in R_{jm}} \tilde y^{\mu}_{im} \big/ n(R_{jm}), & \text{otherwise},
\end{cases}
$$

and similarly for the $\lambda$-tree:

$$
\gamma^{2}_{jm} =
\begin{cases}
\sum_{x_i \in R_{jm}} \tilde y^{\nu}_{im} \Big/ \sum_{x_i \in R_{jm}} w^{\nu}_{im}, & \sum_{x_i \in R_{jm}} w^{\nu}_{im} > \varepsilon,\\
\sum_{x_i \in R_{jm}} \tilde y^{\nu}_{im} \big/ n(R_{jm}), & \text{otherwise}.
\end{cases}
$$
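The second derivatives and the safeguarded leaf update can be sketched as follows (our transcription of the formulas above; the epsilon threshold mirrors the one in the leaf formulas):

```python
import numpy as np

def zip_hessians(y, mu, nu):
    """Second derivatives of the ZIP loss w.r.t. mu (p-tree) and nu (lambda-tree)."""
    y = np.asarray(y)
    p = 1.0 / (1.0 + np.exp(-np.asarray(mu)))
    lam = np.exp(np.asarray(nu))
    s = p + (1.0 - p) * np.exp(-lam)
    zero = (y == 0)
    h_mu = np.where(zero,
                    p * (1.0 - p) * (1.0 - np.exp(-lam) / s**2),
                    p * (1.0 - p))
    h_nu = np.where(zero,
                    lam * np.exp(-lam) * (1.0 - p)
                    * ((1.0 - lam) * p + (1.0 - p) * np.exp(-lam)) / s**2,
                    lam)
    return h_mu, h_nu

def safeguarded_leaf_value(pseudo, hess, eps=1e-6):
    """Newton step sum(pseudo)/sum(hess) when the summed Hessian exceeds eps,
    otherwise one steepest-descent step (the mean pseudo-response)."""
    h = np.sum(hess)
    return np.sum(pseudo) / h if h > eps else float(np.mean(pseudo))
```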

There are several tricks that we use to improve the numerical stability of the algorithm. To prevent $\mu_i, \nu_i$ from causing numerical overflow or underflow we simply threshold them by a reasonable constant ($\log(\mathrm{FLT\_MAX}/2)$, for example). We also adopted the influence trimming strategy to prevent a Hessian with very small absolute value in a tree node. We found that one cannot remove samples with a negative second derivative of the loss function, because this can severely harm the performance of the algorithm. However, one can trim samples whose second derivative is small in absolute value in the $p$-tree. So we do no influence trimming for the $\lambda$-tree (a small absolute value of the second derivative of the loss function is unlikely there), and do influence trimming with weights $w_i = p_i(1 - p_i)$ for the $p$-tree in the same way as described earlier for two-class logistic regression.

3. Evaluation

First we validate our algorithm and compare its performance with our implementation of the ZIP tree on two artificial data sets. Both data sets are generated from a known model for the ZIP distribution parameters $(p, \lambda)$ with a small amount of random noise added, i.e.

$$
p_i = p(x_{1i}, x_{2i})\cdot(1 + \varepsilon\cdot M_1), \quad M_1 \in U(-1, 1),
$$

$$
\lambda_i = \lambda(x_{1i}, x_{2i})\cdot(1 + \varepsilon\cdot M_2), \quad M_2 \in U(-1, 1).
$$

Then the response value $y_i$ is generated from the ZIP distribution with parameters $(p_i, \lambda_i)$. In all experiments three values of the noise level, $\varepsilon = 0, 0.2, 0.5$, are used.

The first data set uses a linear model for $(p, \lambda)$:

$$
p = 0.2 + 0.6\cdot(0.3x_1 + 0.7x_2), \qquad \lambda = 1.5 + 7\cdot(0.6x_1 + 0.1x_2).
$$

The second data set uses a more complex, highly nonlinear model:

$$
\mathrm{logit}(p) = 2\sin(20x_1) + 3x_2\cdot(x_2 - 0.5),
$$

$$
\log(\lambda) = \sin(30x_1) + 3x_2.
$$
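For reproducibility, the nonlinear artificial data set can be generated roughly as below (our sketch; we assume $x_1, x_2 \sim U(0, 1)$, which the article does not state explicitly):

```python
import numpy as np

def make_nonlinear_zip_data(n=10000, noise=0.2, seed=0):
    """Generate (X, y) plus the true (p, lambda) for the nonlinear model above."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.random(n), rng.random(n)
    p = 1.0 / (1.0 + np.exp(-(2.0 * np.sin(20.0 * x1) + 3.0 * x2 * (x2 - 0.5))))
    lam = np.exp(np.sin(30.0 * x1) + 3.0 * x2)
    # multiplicative noise on both parameters, as in the experimental setup
    p = np.clip(p * (1.0 + noise * rng.uniform(-1.0, 1.0, n)), 0.0, 1.0)
    lam = np.maximum(lam * (1.0 + noise * rng.uniform(-1.0, 1.0, n)), 1e-8)
    y = np.where(rng.random(n) < p, 0, rng.poisson(lam))
    return np.column_stack([x1, x2]), y, p, lam
```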

For each model we report the base error (the error of the best constant model), the training error, and the 5-fold cross-validation error, where the error is the average negative ZIP log-likelihood. We also report the average absolute difference (on the training set) between the "true" and "predicted" parameters $(p, \lambda)$, and the average relative difference for the $\lambda$ parameter. The last three numbers show how well the ZIP distribution parameters are approximated by the model. In all experiments the model complexity (the pruning step for the tree and the number of iterations for GBT) is selected by the best CV error. The size of all data sets is 10000 samples.

For the artificial data sets, the following parameters are used. ZIP TREE: tree_depth = 6, min_split = 50, min_bucket = 20. ZIP GBT: nit = 1000, tree_depth = 3, min_split = 400, min_bucket = 200, shrinkage = 0.01, infl_trimming = 0.1.

Here tree_depth is the maximum tree depth (a node is not split if it is at the specified depth), min_split is the minimum size of a node that will be split (a node with fewer observations is NOT split), min_bucket is the minimum size of a terminal node (a split is not accepted if it creates a terminal node of smaller size), nit is the maximum number of iterations for an ensemble, shrinkage is the $\nu$ parameter (regularization) at step 6 of Algorithm 1, and infl_trimming is the threshold for influence trimming.

The base error column in the following table shows the negative ZIP log-likelihood for the best constant model; the train and CV errors are the training and 5-fold cross-validation errors (also negative ZIP log-likelihood); $\delta p$ is the average absolute difference in the predicted $p$ parameter ($\delta p = \sum_{i=1}^{n}|\hat p(x_{1i}, x_{2i}) - p(x_{1i}, x_{2i})|/n$, where $\hat p(x_1, x_2)$ is the prediction from the model); $\delta\lambda$ is the average absolute difference in the predicted $\lambda$ parameter ($\delta\lambda = \sum_{i=1}^{n}|\hat\lambda(x_{1i}, x_{2i}) - \lambda(x_{1i}, x_{2i})|/n$); $\delta\lambda_{rel}$ is the average relative difference in the predicted $\lambda$ parameter ($\delta\lambda_{rel} = \sum_{i=1}^{n}|1 - \hat\lambda(x_{1i}, x_{2i})/\lambda(x_{1i}, x_{2i})|/n$).

Table 1. Comparison of ZIP tree and ZIP GBT on two artificial data sets.

Data        Noise (ε)   Base error   Model   Train error   CV error   δp      δλ      δλ_rel   Best step
LINEAR      0           1.801        TREE    1.663         1.690      0.043   0.355   0.074    16
                                     GBT     1.653         1.675      0.027   0.182   0.038    413
            0.2         1.859        TREE    1.707         1.736      0.043   0.416   0.092    15
                                     GBT     1.702         1.721      0.032   0.179   0.040    284
            0.5         1.873        TREE    1.744         1.775      0.040   0.441   0.093    13
                                     GBT     1.733         1.754      0.032   0.234   0.049    319
NONLINEAR   0           2.920        TREE    1.535         1.675      0.146   3.105   0.403    49
                                     GBT     1.360         1.413      0.058   1.844   0.255    999
            0.2         3.037        TREE    1.594         1.735      0.156   3.027   0.423    35
                                     GBT     1.425         1.492      0.064   1.810   0.247    999
            0.5         3.310        TREE    1.774         1.925      0.154   3.112   0.394    42
                                     GBT     1.577         1.663      0.073   1.812   0.253    998

The table shows that GBT is always superior to the single tree in terms of train error, CV error, and the prediction error for the ZIP distribution parameters. One can also see that over-fitting (the difference between the CV and train errors) is much smaller for GBT, especially for larger noise levels and more complex models.

Then we compared the performance of ZIP GBT to the ZIP tree on two publicly available real-life data sets. The first one is SOLDER, which is a part of the free R package rpart; the second is the DMFT (decayed, missing and filled teeth) data set used in [6]. On the SOLDER data set, ZIP GBT is much better than a single tree in terms of cross-validated log-likelihood; on DMFT, GBT is only slightly better. The parameters of both algorithms were adjusted manually to minimize the cross-validation error:

ZIP TREE: tree_depth = 6, min_split = 15, min_bucket = 10.

ZIP GBT: nit = 1000, tree_depth = 3, min_split = 30, min_bucket = 20, shrinkage = 0.02 (0.005 for DMFT), infl_trimming = 0.1.

It can be seen that GBT has a much smaller CV error on the SOLDER data set and a slightly smaller one on the DMFT data set.

Table 2. Comparison of ZIP tree and ZIP GBT on real-life data.

Data     Base error   Model   Train error   CV error   Best step
SOLDER   4.464        TREE    2.493         2.714      9
SOLDER   4.464        GBT     1.510         1.818      765
DMFT     1.789        TREE    1.525         1.577      8
DMFT     1.789        GBT     1.499         1.564      660

Conclusion

This article introduces a gradient boosting model for the small-count regression problem, where the response is assumed to follow the ZIP distribution. The model uses the gradient tree boosting concepts introduced by Friedman for regression and classification and extends them to the ZIP model. It is shown that the algorithm's performance (both in terms of the log-likelihood value and the prediction of the ZIP distribution parameters as functions of the inputs) is superior to the performance of the ZIP tree.

The algorithm can be adapted to different problems by using different link functions. Further analysis of the algorithm's performance and comparison to other small-count data models on large real-life data from Intel manufacturing processes is of great interest.

References

1. D. Lambert. Zero-inflated Poisson regression with an application to defects in manufacturing// Technometrics, 34(1), pp. 1-14, 1992.

2. S. Lee and S. Jin. Decision tree approaches for zero-inflated count data// Journal of applied statistics, 33(8), pp. 853-865, 2006.

3. C. Li, J. Lu, and J. Park. Multivariate zero-inflated Poisson models and their applications// Technometrics, 41(1), pp. 29-38, 1999.

4. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2001.

5. D. Bohning, E. Dietz, P. Schlattman, L. Mendonca, and IJ. Kirchner. Testing parameter of the power series distribution of a zero inflated power series model// Statistical Methodology, 4, pp. 393-406, 2007.

6. D. Bohning, E. Dietz, P. Schlattman, L. Mendonca, and IJ. Kirchner. The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology// J. R. Statist. Soc. A, 162(2), pp. 195-209, 1999.

7. L. Breiman. Bagging predictors// Machine Learning, 24, pp. 123-140, 1996.

8. L. Breiman. Random forests// Machine Learning, 45, pp. 5-32, 2001.

9. L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and regression trees. Chapman and Hall/CRC, New York, 1998.

10. Y.B. Cheung. Zero-inflated models for regression analysis of count data: a study of growth and development// Statistics in Medicine, 21, pp. 1461-1469, 2002.

11. M. Chiogna and C. Gaetan. Semiparametric zero-inflated Poisson models with application to animal abundance studies// Environmetrics, 18, pp. 303-314, 2007.

12. F. Famoye and K.P. Singh. Zero-inflated generalized Poisson regression model with an application to domestic violence data.// Journal of Data Science, 4, pp. 117-130, 2006.

13. C. Hsu. A weighted zero-inflated Poisson model for estimation of recurrence of adenomas// Statistical Methods in Medical Research, 16, pp. 155-166, 2007.

14. K. Hur, D. Hedeker, W. Henderson, S. Khuri, and J. Daley. Modeling clustered count data with excess zeros in health care outcomes research// Health Services and Outcomes Research Methodology, 3, pp. 5-20, 2002.

15. C. Li, J. Lu, and J. Park. Multivariate zero-inflated Poisson models and their applications// Technometrics, 41(1), pp. 29-38, 1999.


16. C.S. Li, J.C. Lu, J. Park, K. Kim, P.A. Brinkley, and J.P. Peterson. Multivariate zero-inflated Poisson models and their applications. // Technometrics, 41(1), pp. 29-38, 1999.

17. R. Ramis Prieto, J. Garcia-Perez, M. Pollan, N. Aragones, B. Perez-Gomez, and G. Lopez-Abente. Modelling of municipal mortality due to haematological neoplasias in Spain // Journal of epidemiology and community health, 61(2), pp. 165-171, 2007.

18. P. Wang. Markov zero-inflated Poisson regression models for a time series of counts with excess zeros// Journal of Applied Statistics, 28(5), pp. 623-632, 2001.

19. K.K. Yau and A.H. Lee. Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme// Statistics in Medicine, 20(19), pp. 2907-2920, 2001.

20. J.H. Friedman. Greedy function approximation : a gradient boosting machine// Technical report, Dept. of Statistics, Stanford University, 1999.

21. J.H. Friedman. Stochastic gradient boosting// Computational Statistics and Data Analysis, 38(4), pp. 367-378, 2002.

The article was received by the editors on 25.04.2008.
