USAGE OF STRATIFIED SAMPLING OF A CONTROL SUBSET FOR PREDICATIVITY IMPROVEMENT OF BOOSTED DECISION TREE MODELS

Pyrohov V.

Postgraduate student, Department of Economic and Mathematical Modeling, Kyiv National Economic University named after Vadym Hetman

ABSTRACT

This article presents research aimed at increasing the stability of classification results for the creditworthiness of commercial bank borrowers, using the boosted decision trees algorithm with stratified sampling.

Models and software implementations of the boosted decision trees algorithm for assessing the creditworthiness of commercial bank borrowers were analyzed.

To confirm the results, the LGBM software package was applied to Home Credit Bank data made available through the Home Credit competition on the Kaggle data science platform.

The study proposes stratifying the control dataset by the target variable and by the most significant features when training a model, in order to increase the stability of the classification result and to make the evaluation of changes to the model architecture more efficient.

Keywords: decision trees, gradient boosting, stratified sampling, XGBoost, LGBM, Kaggle

JEL: C12, C38, C51, C52, C53, C63

1. Preface

a) Current state of data science.

With the formation of the modern information society at the turn of the 20th and 21st centuries, the economy faced new challenges and opportunities. The updated economic system generates huge flows of information that can be used to obtain an additional economic effect: added value created by correctly interpreting data with modern mathematical methods.

According to recent research [1], in 2016-2017 alone humanity generated more information than in the previous 5,000 years of human development.

Despite the large amount of information generated, only a small share of it, about 0.5%, is used to make operational decisions [1].

b) Relevance of the study.

There is a need to use data to generate added value, and therefore a need to create a scientific basis for using information to achieve economic goals. Today we are witnessing the emergence of such a science: data science.

Leading scientists working in the field of data science include Geoffrey Hinton [11, 12], Andrew Ng, Yann LeCun [14], Yoshua Bengio, Peter Norvig [13], Ian Goodfellow [14] and others.

c) Kaggle - a research platform in the field of data science.

A young science needs new tools, a new methodology and new approaches to solve the specific range of problems that belong to it.

One such modern tool is the set of internet resources that specialize in solving data science problems. Currently, the largest and most popular online data science hub is the Kaggle platform [2].

Kaggle is a platform for analytics and predictive modeling competitions in which statistics and data mining professionals compete to create the best models for predicting and describing data offered by companies or users. This crowdsourcing approach is based on the fact that there are many strategies that can be applied to any task of predictive modeling, and it is not known in advance which methodology or analytical approach will be the most effective [3].

d) Home Credit competition for risk of default classification.

The subject of the article is the theoretical basis and practical approaches to solving the binary classification problem on the open data of Home Credit Bank in the Home Credit Default Risk competition [4].

As part of the Home Credit Default Risk competition, Home Credit Bank provided data on loan applications for two retail loan products: consumer loans and credit cards. A distinctive feature of the sample is that it consists of "unbankable" customers whose loan applications would have been rejected under one of the existing credit rules, but who were granted loans by the bank in order to improve existing decision-making models and expand the coverage of potential borrowers.

Because Home Credit Bank selected a sample of customers with low credit ratings, it provided additional data sources to ensure sufficient predictive power of the analytical models, such as:

1) detailed behavioral information on the balances of the client's existing and previous loans and on the client's payments, according to the credit bureau and the bank's internal data;

2) information from real estate registers on the condition and average values of factors that characterize real estate owned by the client;

3) an assessment of the client's region of residence.

Fig. 1. The structure of the input data of the Home Credit Competition (based on public data [5]).

The training sample provided by the bank for the construction of the forecast model included 307511 observations (see Table 1).

Table 1

| Type of credit | Number of observations | % of observations |
|---|---|---|
| Credit repaid | 282 686 | 91.93% |
| Default | 24 825 | 8.07% |
| Grand total | 307 511 | 100.00% |

The test sample provided by the bank to validate the constructed model included 48744 observations.

2. Main part.

a) The method of boosted decision trees.

One of the most popular and effective algorithms used in data science competitions is the gradient boosted trees method.

The classic work that laid the theoretical foundation for boosted decision trees is J. Friedman's "Greedy function approximation: A gradient boosting machine" [6].

Friedman's work is based on the idea that a base predictive model is "weak" on its own and can be strengthened by constructing an ensemble of such models whose parameters are fitted with an optimization algorithm (such as gradient descent). Once the outputs of the ensemble members are aggregated, the resulting model is considered "strong": aggregation and parameter optimization reduce the error of the individual models. The general form of the resulting model is:

$$F\left(x; \{b_m, a_m\}_1^M\right) = \sum_{m=1}^{M} b_m h(x; a_m) \tag{1}$$

Consider the case where each base model is a decision tree. In this case, each decision tree has an additive form:

$$h\left(x; \{b_j, R_j\}_1^J\right) = \sum_{j=1}^{J} b_j \mathbf{1}(x \in R_j) \tag{2}$$

Here $\{R_j\}_1^J$ are the regions defined by the terminal nodes of the decision tree; they are disjoint and together cover the whole range of values of the independent variable $x$.

The indicator function $\mathbf{1}(\cdot)$ equals 1 if its argument is true and 0 otherwise.

The parameters of this base model are the coefficients $\{b_j\}_1^J$ and the quantities that define the boundaries of the regions $\{R_j\}_1^J$, which in turn correspond to the splits at the non-terminal nodes of the tree.
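Formula (2) can be illustrated with a minimal Python sketch. The regions, the coefficients and the function names below are hypothetical and chosen only for illustration: a tree on a single variable with three terminal nodes is evaluated as a sum of coefficients multiplied by indicator functions.

```python
# Illustrative only: a decision tree with J = 3 terminal nodes on one variable x,
# written as h(x) = sum_j b_j * 1(x in R_j).
# The regions and coefficients are made-up numbers, not values from the article.
REGIONS = [(float("-inf"), 0.3), (0.3, 0.7), (0.7, float("inf"))]  # R_j = [low, high)
COEFFS = [0.02, 0.08, 0.25]                                          # b_j


def indicator(condition: bool) -> int:
    """Indicator function 1(.): 1 if the argument is true, 0 otherwise."""
    return 1 if condition else 0


def tree_predict(x: float) -> float:
    """Evaluate h(x) as a sum over the terminal nodes of the tree."""
    return sum(b * indicator(low <= x < high)
               for b, (low, high) in zip(COEFFS, REGIONS))


print(tree_predict(0.1), tree_predict(0.5), tree_predict(0.9))
```

Because the regions are disjoint and cover the whole axis, exactly one indicator is non-zero for any x, so the sum simply returns the coefficient of the region that contains x.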

For a decision tree, the update step of the boosting algorithm takes the form:

$$F_m(x) = F_{m-1}(x) + \rho_m \sum_{j=1}^{J} b_{jm} \mathbf{1}(x \in R_{jm}) \tag{3}$$

where $\{R_{jm}\}_1^J$ are the regions defined by the terminal nodes of the decision tree built at iteration $m$. These regions are constructed to predict the pseudo-responses $\{\tilde{y}_i\}_1^N$.

$\rho_m$ is the scaling factor found by line search.

In formula (1), $h(x; a_m)$ is a parametric function of the input variables $x$ with parameters $a = \{a_1, a_2, \dots\}$.

Formula (3) can be reduced to:

$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} \mathbf{1}(x \in R_{jm}) \tag{4}$$

where $\gamma_{jm} = \rho_m b_{jm}$.

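For the least-absolute-deviation loss, the algorithm can be described by the following loop [6]:

$F_0(x) = \operatorname{median}\{y_i\}_1^N$

For $m = 1$ to $M$ do:

$\tilde{y}_i = \operatorname{sign}\left(y_i - F_{m-1}(x_i)\right), \quad i = 1, \dots, N$

$\{R_{jm}\}_1^J = J\text{-terminal node tree}\left(\{\tilde{y}_i, x_i\}_1^N\right)$

$\gamma_{jm} = \operatorname{median}_{x_i \in R_{jm}}\left\{y_i - F_{m-1}(x_i)\right\}, \quad j = 1, \dots, J$

$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} \mathbf{1}(x \in R_{jm})$

end For

end Algorithm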
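The loop above can be sketched in plain Python. This is a minimal illustration, not the implementation used by XGBoost or LightGBM: it assumes a single input variable with at least two distinct values, uses trees with two terminal nodes (stumps) fitted by least squares, and all function names and the toy data are hypothetical.

```python
from statistics import median


def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def sse(values):
    """Sum of squared deviations from the mean (least-squares node cost)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def fit_stump(xs, targets):
    """Least-squares split of a 1-D input into two regions; returns the threshold."""
    points = sorted(set(xs))
    best_t, best_cost = None, float("inf")
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [v for x, v in zip(xs, targets) if x <= t]
        right = [v for x, v in zip(xs, targets) if x > t]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


def lad_treeboost(xs, ys, n_rounds=5):
    """Boosting loop for the absolute-error loss with stumps as base models."""
    f0 = median(ys)                                   # F_0(x)
    preds = [f0] * len(xs)
    stages = []                                       # (threshold, gamma_left, gamma_right)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        pseudo = [sign(r) for r in residuals]         # pseudo-responses
        t = fit_stump(xs, pseudo)                     # regions R_1m, R_2m
        g_left = median(r for x, r in zip(xs, residuals) if x <= t)
        g_right = median(r for x, r in zip(xs, residuals) if x > t)
        stages.append((t, g_left, g_right))
        preds = [p + (g_left if x <= t else g_right) for x, p in zip(xs, preds)]
    return f0, stages


def predict(model, x):
    """Evaluate F_M(x) for a single input value."""
    f0, stages = model
    return f0 + sum(gl if x <= t else gr for t, gl, gr in stages)


def mae(ys, preds):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)


# Toy data: a step function with one outlier at x = 0.3.
xs = [i / 10 for i in range(20)]
ys = [0.0] * 10 + [1.0] * 10
ys[3] = 5.0

model = lad_treeboost(xs, ys)
mae_base = mae(ys, [median(ys)] * len(ys))
mae_boost = mae(ys, [predict(model, x) for x in xs])
print(mae_base, mae_boost)
```

Each round updates every region by the median of its residuals, so the mean absolute error on the training data cannot increase from one round to the next.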

Gradient boosting of decision trees produces competitive, robust and interpretable models for classification problems, and achieves good results even when the input data are of low quality.

b) Program packages XGBoost and LGBM.

XGBoost is an open-source software library, first released in 2014, that implements the gradient boosting algorithm and provides interfaces for the C++, Java, Python, R and Julia programming languages. The library runs on Linux, Windows and macOS.

In addition to running on a single machine, XGBoost supports distributed data processing frameworks such as Apache Hadoop, Apache Spark and Apache Flink. It has gained considerable popularity and attention because it has been used by a significant number of winning teams in machine learning competitions.

XGBoost started as a research project within the Distributed (Deep) Machine Learning Community (DMLC) group [9]. Initially, the library was an application that could be configured using a configuration file. After it was used in the winning solution of the Higgs Machine Learning Challenge, the library became well known in machine learning competition circles. Packages for Python and R were soon added, and there are now packages for many other languages, such as Julia, Scala and Java. The ability to use different programming languages has widened the circle of developers and made XGBoost popular among the Kaggle community. The paper on XGBoost by the library's authors, Tianqi Chen and Carlos Guestrin, is freely available on arxiv.org [7].

LightGBM (LGBM) is a gradient boosting framework that uses tree-based learning algorithms. It is a more recent, optimized software implementation of greedy function approximation with decision trees. The main strengths claimed for LightGBM are:

1) Greater speed and efficiency of model learning.

2) Higher accuracy of the obtained models.

3) More efficient use of RAM during training of models.

4) Support for parallel learning and learning using graphics processors. [8]

According to its developers, comparative experiments on open datasets show that LightGBM can outperform existing boosting frameworks in both efficiency and accuracy [8].

c) Description of the initial predictive models built during the Home Credit competition

The first stage in preparing the predictive model was initial data processing and feature engineering based on the independent variables of the model.

The following approaches were used to engineer features:

1) mathematical and statistical approaches: a set of mathematical functions used to aggregate the available data on the client's loan servicing:

• average value;

• minimum;

• maximum;

• sum;

• standard deviation;

• number of unique records;

2) expert analysis of data based on the economic content of initial independent variables:

• Income per Person - the borrower's income per 1 member of his family.

• Children ratio - the ratio of the number of children in the borrower's family to the total number of members of his family.

• Credit to Goods ratio - the ratio of the loan amount to the value of goods purchased in credit.

3) specific indicators used in the banking sector in the field of retail lending:

• DPD - days past due.

• DBD - days before due.

• Loan to Income ratio - the ratio of the loan amount to the borrower's income.

The final model included 591 features in total.
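The aggregation and ratio features listed above can be sketched as follows. This is an illustration under assumptions, not the author's feature pipeline: the field names (`client_id`, `dpd`, `income`, `family_members`, `children`, `credit`, `goods_price`) and the sample values are hypothetical, and only the standard library is used to keep the example self-contained.

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate(records, key, field):
    """Group records by `key` and summarise `field` with the listed statistics."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec[field])
    return {
        k: {
            "mean": mean(v),
            "min": min(v),
            "max": max(v),
            "sum": sum(v),
            "std": pstdev(v),          # population standard deviation (an assumption)
            "nunique": len(set(v)),
        }
        for k, v in groups.items()
    }


def ratio_features(app):
    """Expert ratio features built from a single loan application."""
    return {
        "income_per_person": app["income"] / app["family_members"],
        "children_ratio": app["children"] / app["family_members"],
        "credit_to_goods": app["credit"] / app["goods_price"],
        "loan_to_income": app["credit"] / app["income"],
    }


# Hypothetical payment history: days past due (DPD) per instalment.
payments = [
    {"client_id": 1, "dpd": 0},
    {"client_id": 1, "dpd": 5},
    {"client_id": 1, "dpd": 10},
    {"client_id": 2, "dpd": 0},
]
application = {"income": 60000, "family_members": 3, "children": 1,
               "credit": 120000, "goods_price": 100000}

dpd_stats = aggregate(payments, "client_id", "dpd")
ratios = ratio_features(application)
print(dpd_stats[1])
print(ratios)
```

Applying several aggregation functions to many behavioural fields is how a few dozen raw variables can grow into hundreds of model features.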

d) Using stratified sampling to create a control sample while training a boosted decision tree model.

When building boosted predictive models, an important task of the researcher is to form a control sample.

Using a control sample while preparing the model makes it possible to prevent overfitting of the predictive model, that is, cases in which the model successfully recognizes only the specific set of training data.
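One common way to use a control sample against overfitting is early stopping: training stops once the score on the control sample has stopped improving. The sketch below is an assumption about how this can be done, not a description of the author's procedure; the function name, the `patience` parameter and the scores are hypothetical. Libraries such as LightGBM offer a comparable early-stopping option.

```python
def best_iteration(control_scores, patience=3):
    """Scan per-round control-sample scores (higher is better) and stop once the
    score has not improved for `patience` consecutive rounds.

    Returns the 1-based round with the best score seen and that score.
    """
    best_round, best_score = 0, float("-inf")
    for round_no, score in enumerate(control_scores, start=1):
        if score > best_score:
            best_round, best_score = round_no, score
        elif round_no - best_round >= patience:
            break
    return best_round, best_score


# Hypothetical AUC on the control sample after each boosting round.
scores = [0.70, 0.75, 0.78, 0.77, 0.779, 0.76, 0.80]
stop_round, stop_score = best_iteration(scores, patience=3)
print(stop_round, stop_score)
```

The stopping decision depends entirely on the control sample, which is why its composition matters: a control sample that differs from the training data gives a noisy signal.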

Creating a control sample means selecting observations from the full dataset so that the control sample is as close as possible to the training subset in its properties.

The method of stratified sampling is used to preserve the properties of the training set in the control set.

Stratified sampling is a method of random selection that involves dividing the total population into smaller subgroups (strata) and sampling randomly from each stratum. The strata are formed from characteristics shared by their members, which makes it possible to reproduce the heterogeneity of the total population in the sample.

A classic work describing the use of stratified sampling in statistics is the work of Neyman J. "On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection" [10].

The basic idea of stratified sampling is:

1) splitting the heterogeneous sample into smaller groups, or strata (subpopulations), such that the groups are:

• homogeneous within each stratum with respect to the target characteristics;

• heterogeneous between strata with respect to the target characteristics;

2) randomly selecting observations from each stratum in proportion to the distribution of the strata's target characteristics in the initial data.

The general approach to stratified selection is shown in Fig. 2.

Fig. 2. Description of stratified sampling of the subset (built by author).
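The two steps described above can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name `stratified_split`, the 20% control share and the synthetic rows are hypothetical, and in practice an equivalent library routine would normally be used.

```python
import random
from collections import defaultdict


def stratified_split(rows, stratum_of, control_share=0.2, seed=0):
    """Split rows into (train, control) so that each stratum contributes the
    same share of its observations to the control sample."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:                      # step 1: split the data into strata
        strata[stratum_of(row)].append(row)
    train, control = [], []
    for members in strata.values():       # step 2: random selection per stratum
        members = members[:]
        rng.shuffle(members)
        n_control = round(len(members) * control_share)
        control.extend(members[:n_control])
        train.extend(members[n_control:])
    return train, control


# Synthetic data with an 8% default rate, similar in spirit to Table 1.
rows = [{"id": i, "target": 1 if i < 80 else 0} for i in range(1000)]
train, control = stratified_split(rows, stratum_of=lambda r: r["target"])

control_rate = sum(r["target"] for r in control) / len(control)
train_rate = sum(r["target"] for r in train) / len(train)
print(len(train), len(control), train_rate, control_rate)
```

With purely random selection the default rate in the control sample would fluctuate around 8% from draw to draw; with stratification by the target it is fixed by construction, up to rounding.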

e) XGBoost and LGBM results using stratified sampling.

In the course of the research, an experiment was conducted in which stratified sampling was used to form the control sample for training a boosted decision tree model.

Three selection methods were used to form the control sample in the experiment:

1) random selection;

2) stratified selection by the dependent variable;

3) stratified selection by the dependent variable and by the most significant variables of the model, from the EXT_SRC group.

A common model architecture and input characteristics were used for all methods.
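Stratifying by a continuous variable such as those in the EXT_SRC group requires turning it into discrete strata first. The article does not say how this was done; the sketch below assumes equal-frequency (quantile) bins, treats missing values as a separate bin, and combines the bin with the dependent variable into a composite stratum label. All names and values are hypothetical.

```python
from collections import Counter


def quantile_bins(values, n_bins=4):
    """Assign each value to one of `n_bins` equal-frequency bins by rank.
    Missing values (None) get the separate bin -1."""
    known = sorted((v, i) for i, v in enumerate(values) if v is not None)
    bins = [-1] * len(values)
    for rank, (_, i) in enumerate(known):
        bins[i] = rank * n_bins // len(known)
    return bins


def composite_strata(targets, feature, n_bins=4):
    """Stratum label = (dependent variable, quantile bin of the feature)."""
    return list(zip(targets, quantile_bins(feature, n_bins)))


# Hypothetical target values and scores of one external-source feature.
targets = [0, 1, 0, 0, 1]
ext_src = [0.1, 0.9, None, 0.5, 0.3]

strata = composite_strata(targets, ext_src, n_bins=2)
print(strata)
print(Counter(strata))
```

Each distinct label then plays the role of one stratum when the control sample is drawn. The number of bins is a trade-off: more bins follow the feature distribution more closely but leave fewer observations per stratum.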

Five iterations were run for each type of selection; the results of the LGBM models on the control sample are shown in Tables 2-4:

Table 2

Random selection

| Model № | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| AUC | 0.7963041 | 0.7902307 | 0.7859878 | 0.7872575 | 0.7750518 |

Variance $\sigma^2 = 6.02121 \times 10^{-5}$

Table 3

Stratified selection by dependent variable

| Model № | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| AUC | 0.7951976 | 0.7903783 | 0.7896288 | 0.7850145 | 0.7907957 |

Variance $\sigma^2 = 1.314457 \times 10^{-5}$

Table 4

Stratified selection by dependent variable and variables EXT_SRC

| Model № | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| AUC | 0.7934122 | 0.7911496 | 0.7871199 | 0.7895246 | 0.7876644 |

Variance $\sigma^2 = 6.671408 \times 10^{-6}$

Based on the obtained results, it can be concluded that stratified selection by the indicators with the greatest impact on the model reduces the variance of the model results on the control sample.

This makes it possible to increase the stability of the model result, which is useful when validating whether a change to the model architecture is effective.
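The variances reported under Tables 2-4 can be checked with a few lines of Python. The AUC values below are copied from the tables; the dictionary keys are hypothetical labels. The sketch assumes the sample variance (division by n - 1), which agrees with the reported figures to about four significant digits.

```python
from statistics import mean, variance

# AUC on the control sample for the five models of each selection method.
auc = {
    "random": [0.7963041, 0.7902307, 0.7859878, 0.7872575, 0.7750518],
    "stratified_target": [0.7951976, 0.7903783, 0.7896288, 0.7850145, 0.7907957],
    "stratified_target_ext_src": [0.7934122, 0.7911496, 0.7871199, 0.7895246, 0.7876644],
}

means = {name: mean(scores) for name, scores in auc.items()}
variances = {name: variance(scores) for name, scores in auc.items()}  # sample variance

for name in auc:
    print(f"{name}: mean AUC = {means[name]:.5f}, variance = {variances[name]:.4e}")
```

The mean AUC is nearly the same for all three methods, while the variance falls by roughly an order of magnitude between random selection and stratification by the target and EXT_SRC variables. With only five runs per method, the variance estimates themselves are rough.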

The next stage of the experiment was to compare the predictive power of the models built with different methods of selecting the validation sample. Predictive power was assessed on a test sample whose target values, under the rules of the competition, were not available to researchers. The score for the test sample was computed by the Kaggle system from the submitted predictions of the target variable.

The result of the constructed models for the test sample is shown in table 5.

Table 5

The results of the constructed models for the test sample

| Type of validation set | AUC, validation subset | AUC, test subset |
|---|---|---|
| Random selection | 0.78697 | 0.7907 |
| Stratified selection by dependent variable | 0.7902 | 0.79156 |
| Stratified selection by dependent variable and variables EXT_SRC | 0.78977 | 0.79354 |

As the results show, the outcome on the control sample itself is ambiguous: stratified selection by the dependent variable alone gave a slightly higher AUC than selection by the dependent variable and the most significant model variables. On the test sample, however, both stratified variants outperformed random selection, and the AUC rose with each additional level of stratification (0.7907, 0.79156 and 0.79354 respectively).

3. Conclusion.

Boosted decision tree models confidently hold the lead among the algorithms used in data classification competitions.

The most popular software packages used to create boosted decision tree models are the classic XGBoost and a more optimized solution using a similar algorithm - LGBM.

In data classification competitions, results need to be stable, other things being equal. This in turn requires eliminating fluctuations in the distribution of features in the control sample relative to the main dataset.

The use of stratified sampling by the target variable and the most significant characteristics of the model is proposed to solve the described problem.

According to the results of the study, it can be concluded that:

1) additional stratification during the selection of the control sample has a positive effect on the predictive power of the model, because it preserves the heterogeneity of the overall dataset in the control sample;

2) in addition to the positive effect on predictive power, stratified selection by the most significant indicators of the model reduced the variance of the results across model iterations on the control sample.

Stratified sampling of the control sample during the training of boosted decision tree models makes it possible to increase the stability of the model result, which makes the validation of changes to the model architecture more efficient.

References

1. Harris R. More data will be created in 2017 than the previous 5,000 years of humanity. App Developer Magazine, 2016 - URL: https://appdevelopermagazine.com/more-data-will-be-created-in-2017-than-the-previous-5,000-years-of-humanity-/

2. Kaggle analytics and predictive modeling platform - URL: https://www.kaggle.com/

3. Kaggle. Wikipedia - URL: https://uk.wikipedia.org/wiki/Kaggle

4. Home Credit Default Risk. Kaggle - URL: https://www.kaggle.com/c/home-credit-default-risk

5. Home Credit Default Risk Competition Data Description. Kaggle - URL: https://www.kaggle.com/c/home-credit-default-risk/data

6. Friedman J.H. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, Vol. 29, No. 5, 2001 - pp. 1189-1232 - URL: https://projecteuclid.org/download/pdf_1/euclid.aos/1013203451

7. Chen T., Guestrin C. XGBoost: A Scalable Tree Boosting System. arXiv:1603.02754 - URL: https://arxiv.org/abs/1603.02754

8. LightGBM source code. Github - URL: https://github.com/Microsoft/LightGBM

9. Chen T. Story and lessons behind the evolution of XGBoost - URL: https://homes.cs.washington.edu/~tqchen/2016/03/10/story-and-lessons-behind-the-evolution-of-xgboost.html

10. Neyman J. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97(4), 1934 - pp. 558-625 - URL: http://www.stat.cmu.edu/~brian/905-2008/papers/neyman-1934-jrss.pdf

11. Krizhevsky A., Sutskever I., Hinton G. E. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing, 25, MIT Press, Cambridge - URL: http://www.cs.toronto.edu/~hinton/absps/imagenet.pdf

12. Salakhutdinov R.R., Mnih A., Hinton G.E. Restricted Boltzmann Machines for Collaborative Filtering. International Conference on Machine Learning, Corvallis, Oregon, 2007 - URL: http://www.cs.toronto.edu/~hinton/absps/netflix.pdf

13. Russell S.J., Norvig P. Artificial Intelligence: A Modern Approach, New Jersey: Prentice Hall, 2003. - (2nd edition).

14. Goodfellow I., Bengio Y., Courville A. Deep Learning (Adaptive Computation and Machine Learning series), Cambridge: The MIT Press - 2016

