Rongji,
St. Andrew's Episcopal School E-mail: [email protected]
DEVELOPMENT OF A MODEL TO IMPROVE BUSINESS PROFITABILITY
Abstract: Data from the consumer credit risk domain, provided by CompuCredit, was used to build binary classification models to predict the likelihood of default. This project compares two such models: one using four variables and the other using twelve. Although the original dataset had several hundred predictor variables and more than a million observations, I chose to use rather simple models. My goal was to develop a model with as few predictors as possible while not going below a concordance level of 65%. The two models were evaluated and compared on efficiency, simplicity, and profitability. Using the selected model, cluster analysis was then performed to maximize the estimated profitability. Finally, the analysis was taken one step further through a supervised segmentation process, in order to target the most profitable segment of the best cluster.
Keywords: business profitability; model; data; financial product.
1. Introduction
For any business, profitability is a key indicator of success. Developing a model to predict the profit of a business can require a large number of variables to make the model robust; however, a large number of variables reduces the efficiency of the model. A model that uses a small number of variables while remaining robust is therefore a significant improvement. The first objective of this study is to develop a model that can predict the default of a financial product with a small number of variables; the second objective is to estimate the profitability that can be generated by this model.
2. Data and Method
Datasets from a financial institution are used.
2.1 Logistic Regression Model
Before the model was built, random samples were taken from the dataset to create two independent files: a training file and a validation file. The model is built with logistic regression, defining a binary variable to predict the likelihood of default.
Logistic regression belongs to a category of statistical models called generalized linear models; it allows one to predict a discrete outcome from a set of variables that may be continuous, discrete, dichotomous, or a combination of these. Typically, the dependent variable is dichotomous and the independent variables are either categorical or continuous.
To develop the predictive model for mortgage delinquency, stepwise logistic regression was conducted using SAS version 9.2, with a cutoff of P < 0.05 for adding new variables. If an account is over 30 days delinquent (the variable Current Loan Delinquent Status is >= 1), the outcome variable Delinquent is coded as 1; otherwise it is coded as 0. Logistic regression relates changes in the natural logarithm of the odds of the outcome to changes in the independent variables [7].
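The outcome coding described above can be sketched as follows. The paper's modeling was done in SAS; this is an illustrative Python/pandas equivalent with made-up status values, and the column names are assumptions.

```python
# Illustrative sketch (not the paper's SAS code): coding the binary
# outcome variable from a delinquency-status column, as described above.
import pandas as pd

df = pd.DataFrame({"CurrentLoanDelinquentStatus": [0, 1, 3, 0, 2]})

# An account over 30 days delinquent (status >= 1) is coded 1, else 0.
df["Delinquent"] = (df["CurrentLoanDelinquentStatus"] >= 1).astype(int)

print(df["Delinquent"].tolist())  # [0, 1, 1, 0, 1]
```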
The logistic regression model can be expressed with the formula:
ln(P / (1 - P)) = β0 + β1*X1 + β2*X2 + ... + βn*Xn
where P is the probability of default, β0 is a constant, β1 through βn are the regression coefficients, and X1 through Xn are the independent variables, such as credit utilization, collection amounts, and delinquency history. For simplicity, the left-hand side of the equation is often referred to as "the logit". The coefficients describe each independent variable's effect on the natural logarithm of the odds, rather than directly on the probability P.
To facilitate interpretation, the transformation e^βi of each regression coefficient can be derived and interpreted as follows: if e^βi > 1, the odds P/(1 - P) increase with Xi; if e^βi < 1, the odds decrease; if e^βi = 1, the odds are unaffected.
After the logistic regression model was obtained, the observations in the training dataset were scored; the predicted probabilities of default were ranked and deciled, the KS statistic was calculated, and a gains chart was produced.
2.2 Model validation
The holdout dataset was scored with the logistic regression model; the predicted probabilities of default were ranked and deciled, the Kolmogorov-Smirnov statistic (abbreviated KS) was calculated, and gains charts were produced. If the KS of the validation dataset is close to the KS of the training dataset, the model is considered stable.
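The KS statistic used here can be computed as the maximum gap between the cumulative capture rates of defaulters and non-defaulters across the score ranking. This is a hedged Python sketch, not the paper's SAS code, and the scores and labels below are invented for illustration:

```python
# Kolmogorov-Smirnov (KS) statistic for a scoring model:
# max separation between cumulative bad and good distributions.
def ks_statistic(scores, labels):
    """Scan accounts from highest score to lowest and track the
    largest gap between cumulative bad rate and cumulative good rate."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    cum_bad = cum_good = 0
    best = 0.0
    for _, y in pairs:
        if y == 1:
            cum_bad += 1
        else:
            cum_good += 1
        best = max(best, abs(cum_bad / n_bad - cum_good / n_good))
    return best

# Invented example: 1 = default, 0 = no default.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   0,   0,   0,   0]
print(ks_statistic(scores, labels))  # 0.8
```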
The goal is to build a model with as few predictors as possible while not going below a concordance level of 65%. Hundreds of predictors were initially placed into a model using the backward selection option of the Proc Logistic procedure. Variables showing no effect were removed, and variables were selected based on the highest Chi-Square value. Several models were developed and tested.
2.3 Model Comparison
Through the process of model development and validation, two models were selected for comparison. The two models were compared on both efficiency and profitability. Profitability reports were generated for each model using a profitability function. The cost of simplicity was an important factor in determining which model would be best.
3. Results and Discussion
3.1 Four-Variable Model
The four-variable logistic regression model was developed, with the following results.
Table 1. - Analysis of Maximum Likelihood Estimates
Parameter Parameter Description DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 -0.05250 0.25820 0.04 0.8389
X1 Utilization of all revolving bankcard trades 1 0.00058 0.00025 5.23 0.0223
X2 Highest utilization on Any Single Bank Revolving Trade 1 0.00099 0.00039 6.46 0.0111
X3 Total Collection/Charge Off/ Repossession Dollars Within 12 Months 1 0.00005 0.00002 10.57 0.0011
X4 Percent of Trades Never Delinquent or Derogatory 1 -0.03620 0.00310 136.36 < 0.0001
Four-Variable Model Performance
All four variables are significant at the 0.05 level. The model achieves 68.8 percent concordant pairs and an area under the ROC curve of 0.689.
Table 2. - Association of Predicted Probabilities and Observed Responses
Percent Concordant 68.8 Somers' D 0.377
Percent Discordant 31.1 Gamma 0.378
Percent Tied 0.2 Tau-a 0.147
Pairs 2,082,730 c 0.689
Figure 1. ROC curve for the selected model (area under the curve = 0.6886).
The gains table, tabulated below, shows that KS reaches 0.542.
Table 3.
Decile Freq Cum Freq Default Cum Default Mean Default Cum Default Rate Default Capture Rate Min Score Max Score Mean Score KS
1 245 245 80 80 0.327 0.327 0.439 0.172 1.000 0.356 0.366
2 245 490 46 127 0.190 0.259 0.693 0.082 0.170 0.116 0.533
3 246 736 20 147 0.080 0.199 0.801 0.049 0.082 0.062 0.542
4 245 982 6 153 0.025 0.155 0.835 0.038 0.049 0.043 0.470
5 246 1,227 9 162 0.038 0.132 0.885 0.031 0.038 0.034 0.416
6 245 1,473 15 177 0.062 0.120 0.969 0.029 0.031 0.029 0.399
7 245 1,718 3 180 0.011 0.105 0.984 0.027 0.029 0.028 0.307
8 245 1,963 2 182 0.009 0.093 0.996 0.026 0.027 0.026 0.212
9 246 2,209 1 183 0.003 0.083 0.999 0.025 0.026 0.025 0.108
10 246 2,455 0 183 0.000 0.074 1.000 0.025 0.025 0.025 0.000
Total 2,455 2,455 183 183 0.074 0.074 1.000 0.025 1.000 0.074 0.542
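A decile gains table like Table 3 can be assembled with a few lines of pandas. This sketch uses simulated scores and outcomes rather than the paper's data; the column names are assumptions.

```python
# Illustrative gains-table construction with simulated data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"score": rng.random(1000)})

# Simulate outcomes: higher score = higher default probability.
df["default"] = (rng.random(1000) < df["score"] * 0.15).astype(int)

# Decile 1 = highest scores (rank descending, then cut into 10 bins).
ranks = df["score"].rank(ascending=False, method="first")
df["decile"] = pd.qcut(ranks, 10, labels=list(range(1, 11))).astype(int)

gains = (df.groupby("decile")
           .agg(freq=("default", "size"), defaults=("default", "sum")))
gains["cum_defaults"] = gains["defaults"].cumsum()
gains["capture_rate"] = gains["cum_defaults"] / df["default"].sum()
print(gains)
```

The cumulative capture rate by decile is what the gains chart plots; the KS column in Table 3 is the gap between this curve and the corresponding curve for non-defaulters.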
Calculation of the Profitability of the Four-Variable Model
If the predicted default probability is greater than a given cutoff, the account is classified as bad; otherwise it is classified as good. A GoodBad flag can then be assigned to the scored data. For example, using a cutoff probability of 0.116, the mean score of the first two deciles, the model development data can be classified into four categories: ERROR1, ERROR2, VALID1 and VALID2. The profitability is listed in the table below.
Table 4.
Outcome type Percentage n Profit Profit per 1,000 accounts
ERROR1 17% 571 ($105,833.14) ($185,347.01)
ERROR2 10% 327 $0.00 $0.00
VALID1 9% 295 $0.00 $0.00
VALID2 64% 2078 $506,009.54 $243,507.96
Total 100% 3271 $400,176.39 $122,340.69
Here ERROR1 is the category in which we classify an account as good but it is actually bad: we lose $105,833.14 on 571 accounts, equivalent to losing $185,347.01 per 1,000 accounts. ERROR2 is the category in which we classify an account as bad but it is actually good: we neither lose nor earn money. VALID1 is the category in which we classify an account as bad and it is actually bad: we successfully avoid a loss. VALID2 is the category in which we classify an account as good and it is actually good: we earn $506,009.54 on 2,078 accounts, equivalent to $243,507.96 per 1,000 accounts. This is a winning business: we earn $400,176.39 on the 3,271 accounts in total, equivalent to $122,340.69 per 1,000 accounts.
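The four-category classification described above can be sketched in Python (not the paper's SAS). The cutoff 0.116 comes from Table 3; the sample accounts are invented for illustration.

```python
# Cutoff-based GoodBad classification into the paper's four categories.
def classify(prob, actual_bad, cutoff):
    """Return ERROR1/ERROR2/VALID1/VALID2 per the definitions above."""
    predicted_bad = prob > cutoff
    if not predicted_bad and actual_bad:
        return "ERROR1"   # accepted a bad account: we take the loss
    if predicted_bad and not actual_bad:
        return "ERROR2"   # rejected a good account: no loss, no gain
    if predicted_bad and actual_bad:
        return "VALID1"   # rejected a bad account: loss avoided
    return "VALID2"       # accepted a good account: we earn the profit

# Invented (probability, actually-bad) pairs scored at cutoff 0.116.
accounts = [(0.05, False), (0.30, True), (0.08, True), (0.40, False)]
labels = [classify(p, bad, cutoff=0.116) for p, bad in accounts]
print(labels)  # ['VALID2', 'VALID1', 'ERROR1', 'ERROR2']
```

Summing per-account profit within each category then yields a report in the shape of Table 4.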
3.2 Twelve-Variable Model
A twelve-variable logistic regression model is established as follows.
Table 5. Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 -1.63090 0.38640 17.81 <.0001
X1 1 0.00138 0.00032 19.16 <.0001
X2 1 0.00716 0.00170 17.66 <.0001
X3 1 0.00006 0.00002 10.45 0.0012
X4 1 -0.01380 0.00441 9.82 0.0017
X5 1 -0.01310 0.00435 9.11 0.0025
X6 1 0.08310 0.02800 8.83 0.003
X7 1 0.00777 0.00270 8.31 0.0039
X8 1 0.00000 0.00000 7.44 0.0064
X9 1 0.16170 0.06240 6.71 0.0096
X10 1 0.07540 0.02930 6.62 0.0101
X11 1 0.03660 0.01600 5.24 0.0221
X12 1 -0.00001 0.00001 4.94 0.0263
Twelve-Variable Model Performance
All twelve variables are significant at the 0.05 level. The model achieves 72 percent concordant pairs and an area under the ROC curve of 0.72.
Table 6. - Association of Predicted Probabilities and Observed Responses
Percent Concordant 72 Somers' D 0.44
Percent Discordant 28 Gamma 0.44
Percent Tied 0 Tau-a 0.171
Pairs 2,082,730 c 0.72
The ROC curves for comparison are plotted in Figure 2.
Figure 2. ROC curves for comparison: 4-Variable Model (area = 0.6886) and 12-Variable Model (area = 0.7199).
The ROC curves show some difference between the probabilities produced by the two models, especially when 1 - Specificity is between 0.25 and 0.75. Will this difference have a big impact on profitability? We explore the gains table, tabulated below.
Table 7.
Decile Freq Cum Freq Default Cum Default Mean Default Cum Default Rate Default Capture Rate Min Score Max Score Mean Score KS
1 245 245 97 97 0.398 0.398 0.532 0.207 1 0.400 0.467
2 246 490 42 139 0.170 0.284 0.760 0.101 0.207 0.143 0.606
3 246 736 17 156 0.067 0.211 0.851 0.062 0.101 0.079 0.595
4 245 981 12 168 0.050 0.171 0.918 0.038 0.062 0.047 0.560
5 246 1227 7 175 0.029 0.143 0.957 0.023 0.038 0.030 0.493
6 245 1472 4 179 0.015 0.121 0.977 0.015 0.023 0.019 0.408
7 246 1718 1 180 0.004 0.105 0.982 0.010 0.015 0.012 0.305
8 245 1964 1 181 0.004 0.092 0.988 0.006 0.010 0.008 0.203
9 245 2209 2 183 0.008 0.083 0.999 0.003 0.006 0.005 0.108
10 246 2455 0 183 0.001 0.074 1 0.000 0.003 0.002 0.000
Total 2455 2455 183 183 0.074 0.074 1 0.000 1 0.074 0.606
The gains and lift results show only a small advantage of the twelve-variable model over the simpler one. KS reaches 0.606.
Calculation of the Profitability of the Twelve-Variable Model
As with the four-variable model, if the predicted default probability is greater than a given cutoff, the account is classified as bad; otherwise it is classified as good. A GoodBad flag can then be assigned to the scored data. For example, using a cutoff probability of 0.143, the mean score of the first two deciles, the model development data can be classified into four categories: ERROR1, ERROR2, VALID1 and VALID2. The profitability is listed in the table below.
Table 8.
Outcome type Percentage n Profit Profit per 1,000 accounts
ERROR1 16.6% 544 ($98,427.90) ($180,933.65)
ERROR2 9.6% 315 $0.00 $0.00
VALID1 9.8% 322 $0.00 $0.00
VALID2 63.9% 2090 $510,778.27 $244,391.52
Total 100% 3271 $412,350.37 $126,062.48
Here ERROR1 is the category in which we classify an account as good but it is actually bad: we lose $98,427.90 on 544 accounts, equivalent to losing $180,933.65 per 1,000 accounts. ERROR2 is the category in which we classify an account as bad but it is actually good: we neither lose nor earn money. VALID1 is the category in which we classify an account as bad and it is actually bad: we successfully avoid a loss. VALID2 is the category in which we classify an account as good and it is actually good: we earn $510,778.27 on 2,090 accounts, equivalent to $244,391.52 per 1,000 accounts. This is also a winning business: we earn $412,350.37 on the 3,271 accounts in total, equivalent to $126,062.48 per 1,000 accounts.
The profit difference on a 1,000-account basis is $3,721.79. Whether to use the four-variable model or the twelve-variable model depends on how much complexity costs when the number of predictors increases from 4 to 12.
4. Conclusion
This paper built two logistic regression models to increase profitability through simplicity in the consumer lending business. The two models were compared on their ability to predict the likelihood of default, and were evaluated on concordance, AUC, KS, efficiency, simplicity, and profitability. The results indicate that a simple model can improve the efficiency of a business while still maintaining profitability. In practice, the business decision to adopt a simple model will depend on the cost and incremental complexity of implementing the model.
References:
1. Borrillo C. M., Boris N. W. Mood disorders. In: Kliegman R. M., Behrman R. E., Jenson H. B., Stanton B. F., eds. Nelson Textbook of Pediatrics. 18th ed. Philadelphia, PA: Saunders Elsevier; 2007. Chap 25.
2. Refer to URL: http://www.lib.wsc.ma.edu/webapa.htm
3. National YRBS Data Users Manual 2009.
4. Peng C.J., Lee K. L., Ingersoll G. M. An Introduction to Logistic Regression Analysis and Reporting. The Journal of Educational Research, 96(1), - P. 3-14.
5. Tabachnick B., and Fidell L. Using Multivariate Statistics (4th Ed.). Needham Heights, MA: Allyn & Bacon, 2001.
6. StatSoft. Electronic Statistics Textbook. URL: http://www.statsoft.com/textbook/stathome.html
7. Stokes M., Davis C. S. Categorical Data Analysis Using the SAS System, SAS Institute Inc., 1995.