Shen Kenneth, The Wardlaw-Hartridge School, NJ
Edison, NJ
E-mail: [email protected]
MAXIMIZING PROFITABILITY THROUGH MODEL SIMPLICITY AND CLUSTER ANALYSIS
Abstract. Machine-learning techniques were used to construct forecasting models of consumer credit risk. Using mimic data from the consumer credit risk domain, binary logistic regression was used to build models that predict the likelihood of default. The goal was to develop a model with as few predictors as possible while keeping the percent concordant at or above 65%. This paper compares a 4-variable model and a 12-variable model on simplicity and profitability. Using the selected model, cluster analysis was then performed to maximize the estimated profitability. The 4-variable model achieves a profit of $122,340.69 per 1000 accounts, with a KS of 0.542. The 12-variable model achieves a profit of $126,062.48 per 1000 accounts, with a KS of 0.606. The profit difference per 1000 accounts is only $3,721.79. The Cluster1 segment of the 4-variable model achieves a profit of $143,616.62 per 1000 accounts and is determined to be the best segment.
Keywords: Machine-learning, logistic regression, consumer credit risk, K-means model.
1. Introduction
When developing a product or service, a predictive statistical model is needed to maximize its profitability. While such a model should predict the likelihood of default as accurately as possible, a model with too many predictors can cost a company both time and money. It takes time to collect data, so it is reasonable to assume that a more complex model would cost additional time. Data collection also costs money, and the more variables a model has, the more data must be acquired. The purposes of this paper are to explore the cost of simplicity and to show how a predictive statistical model can be used to maximize a company's profitability.
2. Method
2.1 Logistic Regression
• A binary response variable, GoodBad, was defined to predict the likelihood of default;
• Before building the models, random samples taken from the dataset were partitioned into two independent files: a training dataset and a validation dataset;
• Models were developed and tested using the backward selection option of the SAS LOGISTIC procedure (Proc Logistic);
• Through the process of model development and validation, 4-variable and 12-variable models were selected for comparison.
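The partition-and-fit steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAS workflow used in the paper: the function names, the 70/30 split, the toy data and the use of plain gradient ascent are all hypothetical choices, and backward selection is not shown.

```python
import math
import random


def split_train_validation(rows, train_fraction=0.7, seed=42):
    """Randomly partition rows into independent training and validation lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def predict_probability(weights, features):
    """Logistic function applied to intercept + weighted sum of predictors."""
    logit = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-logit))


def fit_logistic(rows, n_features, learning_rate=0.1, epochs=2000):
    """Fit the weights by plain gradient ascent on the mean log-likelihood."""
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        gradient = [0.0] * (n_features + 1)
        for features, is_bad in rows:
            error = is_bad - predict_probability(weights, features)
            gradient[0] += error
            for j, x in enumerate(features):
                gradient[j + 1] += error * x
        weights = [w + learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights


# Hypothetical toy data: one predictor scaled to [0, 1];
# accounts with a high predictor value are labelled bad (1).
toy_rows = [((i / 100,), 1 if i >= 50 else 0) for i in range(100)]
train, validation = split_train_validation(toy_rows)
weights = fit_logistic(train, n_features=1)
```

The validation rows are never seen by `fit_logistic`, so scoring them with `predict_probability` gives an independent check of the fitted model, which is the role the validation dataset plays in the method.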
2.2 Model Comparison
• ROC curves, gains tables and KS statistics were generated for each model;
• Data were classified into four categories, ERROR1, ERROR2, VALID1 and VALID2, using a selected cutoff probability;
• Profitability reports were generated for each model using a profitability function.
2.3 Cluster Analysis using K-means model
• Data were standardized using Proc Stdize with the range method;
• K-means was used to partition the data into 3 clusters. The K-means method identifies 3 centroids and then allocates every data point to the nearest centroid, while keeping the distances between the points and their cluster centroids as small as possible;
• Canonical discriminant analysis was performed using Proc Candisc;
• The most profitable subpopulation to target was identified.
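The standardization and clustering steps above can be illustrated with a small Python sketch. It is not the SAS code used in the paper: the function names, the deterministic choice of initial centroids and the toy points are assumptions made only for this example, and range standardization is taken to mean (value - minimum) / (maximum - minimum).

```python
def range_standardize(rows):
    """Rescale every column to [0, 1] using (value - min) / (max - min)."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    spans = [(max(column) - min(column)) or 1.0 for column in columns]
    return [tuple((v - lo) / span for v, lo, span in zip(row, lows, spans))
            for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k=3, max_iterations=100):
    """Lloyd's algorithm: assign points to the nearest centroid, then update."""
    # Assumption: evenly spaced rows are used as initial centroids so the
    # sketch is deterministic; real tools choose seeds differently.
    centroids = [points[i * len(points) // k] for i in range(k)]
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assignment, centroids


# Hypothetical toy accounts with two predictors on very different scales.
toy_points = [(0, 0), (1, 1), (0, 1),
              (50, 50), (51, 50), (50, 51),
              (100, 0), (101, 1), (100, 1)]
standardized = range_standardize(toy_points)
labels, centroids = k_means(standardized, k=3)
```

Standardizing first keeps a predictor with a large numeric range from dominating the distance calculation, which is why it precedes the clustering step.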
3. Result
3.1 4-Variable Model
A logistic regression model using 4 variables was established as below (See Table 1).
Table 1.- Analysis of maximum likelihood estimates for 4-variable model
| Parameter | Parameter Description | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|
| Intercept | | 1 | -0.05250 | 0.25820 | 0.04 | 0.8389 |
| X1 | Utilization of all revolving bankcard trades | 1 | 0.00058 | 0.00025 | 5.23 | 0.0223 |
| X2 | Highest utilization on any single bank revolving trade | 1 | 0.00099 | 0.00039 | 6.46 | 0.0111 |
| X3 | Total collection/charge-off/repossession dollars within 12 months | 1 | 0.00005 | 0.00002 | 10.57 | 0.0011 |
| X4 | Percent of trades never delinquent or derogatory | 1 | -0.03620 | 0.00310 | 136.36 | <.0001 |
3.2 4-Variable Model Performance
All 4 variables are significant at the 0.05 level. The model achieves a percent concordant of 68.8 and an Area Under Curve of 0.688 (See Table 2 and Figure 1). The gains table (See Table 3) is tabulated below; KS reaches 0.542.
Table 2.- Association of predicted probabilities and observed responses for 4-variable model
| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Percent Concordant | 68.8 | Somers' D | 0.377 |
| Percent Discordant | 31.1 | Gamma | 0.378 |
| Percent Tied | 0.2 | Tau-a | 0.147 |
| Pairs | 2,082,730 | c | 0.689 |
Table 3.- Gains table for 4-variable model
| Decile | Default | Cum Default | Mean Default | Cum Default Rate | Default Capture Rate | Min Score | Max Score | Mean Score | KS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 80 | 80 | 0.327 | 0.327 | 0.439 | 0.172 | 1.000 | 0.356 | 0.366 |
| 2 | 46 | 127 | 0.190 | 0.259 | 0.693 | 0.082 | 0.170 | 0.116 | 0.533 |
| 3 | 20 | 147 | 0.080 | 0.199 | 0.801 | 0.049 | 0.082 | 0.062 | 0.542 |
| 4 | 6 | 153 | 0.025 | 0.155 | 0.835 | 0.038 | 0.049 | 0.043 | 0.470 |
| 5 | 9 | 162 | 0.038 | 0.132 | 0.885 | 0.031 | 0.038 | 0.034 | 0.416 |
| 6 | 15 | 177 | 0.062 | 0.120 | 0.969 | 0.029 | 0.031 | 0.029 | 0.399 |
| 7 | 3 | 180 | 0.011 | 0.105 | 0.984 | 0.027 | 0.029 | 0.028 | 0.307 |
| 8 | 2 | 182 | 0.009 | 0.093 | 0.996 | 0.026 | 0.027 | 0.026 | 0.212 |
| 9 | 1 | 183 | 0.003 | 0.083 | 0.999 | 0.025 | 0.026 | 0.025 | 0.108 |
| 10 | 0 | 183 | 0.000 | 0.074 | 1.000 | 0.025 | 0.025 | 0.025 | 0.000 |
| Total | 183 | 183 | 0.074 | 0.074 | 1.000 | 0.025 | 1.000 | 0.074 | 0.542 |
Figure 1. ROC curve for 4-variable model
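To make the link between the Table 1 estimates and a predicted score concrete, the following Python sketch applies the logistic function to the reported coefficients. It assumes that the model estimates the probability of default and that X4 is expressed on a 0 to 100 percent scale; the function name and the example account are hypothetical.

```python
import math

# Estimates copied from Table 1 (4-variable model).
COEFFICIENTS = {
    "intercept": -0.05250,
    "X1": 0.00058,
    "X2": 0.00099,
    "X3": 0.00005,
    "X4": -0.03620,
}


def default_probability(account):
    """Predicted default probability: 1 / (1 + exp(-(b0 + sum of b_i * x_i)))."""
    logit = COEFFICIENTS["intercept"] + sum(
        COEFFICIENTS[name] * account[name] for name in ("X1", "X2", "X3", "X4")
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical best-case account: no utilization, no collections,
# and 100% of trades never delinquent or derogatory.
best_case = {"X1": 0.0, "X2": 0.0, "X3": 0.0, "X4": 100.0}
best_case_score = default_probability(best_case)
```

Under these assumptions the best-case account scores roughly 0.025, which is in line with the minimum score reported in the gains table for the 4-variable model.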
3.3 4-variable Model Profitability Calculation
An account is treated as a bad account when its predicted default probability is greater than a given cutoff; otherwise, it is treated as a good account. GoodBad can thus be assigned to the scored data. For example, if the cutoff probability of 0.116, the mean score of the second decile, is used, the model development data can be classified into 4 categories: ERROR1, ERROR2, VALID1 and VALID2. The resulting profitability is listed in the table below (See Table 4).
Table 4.- Profitability table for 4-variable model
| Outcome type | Percentage | n | Profit | Profit per 1000 accounts |
|---|---|---|---|---|
| ERROR1 | 17% | 571 | ($105,833.14) | ($185,347.01) |
| ERROR2 | 10% | 327 | $0.00 | $0.00 |
| VALID1 | 9% | 295 | $0.00 | $0.00 |
| VALID2 | 64% | 2078 | $506,009.54 | $243,507.96 |
| Total | 100% | 3271 | $400,176.39 | $122,340.69 |
Here ERROR1 is the category in which accounts are assigned to be good but are actually bad, so $105,833.14 is lost on 571 accounts, equivalent to a loss of $185,347.01 per 1000 such accounts. ERROR2 is the category in which accounts are assigned to be bad but are actually good, so money is neither lost nor earned. VALID1 is the category in which accounts are assigned to be bad and are actually bad, so a loss is successfully avoided. VALID2 is the category in which accounts are assigned to be good and are actually good, so $506,009.54 is earned on 2078 accounts, equivalent to $243,507.96 per 1000 such accounts. This is a winning business: $400,176.39 is earned on the total of 3271 accounts, or equivalently $122,340.69 per 1000 accounts.
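The four outcome categories and the per-1000 scaling described above can be written down directly. The Python sketch below is illustrative only; the function names are hypothetical, and the dollar totals are the ones reported in Table 4.

```python
def outcome_type(predicted_probability, is_actually_bad, cutoff=0.116):
    """Classify one scored account into ERROR1, ERROR2, VALID1 or VALID2."""
    assigned_bad = predicted_probability > cutoff
    if assigned_bad:
        # Assigned bad: correct if truly bad, otherwise a rejected good account.
        return "VALID1" if is_actually_bad else "ERROR2"
    # Assigned good: a loss if truly bad, otherwise a profitable good account.
    return "ERROR1" if is_actually_bad else "VALID2"


def profit_per_1000(profit, n_accounts):
    """Scale a dollar total to a base of 1000 accounts."""
    return profit / n_accounts * 1000


# Totals reported in Table 4 for the 4-variable model.
total_per_1000 = profit_per_1000(400176.39, 3271)
error1_per_1000 = profit_per_1000(-105833.14, 571)
```

With the Table 4 totals, `total_per_1000` comes out at about $122,340.69 and `error1_per_1000` at about a $185,347 loss, matching the last column of the table to within rounding.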
3.4 4-Variable Model K-means Cluster and Profitability
Customers were grouped into 3 clusters using the SAS FASTCLUS procedure, which applies the K-means method.
Figure 2: Plot of canonical variables identified by cluster value
The resulting plot (See Figure 2) illustrates the spatial separation of the clusters calculated by the FASTCLUS procedure. Here the blue circles represent Cluster1, which is assumed to be the most profitable segment.
If the same cutoff probability of 0.116 from the second decile is applied to the Cluster1 segment, the profitability table below is obtained (See Table 5). A higher profit per 1000 accounts is achieved than for the full population, so Cluster1 is determined to be the best segment.
Table 5.- Profitability table for Cluster1 segment
| Outcome type | Proportion | n | Profit | Profit per 1000 accounts |
|---|---|---|---|---|
| ERROR1 | 0.1972097 | 523 | ($96,893.07) | ($185,264.00) |
| ERROR2 | 0.0343137 | 91 | $0.00 | $0.00 |
| VALID1 | 0.0286576 | 76 | $0.00 | $0.00 |
| VALID2 | 0.739819 | 1962 | $477,764.34 | $243,508.84 |
| Total | 1 | 2652 | $380,871.27 | $143,616.62 |
3.5 12-Variable Model
A logistic regression model using 12 variables was established as below (See Table 6).
Table 6.- Analysis of maximum likelihood estimates for 12-variable model
| Parameter | Parameter Description | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|
| Intercept | | 1 | -1.63090 | 0.38640 | 17.81 | <.0001 |
| X1 | Utilization of all revolving bankcard trades | 1 | 0.00138 | 0.00032 | 19.16 | <.0001 |
| X2 | Highest utilization on any single bank revolving trade | 1 | 0.00716 | 0.00170 | 17.66 | <.0001 |
| X3 | Total collection/charge-off/repossession dollars within 12 months | 1 | 0.00006 | 0.00002 | 10.45 | 0.0012 |
| X4 | Percent of trades never delinquent or derogatory | 1 | -0.01380 | 0.00441 | 9.82 | 0.0017 |
| X5 | Trades open greater than or equal to 1-year payment ratio | 1 | -0.01310 | 0.00435 | 9.11 | 0.0025 |
| X6 | Inquiries in last 6 months | 1 | 0.08310 | 0.02800 | 8.83 | 0.003 |
| X7 | Aggregate utilization of revolving trades | 1 | 0.00777 | 0.00270 | 8.31 | 0.0039 |
| X8 | Aggregate credit limit on revolving trades | 1 | 0.00000 | 0.00000 | 7.44 | 0.0064 |
| X9 | Number of 30 DPD trades reported within 2 years | 1 | 0.16170 | 0.06240 | 6.71 | 0.0096 |
| X10 | Number of 30-180 DPD within 6 months | 1 | 0.07540 | 0.02930 | 6.62 | 0.0101 |
| X11 | Number of revolving trades with high utilization | 1 | 0.03660 | 0.01600 | 5.24 | 0.0221 |
| X12 | The average credit limit of trades | 1 | -0.00001 | 0.00001 | 4.94 | 0.0263 |
3.6 12-Variable Model Performance
All 12 variables are significant at the 0.05 level. The model achieves a percent concordant of 72 and an Area Under Curve of 0.72 (See Table 7 and Figure 3).
Table 7.- Association of predicted probabilities and observed responses for 12-variable model
| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Percent Concordant | 72 | Somers' D | 0.44 |
| Percent Discordant | 28 | Gamma | 0.44 |
| Percent Tied | 0 | Tau-a | 0.171 |
| Pairs | 2,082,730 | c | 0.72 |
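The association statistics reported for both models follow from comparing every bad account with every good account. The Python sketch below shows the standard definitions, under the assumption that scores are compared exactly (statistical packages may group nearly equal scores as ties); the function name and the toy scores are hypothetical.

```python
def association_statistics(scores_bad, scores_good):
    """Pairwise concordance between bad-account and good-account scores."""
    concordant = discordant = tied = 0
    for bad in scores_bad:
        for good in scores_good:
            if bad > good:
                concordant += 1   # the bad account received the higher score
            elif bad < good:
                discordant += 1
            else:
                tied += 1
    pairs = len(scores_bad) * len(scores_good)
    return {
        "pairs": pairs,
        "percent_concordant": 100.0 * concordant / pairs,
        "percent_discordant": 100.0 * discordant / pairs,
        "percent_tied": 100.0 * tied / pairs,
        "somers_d": (concordant - discordant) / pairs,
        "c": (concordant + 0.5 * tied) / pairs,
    }


# Hypothetical predicted default probabilities.
toy_stats = association_statistics([0.9, 0.6, 0.4], [0.5, 0.4, 0.1])
```

These definitions agree with the reported figures: for the 12-variable model, Somers' D = (72 - 28) / 100 = 0.44 and c = 0.72, and for the 4-variable model, Somers' D = (68.8 - 31.1) / 100 = 0.377.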
[Figure: ROC Curves for Comparisons. Horizontal axis: 1 - Specificity. ROC curve areas: 4-Variable Model (0.6886), 12-Variable Model (0.7199).]
Figure 3. ROC curve for comparison
The ROC curves above show that there is very little difference between the predicted probabilities of the two models, especially when 1 - Specificity is between 0.25 and 0.75. Will this difference have a big impact on profitability? To explore this further, the gains table is tabulated below (See Table 8).
Table 8.- Gains table for 12-variable model
| Decile | Default | Cum Default | Mean Default | Cum Default Rate | Default Capture Rate | Min Score | Max Score | Mean Score | KS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 97 | 97 | 0.398 | 0.398 | 0.532 | 0.207 | 1 | 0.400 | 0.467 |
| 2 | 42 | 139 | 0.170 | 0.284 | 0.760 | 0.101 | 0.207 | 0.143 | 0.606 |
| 3 | 17 | 156 | 0.067 | 0.211 | 0.851 | 0.062 | 0.101 | 0.079 | 0.595 |
| 4 | 12 | 168 | 0.050 | 0.171 | 0.918 | 0.038 | 0.062 | 0.047 | 0.560 |
| 5 | 7 | 175 | 0.029 | 0.143 | 0.957 | 0.023 | 0.038 | 0.030 | 0.493 |
| 6 | 4 | 179 | 0.015 | 0.121 | 0.977 | 0.015 | 0.023 | 0.019 | 0.408 |
| 7 | 1 | 180 | 0.004 | 0.105 | 0.982 | 0.010 | 0.015 | 0.012 | 0.305 |
| 8 | 1 | 181 | 0.004 | 0.092 | 0.988 | 0.006 | 0.010 | 0.008 | 0.203 |
| 9 | 2 | 183 | 0.008 | 0.083 | 0.999 | 0.003 | 0.006 | 0.005 | 0.108 |
| 10 | 0 | 183 | 0.001 | 0.074 | 1 | 0.000 | 0.003 | 0.002 | 0.000 |
| Total | 183 | 183 | 0.074 | 0.074 | 1 | 0.000 | 1 | 0.074 | 0.606 |
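The KS column of a gains table is the gap between the cumulative share of defaults captured and the cumulative share of good accounts captured, band by band, with the model KS being the largest gap. The Python sketch below assumes that definition; the function name and the four-band toy counts are hypothetical and are not taken from the paper's data.

```python
def ks_by_band(bads_per_band, goods_per_band):
    """KS per score band, with bands ordered from highest to lowest score."""
    total_bads = sum(bads_per_band)
    total_goods = sum(goods_per_band)
    cum_bads = cum_goods = 0
    ks_values = []
    for bads, goods in zip(bads_per_band, goods_per_band):
        cum_bads += bads
        cum_goods += goods
        # Cumulative bad capture rate minus cumulative good capture rate.
        ks_values.append(cum_bads / total_bads - cum_goods / total_goods)
    return ks_values


# Hypothetical counts for four score bands (riskiest band first).
toy_bads = [50, 30, 10, 10]
toy_goods = [10, 20, 30, 40]
ks_values = ks_by_band(toy_bads, toy_goods)
model_ks = max(ks_values)
```

In the toy data the gap peaks in the second band and falls to zero in the last one, the same shape seen in the KS columns of the gains tables.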
The Gains and Lift charts show only a small advantage of the 12-variable model over the simpler one. KS reaches 0.606.
3.7 12-variable Model Profitability Calculation
As in the 4-variable model profitability calculation, an account is treated as a bad account when its predicted default probability is greater than a given cutoff; otherwise, it is treated as a good account. GoodBad can thus be assigned to the scored data. For example, if the cutoff probability of 0.143, the mean score of the second decile, is used, the model development data can be classified into 4 categories: ERROR1, ERROR2, VALID1 and VALID2. The resulting profitability is listed in the table below (See Table 9).
Table 9.- Profitability table for 12-variable model
| Outcome type | Proportion | n | Profit | Profit per 1000 accounts |
|---|---|---|---|---|
| ERROR1 | 0.16631 | 544 | ($98,427.90) | ($180,933.65) |
| ERROR2 | 0.0963008 | 315 | $0.00 | $0.00 |
| VALID1 | 0.0984408 | 322 | $0.00 | $0.00 |
| VALID2 | 0.6389483 | 2090 | $510,778.27 | $244,391.52 |
| Total | 1 | 3271 | $412,350.37 | $126,062.48 |
Here ERROR1 is the category in which accounts are assigned to be good but are actually bad, so $98,427.90 is lost on 544 accounts, equivalent to a loss of $180,933.65 per 1000 such accounts. ERROR2 is the category in which accounts are assigned to be bad but are actually good, so money is neither lost nor earned. VALID1 is the category in which accounts are assigned to be bad and are actually bad, so a loss is successfully avoided. VALID2 is the category in which accounts are assigned to be good and are actually good, so $510,778.27 is earned on 2090 accounts, equivalent to $244,391.52 per 1000 such accounts. This is also a winning business: $412,350.37 is earned on the total of 3271 accounts, or equivalently $126,062.48 per 1000 accounts.
4. Discussion
The profit difference per 1000 accounts is $3,721.79. The 12-variable model appears to have a small advantage over the 4-variable model. However, the cost in terms of time and money also needs to be taken into consideration. The choice between the 4-variable model and the 12-variable model depends on how much the added complexity costs when the number of predictors is increased from 4 to 12.
5. Conclusion
This paper built two logistic regression models to predict the likelihood of default. The two models were evaluated and compared on concordance, AUC, KS, simplicity, and profitability. No recommendation is made on which model is the better choice for a company, but the final profitability that each model can deliver is calculated; the choice will depend on the cost and incremental complexity of implementing each model. The analysis also completed an unsupervised clustering step that identified the most profitable cluster segment to target.
References:
1. Credit Default Risk Prediction. Available at: URL:https://repods.io/en/blog/Credit-default-risk-prediction
2. Modern Machine Learning Algorithms: Strengths and Weaknesses. Available at: URL:https://elitedatascience.com/machine-learning-algorithms
3. Peng C. J., Lee K. L., Ingersoll G. M. An Introduction to Logistic Regression Analysis and Reporting. The Journal of Educational Research, 96(1), P. 3-14.
4. Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters, 2006; 27(8): 861-874.
5. Stokes M., Davis C. S. Categorical Data Analysis Using the SAS System, SAS Institute Inc., 2001.
6. SAS/STAT® 15.1 User's Guide: The FASTCLUS Procedure. Available at: URL:https://support.sas.com/documentation/onlinedoc/stat/151/fastclus.pdf