Research article: 'Maximizing profitability through model simplicity and cluster analysis'

CC BY

Shen Kenneth, The Wardlaw-Hartridge School, NJ

Edison, NJ

E-mail: kenshen12.26@gmail.com

MAXIMIZING PROFITABILITY THROUGH MODEL SIMPLICITY AND CLUSTER ANALYSIS

Abstract. Machine-learning techniques were used to construct forecasting models of consumer credit risk. Using mimic data from the consumer credit risk domain, binary logistic regression was used to build models that predict the likelihood of default. The goal was to develop a model with as few predictors as possible while keeping the percent concordant at or above 65%. This paper compares a 4-variable model and a 12-variable model on simplicity and profitability. Using the selected model, cluster analysis was then performed to maximize the estimated profitability. The 4-variable model achieves a profit of $122,340.69 per 1000 accounts, with a KS of 0.542. The 12-variable model achieves a profit of $126,062.48 per 1000 accounts, with a KS of 0.606. The profit difference per 1000 accounts is only $3,721.79. The Cluster1 segment of the 4-variable model achieves a profit of $143,616.62 per 1000 accounts and is determined to be the best segment.

Keywords: Machine-learning, logistic regression, consumer credit risk, K-means model.

1. Introduction

When developing a product or service, a predictive statistical model is needed to maximize its profitability. While such a model should predict the likelihood of default as accurately as possible, a model with too many predictors can cost a company both time and money. Collecting data takes time, so it is reasonable to assume that a more complex model would take longer to supply with inputs. Data collection also costs money, and the more variables a model contains, the more data must be acquired. The purposes of this paper are to explore the cost of simplicity and to show how a predictive statistical model can be used to increase a company's profitability.

2. Method

2.1 Logistic Regression

• The binary response variable GoodBad was defined to predict the likelihood of default;

• Before building the models, random samples taken from the dataset were partitioned into two independent files: a training dataset and a validation dataset;

• Models were developed and tested using the backward selection option of the SAS PROC LOGISTIC procedure;

• Through the process of model development and validation, 4-variable and 12-variable models were selected for comparison.
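The partitioning step in the list above can be illustrated with a minimal Python sketch. The paper does its work in SAS and does not state the split ratio, so the 70/30 fraction, the seed, and the `partition` helper are assumptions made only for illustration.

```python
import random

def partition(records, train_fraction=0.7, seed=42):
    """Randomly split records into independent training and validation lists."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(records)    # copy so the caller's data is not reordered
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical account identifiers stand in for the real records.
accounts = list(range(1000))
train, valid = partition(accounts)
print(len(train), len(valid))
```

Because every record lands in exactly one of the two lists, the validation set can be used to check a model fitted on the training set without leakage between them.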

2.2 Model Comparison

• ROC curves, gains tables and the KS test were generated for each model;

• Data was classified into four categories, ERROR1, ERROR2, VALID1 and VALID2, using a selected cutoff probability;

• Profitability reports were generated for each model using a profitability function.
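To make the comparison steps concrete, the following Python sketch computes a KS statistic and assigns the four outcome categories on a toy dataset. The scores, labels and the 0.5 cutoff are hypothetical, the function names are illustrative, and the KS routine assumes all scores are distinct; it is not the SAS code used in the paper.

```python
from collections import Counter

def ks_statistic(scores, is_bad):
    """Largest gap between cumulative bad and cumulative good capture rates."""
    ranked = sorted(zip(scores, is_bad), key=lambda pair: -pair[0])  # riskiest first
    total_bad = sum(is_bad)
    total_good = len(is_bad) - total_bad
    cum_bad = cum_good = 0
    best = 0.0
    for _, bad in ranked:
        if bad:
            cum_bad += 1
        else:
            cum_good += 1
        best = max(best, cum_bad / total_bad - cum_good / total_good)
    return best

def outcome(score, bad, cutoff):
    """Map a scored account to ERROR1, ERROR2, VALID1 or VALID2."""
    if score > cutoff:                      # assigned bad
        return "VALID1" if bad else "ERROR2"
    return "ERROR1" if bad else "VALID2"    # assigned good

# Toy data: predicted default probabilities and actual outcomes (1 = bad).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
is_bad = [1, 1, 0, 1, 0, 1, 0, 0]

ks = ks_statistic(scores, is_bad)
counts = Counter(outcome(s, b, cutoff=0.5) for s, b in zip(scores, is_bad))
print(ks, dict(counts))
```

ERROR1 (a bad account assigned good) is the only category that loses money in the profitability reports, which is why the choice of cutoff matters.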

2.3 Cluster Analysis using K-means model

• Data was standardized using the SAS PROC STDIZE procedure with the range method;

• K-means was used to partition the data into 3 clusters. The K-means method identifies 3 centroids and then allocates every data point to the nearest centroid, while keeping the within-cluster distances to the centroids as small as possible;

• Canonical discriminant analysis was performed using the SAS PROC CANDISC procedure;

• The most profitable subpopulation to target was identified.
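The standardization and clustering steps above can be sketched in plain Python. This is a minimal illustration of range standardization followed by Lloyd's K-means iteration, not the FASTCLUS implementation: the toy points are hypothetical, and the starting centroids are picked by hand (FASTCLUS selects its seeds with its own rules).

```python
def range_standardize(rows):
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]  # guard constant columns
    return [[(x - lo) / span for x, lo, span in zip(row, lows, spans)] for row in rows]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, initial_centroids, max_iterations=100):
    """Alternate nearest-centroid assignment and centroid update until stable."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    labels = []
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:
            break                            # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:                      # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical accounts described by two variables on very different scales.
raw = [[1, 100], [2, 110], [1.5, 105],
       [10, 500], [11, 520], [10.5, 510],
       [20, 900], [21, 950], [20.5, 920]]
scaled = range_standardize(raw)
labels, centroids = k_means(scaled, [scaled[0], scaled[4], scaled[8]])
print(labels)
```

Standardizing first keeps the variable with the larger numeric range from dominating the distance calculation.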

3. Result

3.1 4-Variable Model

A logistic regression model using 4 variables was established as shown below (see Table 1).

Table 1.- Analysis of maximum likelihood estimates for 4-variable model

| Parameter | Parameter Description | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|
| Intercept | | 1 | -0.05250 | 0.25820 | 0.04 | 0.8389 |
| X1 | Utilization of all revolving bankcard trades | 1 | 0.00058 | 0.00025 | 5.23 | 0.0223 |
| X2 | Highest utilization on any single bank revolving trade | 1 | 0.00099 | 0.00039 | 6.46 | 0.0111 |
| X3 | Total collection/charge off/repossession dollars within 12 months | 1 | 0.00005 | 0.00002 | 10.57 | 0.0011 |
| X4 | Percent of trades never delinquent or derogatory | 1 | -0.03620 | 0.00310 | 136.36 | <.0001 |

3.2 4-Variable Model Performance

All 4 variables are significant at the 0.05 level. The model achieves a percent concordant of 68.8 and an area under the curve of 0.689 (see Table 2 and Figure 1). The gains table (see Table 3) is tabulated below; KS reaches 0.542.

Table 2.- Association of predicted probabilities and observed responses for 4-variable model

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Percent Concordant | 68.8 | Somers' D | 0.377 |
| Percent Discordant | 31.1 | Gamma | 0.378 |
| Percent Tied | 0.2 | Tau-a | 0.147 |
| Pairs | 2,082,730 | c | 0.689 |

Table 3.- Gains table for 4-variable model

| Decile | Default | Cum Default | Mean Default | Cum Default Rate | Default Capture Rate | Min Score | Max Score | Mean Score | KS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 80 | 80 | 0.327 | 0.327 | 0.439 | 0.172 | 1.000 | 0.356 | 0.366 |
| 2 | 46 | 127 | 0.190 | 0.259 | 0.693 | 0.082 | 0.170 | 0.116 | 0.533 |
| 3 | 20 | 147 | 0.080 | 0.199 | 0.801 | 0.049 | 0.082 | 0.062 | 0.542 |
| 4 | 6 | 153 | 0.025 | 0.155 | 0.835 | 0.038 | 0.049 | 0.043 | 0.470 |
| 5 | 9 | 162 | 0.038 | 0.132 | 0.885 | 0.031 | 0.038 | 0.034 | 0.416 |
| 6 | 15 | 177 | 0.062 | 0.120 | 0.969 | 0.029 | 0.031 | 0.029 | 0.399 |
| 7 | 3 | 180 | 0.011 | 0.105 | 0.984 | 0.027 | 0.029 | 0.028 | 0.307 |
| 8 | 2 | 182 | 0.009 | 0.093 | 0.996 | 0.026 | 0.027 | 0.026 | 0.212 |
| 9 | 1 | 183 | 0.003 | 0.083 | 0.999 | 0.025 | 0.026 | 0.025 | 0.108 |
| 10 | 0 | 183 | 0.000 | 0.074 | 1.000 | 0.025 | 0.025 | 0.025 | 0.000 |
| Total | 183 | 183 | 0.074 | 0.074 | 1.000 | 0.025 | 1.000 | 0.074 | 0.542 |

Figure 1. ROC curve for 4-variable model
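As an illustration of how the Table 1 estimates turn into a score, the sketch below applies the standard logistic function to the linear predictor. It assumes the fitted event is default (which the coefficient signs suggest but the text does not state), and the two applicants and their input values are hypothetical, since the units of X1 to X4 are not given.

```python
import math

# Estimates copied from Table 1 (4-variable model).
INTERCEPT = -0.05250
COEFFICIENTS = {"X1": 0.00058, "X2": 0.00099, "X3": 0.00005, "X4": -0.03620}

def default_probability(applicant):
    """Logistic transform of the linear predictor: 1 / (1 + exp(-logit))."""
    logit = INTERCEPT + sum(COEFFICIENTS[name] * applicant[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical applicants; the values are illustrative, not taken from the data.
low_risk = {"X1": 20.0, "X2": 30.0, "X3": 0.0, "X4": 95.0}
high_risk = {"X1": 90.0, "X2": 100.0, "X3": 2000.0, "X4": 40.0}

p_low = default_probability(low_risk)
p_high = default_probability(high_risk)
print(round(p_low, 3), round(p_high, 3))
```

X4 carries by far the largest Wald chi-square in Table 1, and in this sketch it is the share of never-delinquent trades that moves the score most.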

3.3 4-variable Model Profitability Calculation

Assume that an account is classified as bad when its predicted default probability is greater than a given cutoff, and as good otherwise, so that a predicted GoodBad value can be assigned to the scored data. For example, if the cutoff probability of 0.116 (the mean score of the second decile) is used, the model development data can be classified into 4 categories: ERROR1, ERROR2, VALID1 and VALID2. The resulting profitability is listed in the table below (see Table 4).

Table 4.- Profitability table for 4-variable model

| Outcome type | Percentage | n | Profit | Profit per 1000 accounts |
|---|---|---|---|---|
| ERROR1 | 17% | 571 | ($105,833.14) | ($185,347.01) |
| ERROR2 | 10% | 327 | $0.00 | $0.00 |
| VALID1 | 9% | 295 | $0.00 | $0.00 |
| VALID2 | 64% | 2078 | $506,009.54 | $243,507.96 |
| Total | 100% | 3271 | $400,176.39 | $122,340.69 |

Here ERROR1 is the category in which accounts are assigned to be good but are actually bad, so $105,833.14 is lost on 571 accounts; equivalently, $185,347.01 is lost per 1000 accounts. ERROR2 is the category in which accounts are assigned to be bad but are actually good, so money is neither lost nor earned. VALID1 is the category in which accounts are assigned to be bad and are actually bad, so a loss is successfully avoided. VALID2 is the category in which accounts are assigned to be good and are actually good, so $506,009.54 is earned on 2078 accounts; equivalently, $243,507.96 is earned per 1000 accounts. This is a winning business in which $400,176.39 can be earned on the total 3271 accounts; equivalently, $122,340.69 can be earned per 1000 accounts.
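The per-1000 figures in Table 4 follow from dividing each category's profit by its own account count. The Python sketch below re-derives them from the n and Profit columns; the dictionary layout and the `per_1000` helper are illustrative, not the paper's profitability function.

```python
# (number of accounts, total profit in dollars) per outcome, copied from Table 4.
TABLE_4 = {
    "ERROR1": (571, -105833.14),
    "ERROR2": (327, 0.00),
    "VALID1": (295, 0.00),
    "VALID2": (2078, 506009.54),
}

def per_1000(profit, n):
    """Scale a total profit on n accounts to a base of 1000 accounts."""
    return profit / n * 1000

total_n = sum(n for n, _ in TABLE_4.values())
total_profit = sum(profit for _, profit in TABLE_4.values())

for name, (n, profit) in TABLE_4.items():
    print(f"{name}: share={n / total_n:.0%} per_1000={per_1000(profit, n):,.2f}")
print(f"Total: n={total_n} per_1000={per_1000(total_profit, total_n):,.2f}")
```

Summing the two non-zero rows gives $400,176.40, one cent above the printed total, which is consistent with rounding in the source figures.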

3.4 4-Variable Model K-means Cluster and Profitability

Customers were partitioned into 3 clusters using the SAS FASTCLUS procedure with the K-means method.

Figure 2: Plot of canonical variables identified by cluster value

The resulting plot (see Figure 2) illustrates the spatial separation of the clusters calculated by the FASTCLUS procedure. Blue circles represent Cluster1, which is assumed to be the most profitable segment.

If the same cutoff probability of 0.116 from the second decile is applied to the Cluster1 segment, the profitability table below is obtained (see Table 5). A higher profit per 1000 accounts is achieved, so Cluster1 is determined to be the best segment.

Table 5.- Profitability table for Cluster1 segment

| Outcome type | Percentage | n | Profit | Profit per 1000 accounts |
|---|---|---|---|---|
| ERROR1 | 0.1972097 | 523 | ($96,893.07) | ($185,264.00) |
| ERROR2 | 0.0343137 | 91 | $0.00 | $0.00 |
| VALID1 | 0.0286576 | 76 | $0.00 | $0.00 |
| VALID2 | 0.739819 | 1962 | $477,764.34 | $243,508.84 |
| Total | 1 | 2652 | $380,871.27 | $143,616.62 |
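One way to see why Cluster1 stands out is to compare it with the accounts outside it. The sketch below assumes that the 2652 Cluster1 accounts are a subset of the 3271 scored accounts in Table 4 (the text implies this but does not say so explicitly) and derives the remainder from the two Total rows.

```python
# (number of accounts, total profit) copied from the Total rows of Tables 4 and 5.
ALL_ACCOUNTS = (3271, 400176.39)
CLUSTER1 = (2652, 380871.27)

def per_1000(n, profit):
    """Scale a total profit on n accounts to a base of 1000 accounts."""
    return profit / n * 1000

# Assumption: everything not in Cluster1 belongs to the other two clusters.
rest_n = ALL_ACCOUNTS[0] - CLUSTER1[0]
rest_profit = ALL_ACCOUNTS[1] - CLUSTER1[1]

cluster1_rate = per_1000(*CLUSTER1)
rest_rate = per_1000(rest_n, rest_profit)
uplift = cluster1_rate - per_1000(*ALL_ACCOUNTS)
print(rest_n, round(cluster1_rate, 2), round(rest_rate, 2), round(uplift, 2))
```

Under that assumption, the remaining accounts earn far less per 1000 than Cluster1, which is what makes targeting the segment attractive.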

3.5 12-Variable Model

A logistic regression model using 12 variables was established as shown below (see Table 6).

Table 6.- Analysis of maximum likelihood estimates for 12-variable model

| Parameter | Parameter Description | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|
| Intercept | | 1 | -1.63090 | 0.38640 | 17.81 | <.0001 |
| X1 | Utilization of all revolving bankcard trades | 1 | 0.00138 | 0.00032 | 19.16 | <.0001 |
| X2 | Highest utilization on any single bank revolving trade | 1 | 0.00716 | 0.00170 | 17.66 | <.0001 |
| X3 | Total collection/charge off/repossession dollars within 12 months | 1 | 0.00006 | 0.00002 | 10.45 | 0.0012 |
| X4 | Percent of trades never delinquent or derogatory | 1 | -0.01380 | 0.00441 | 9.82 | 0.0017 |
| X5 | Trades open greater than or equal to 1-year payment ratio | 1 | -0.01310 | 0.00435 | 9.11 | 0.0025 |
| X6 | Inquiries in last 6 months | 1 | 0.08310 | 0.02800 | 8.83 | 0.003 |
| X7 | Aggregate utilization of revolving trades | 1 | 0.00777 | 0.00270 | 8.31 | 0.0039 |
| X8 | Aggregate credit limit on revolving trades | 1 | 0.00000 | 0.00000 | 7.44 | 0.0064 |
| X9 | Number of 30 DPD trades reported within 2 years | 1 | 0.16170 | 0.06240 | 6.71 | 0.0096 |
| X10 | Number of 30-180 DPD within 6 months | 1 | 0.07540 | 0.02930 | 6.62 | 0.0101 |
| X11 | Number of revolving trades with high utilization | 1 | 0.03660 | 0.01600 | 5.24 | 0.0221 |
| X12 | The average credit limit of trades | 1 | -0.00001 | 0.00001 | 4.94 | 0.0263 |

3.6 12-Variable Model Performance

All 12 variables are significant at the 0.05 level. The model achieves a percent concordant of 72 and an area under the curve of 0.72 (see Table 7 and Figure 3).

Table 7.- Association of predicted probabilities and observed responses for 12-variable model

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Percent Concordant | 72 | Somers' D | 0.44 |
| Percent Discordant | 28 | Gamma | 0.44 |
| Percent Tied | 0 | Tau-a | 0.171 |
| Pairs | 2,082,730 | c | 0.72 |
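The rank-correlation statistics in the association tables are simple functions of the concordant, discordant and tied percentages. The sketch below uses the usual definitions (Somers' D as concordant minus discordant over all pairs, Gamma as the same difference over untied pairs, and c as concordant plus half of the ties) and checks them against the reported values; the function name is illustrative.

```python
def association_stats(pct_concordant, pct_discordant, pct_tied):
    """Return (Somers' D, Gamma, c) from pair percentages."""
    somers_d = (pct_concordant - pct_discordant) / 100.0
    gamma = (pct_concordant - pct_discordant) / (pct_concordant + pct_discordant)
    c = (pct_concordant + 0.5 * pct_tied) / 100.0
    return somers_d, gamma, c

# Percentages copied from the 12-variable and 4-variable association tables.
d12, gamma12, c12 = association_stats(72, 28, 0)
d4, gamma4, c4 = association_stats(68.8, 31.1, 0.2)
print(d12, gamma12, c12)
print(round(d4, 3), round(gamma4, 3), round(c4, 3))
```

For the 4-variable model the recomputed Gamma agrees with the reported 0.378 only to within rounding, because the published percentages are themselves rounded to one decimal place.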

Figure 3. ROC curves for comparison (x-axis: 1 - Specificity; legend: 4-Variable Model, area 0.6886; 12-Variable Model, area 0.7199)

The ROC curves above show very little difference between the two models, especially when 1 - Specificity is between 0.25 and 0.75. Will this difference have a large impact on profitability? The gains table (see Table 8), tabulated below, is explored further.

Table 8.- Gains table for 12-variable model

| Decile | Default | Cum Default | Mean Default | Cum Default Rate | Default Capture Rate | Min Score | Max Score | Mean Score | KS |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 97 | 97 | 0.398 | 0.398 | 0.532 | 0.207 | 1 | 0.400 | 0.467 |
| 2 | 42 | 139 | 0.170 | 0.284 | 0.760 | 0.101 | 0.207 | 0.143 | 0.606 |
| 3 | 17 | 156 | 0.067 | 0.211 | 0.851 | 0.062 | 0.101 | 0.079 | 0.595 |
| 4 | 12 | 168 | 0.050 | 0.171 | 0.918 | 0.038 | 0.062 | 0.047 | 0.560 |
| 5 | 7 | 175 | 0.029 | 0.143 | 0.957 | 0.023 | 0.038 | 0.030 | 0.493 |
| 6 | 4 | 179 | 0.015 | 0.121 | 0.977 | 0.015 | 0.023 | 0.019 | 0.408 |
| 7 | 1 | 180 | 0.004 | 0.105 | 0.982 | 0.010 | 0.015 | 0.012 | 0.305 |
| 8 | 1 | 181 | 0.004 | 0.092 | 0.988 | 0.006 | 0.010 | 0.008 | 0.203 |
| 9 | 2 | 183 | 0.008 | 0.083 | 0.999 | 0.003 | 0.006 | 0.005 | 0.108 |
| 10 | 0 | 183 | 0.001 | 0.074 | 1 | 0.000 | 0.003 | 0.002 | 0.000 |
| Total | 183 | 183 | 0.074 | 0.074 | 1 | 0.000 | 1 | 0.074 | 0.606 |

The Gains and Lift charts show only a small advantage of the 12-variable model over the simpler one. KS reaches 0.606.

3.7 12-variable Model Profitability Calculation

As in the 4-variable model profitability calculation, an account is classified as bad when its predicted default probability is greater than a given cutoff; otherwise, it is classified as good. A predicted GoodBad value can then be assigned to the scored data. For example, if the cutoff probability of 0.143 (the mean score of the second decile) is used, the model development data can be classified into 4 categories: ERROR1, ERROR2, VALID1 and VALID2. The profitability is listed in the table below (see Table 9).

Table 9.- Profitability table for 12-variable model

| Outcome type | Percentage | n | Profit | Profit per 1000 accounts |
|---|---|---|---|---|
| ERROR1 | 0.16631 | 544 | ($98,427.90) | ($180,933.65) |
| ERROR2 | 0.0963008 | 315 | $0.00 | $0.00 |
| VALID1 | 0.0984408 | 322 | $0.00 | $0.00 |
| VALID2 | 0.6389483 | 2090 | $510,778.27 | $244,391.52 |
| Total | 1 | 3271 | $412,350.37 | $126,062.48 |

Here ERROR1 is the category in which accounts are assigned to be good but are actually bad, so $98,427.90 is lost on 544 accounts; equivalently, $180,933.65 is lost per 1000 accounts. ERROR2 is the category in which accounts are assigned to be bad but are actually good, so money is neither lost nor earned. VALID1 is the category in which accounts are assigned to be bad and are actually bad, so a loss is successfully avoided. VALID2 is the category in which accounts are assigned to be good and are actually good, so $510,778.27 is earned on 2090 accounts; equivalently, $244,391.52 is earned per 1000 accounts. This is also a winning business in which $412,350.37 is earned on the total 3271 accounts; equivalently, $126,062.48 can be earned per 1000 accounts.

4. Discussion
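The comparison between the two profitability tables reduces to a small calculation. The sketch below takes the two per-1000 totals and expresses the gap per account; the break-even reading (how much the eight extra predictors may cost per account before the larger model stops paying off) is an illustrative interpretation, not a figure from the paper.

```python
# Profit per 1000 accounts, copied from the Total rows of the two profitability tables.
PROFIT_PER_1000 = {"4-variable": 122340.69, "12-variable": 126062.48}

gap_per_1000 = PROFIT_PER_1000["12-variable"] - PROFIT_PER_1000["4-variable"]
gap_per_account = gap_per_1000 / 1000

# Illustrative assumption: the extra cost is spread evenly over the added predictors.
extra_predictors = 12 - 4
break_even_per_predictor = gap_per_account / extra_predictors

print(f"gap per 1000 accounts: {gap_per_1000:,.2f}")
print(f"gap per account: {gap_per_account:.2f}")
print(f"break-even cost per extra predictor per account: {break_even_per_predictor:.3f}")
```

If acquiring and maintaining the additional variables costs more than this gap per account, the simpler model is the more profitable choice.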

The profit difference per 1000 accounts is $3,721.79. The 12-variable model appears to have a small advantage over the 4-variable model. However, the cost in terms of time and money also needs to be taken into consideration. The choice between the 4-variable model and the 12-variable model depends on how much the added complexity costs when the number of predictors is increased from 4 to 12.

5. Conclusion

This paper built two logistic regression models to predict the likelihood of default. The two models were evaluated and compared on concordance, AUC, KS, simplicity, and profitability. No recommendation is made on which model is the better choice for a company, but the final profitability that each model can deliver is calculated. The choice will depend on the cost and incremental complexity of implementing the models. The analysis also completed an unsupervised clustering process that identified the most profitable cluster segment.

References:

1. Credit Default Risk Prediction. Available at: URL:https://repods.io/en/blog/Credit-default-risk-prediction

2. Modern Machine Learning Algorithms: Strengths and Weaknesses. Available at: URL:https://elitedatascience.com/machine-learning-algorithms

3. Peng C. J., Lee K. L., Ingersoll G. M. An Introduction to Logistic Regression Analysis and Reporting. The Journal of Educational Research, 96(1), P. 3-14.

4. Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters, 2006; 27(8): 861-874.

5. Stokes M., Davis C. S. Categorical Data Analysis Using the SAS System, SAS Institute Inc., 2001.

6. SAS/STAT® 15.1 User's Guide: The FASTCLUS Procedure. Available at: URL:https://support.sas.com/documentation/onlinedoc/stat/151/fastclus.pdf
