
UDC 62

Koffi Evelyne Flore

Master in Data Science, Samara, Russia

THE VALUE OF INSURANCE EVALUATING THE SEARCH FOR SOLUTIONS TO PROBLEMS BASED ON MACHINE LEARNING METHODS: CASE OF CHURN IN INSURANCE

Abstract

The object of the research is to find algorithms to solve the churn problem in insurance using machine learning methods.

Keywords

Churn, Random Forest, XGBoost, The wisdom of the crowd.

The insurance industry has always relied on data to calculate risk and provide personalized scores. Today, the sector is undergoing a deep digital transformation thanks to technologies such as machine learning. Insurers use machine learning to improve operational efficiency, enhance customer service, and even detect fraud.

The churn rate is the indicator that measures the loss of customers, users, or subscribers that a business suffers over a given period.
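Concretely, the churn rate is commonly computed as follows (the article does not give the formula explicitly, so this is the standard definition):

$\text{churn rate} = \frac{\text{customers lost during the period}}{\text{customers at the start of the period}} \times 100\%$

For example, a portfolio of 10,000 policyholders that loses 300 of them over a quarter has a quarterly churn rate of 3%.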

The churn rate is widely used in banking and insurance, sectors that historically had very low churn rates [1]. According to Corporate Ink for CallMiner (June 24, 2020), customer switching costs the insurance industry £5.04 billion [2].

Random Forest

Key idea behind Random Forest

The wisdom of the crowd is a phenomenon linked to the law of large numbers: a crowd of individuals is more often right than a single expert, provided that the crowd is sufficiently large, competent, and diverse. In machine learning we can use this concept to create ensembles of models that achieve better performance, as long as the three criteria of size, competence, and diversity are met.
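A minimal simulation (illustrative, not from the article) of this law-of-large-numbers effect, assuming the voters' errors are independent:

# Majority voting among many independent, weakly competent voters
# is far more accurate than any single voter.
import numpy as np

rng = np.random.default_rng(0)
n_voters, n_cases, p_correct = 101, 10_000, 0.6   # each voter is right 60% of the time

votes = rng.random((n_voters, n_cases)) < p_correct    # True = correct vote
majority_correct = votes.sum(axis=0) > n_voters / 2    # majority vote per case

print(f"single voter accuracy:  {p_correct:.2f}")
print(f"majority vote accuracy: {majority_correct.mean():.2f}")   # ~0.98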

Figure 1 - Random Forest Classifier illustration

As the illustration above shows, a random forest:

• Can be perceived as the wisdom of the crowd.

• Creates several copies of a decision tree model, training each tree on a random sample of the dataset.

• Uses a sampling technique called bootstrapping (see the sketch below).

• Can be used to solve both classification and regression problems.
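A minimal sketch of the bootstrapping idea (the dataset and hyperparameters are illustrative, not the article's; in practice sklearn's RandomForestClassifier packages all of this):

# Bagging by hand: each tree is trained on a bootstrap sample,
# i.e. a random sample of the dataset drawn with replacement.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))    # bootstrap sample
    trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))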

The Mathematics Behind Random Forest

How to build each learner

To determine how the data branches from each node of the decision tree, we use the Gini or entropy index.

$Gini = 1 - \sum_{i=1}^{K} p_i^2$

Gini index

$Entropy = -\sum_{i=1}^{K} p_i \log_2(p_i)$

Entropy index

where $p_i$ is the frequency of class $i$.
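Both indices are straightforward to compute from the class frequencies in a node (illustrative code, not from the article):

# Gini and entropy impurity of a node, given the labels that reach it.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

node = [0, 0, 0, 1, 1]    # 3 instances of class 0, 2 of class 1
print(gini(node))         # 0.48
print(entropy(node))      # ~0.971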

What about the Random Forest learner?

The Random Forest learner is an aggregation of the individual tree learners:

$g_{bag}(x) = \arg\max_{k=1,\dots,K} \sum_{l=1}^{B} \mathbb{1}\big(Q_l(x) = k\big)$

where:

- $K$ is the number of classes,

- $B$ is the number of trees,

- $Q_l$ is the $l$-th learner.
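In code, this arg-max is just a majority vote across the trees (continuing the illustrative bagging sketch above):

# Each tree casts one vote per sample; the most-voted class wins.
import numpy as np

all_votes = np.stack([t.predict(X) for t in trees])          # shape (B, n_samples)
y_hat = np.array([np.bincount(col).argmax() for col in all_votes.T])
print((y_hat == y).mean())                                   # ensemble accuracy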

XGBoost classifier

Key idea behind XGBoost

Figure 2 - XGBoost Classifier illustration (weighted samples feed a series of weak classifiers, from the 1st to the N-th, which are combined into the final classifier)

XGBoost is a boosting algorithm:

- Can be perceived as the wisdom of the crowd.

- We train our models in series (a minimal sketch follows below).

- A decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework.
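A minimal sketch of the "in series" idea (illustrative, not the article's code): each new tree is fitted to the residual errors of the current ensemble, which is the core mechanism gradient boosting builds on.

# Hand-rolled gradient boosting for regression with squared loss:
# every tree corrects the residuals left by the previous trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

eta, trees = 0.1, []                 # learning rate and the series of trees
pred = np.zeros_like(y)              # start from a zero model
for _ in range(200):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)   # fit residuals
    trees.append(tree)
    pred += eta * tree.predict(X)    # add the new weak learner's contribution

print(np.mean((y - pred) ** 2))      # training MSE shrinks as trees are added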

The Mathematics Behind XGBoost [3]

XGBoost objective function

The objective function (loss function and regularization) at iteration t that we need to minimize is the following:

$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t)$

where:

- $y_i$ is the real value (label) known from the training data set,

- $l$ is the loss function,

- $\hat{y}_i^{(t-1)}$ is the prediction from the previous learners,

- $f_t(x_i)$ is the prediction of the current learner,

- $\Omega(f_t)$ is the regularization term.

It is easy to see that the XGBoost objective is a function of functions (i.e. $l$ is a function of CART learners, a sum of the current and previous additive trees). The term $l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big)$ can be seen as $f(x + \Delta x)$, where $x = \hat{y}_i^{(t-1)}$ and $\Delta x = f_t(x_i)$.

Taylor's Theorem and Gradient Boosted Trees

As an example, the best linear approximation of a function $f(x)$ around a point $a$ is:

$f(x) \approx f(a) + f'(a)(x - a)$

Figure 3 - Taylor linear approximation of a function around a point $a$

We need to use the Taylor approximation to transform the original objective function into a function in the Euclidean domain, in order to be able to use traditional optimization techniques.
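Concretely (this step is spelled out in [3] rather than in the article itself), a second-order Taylor expansion of the objective around the previous prediction gives:

$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$

where $g_i$ and $h_i$ are the first and second derivatives (gradient and hessian) of the loss with respect to the previous prediction. Since $l(y_i, \hat{y}_i^{(t-1)})$ is a constant at iteration $t$, only $g_i$ and $h_i$ are needed to build the next tree.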

How to build the next learner

Being at iteration t, we need to build a learner that achieves the maximum possible reduction of loss. The good news is that there is a way to "measure the quality of a tree structure q"; the scoring function is the following:

$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$

The tree learner structure $q$ scoring function, where $T$ is the number of leaves, $I_j = \{i \mid q(x_i) = j\}$ is the set of instances mapped to leaf $j$, and $\lambda$, $\gamma$ are regularization parameters.
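A small sketch of this structure score and the split gain derived from it (illustrative code following the formula in [3]):

# Structure score: sum over leaves of -1/2 * G^2 / (H + lambda) plus the
# gamma * T complexity penalty; a lower score means a better tree structure.
import numpy as np

def leaf_term(g, h, lam=1.0):
    return -0.5 * g.sum() ** 2 / (h.sum() + lam)

def structure_score(leaves, lam=1.0, gamma=0.0):
    # 'leaves' is a list of (g, h) gradient/hessian arrays, one pair per leaf
    return sum(leaf_term(g, h, lam) for g, h in leaves) + gamma * len(leaves)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    # loss reduction from splitting one leaf into two (evaluated greedily per node)
    parent = leaf_term(np.concatenate([g_left, g_right]),
                       np.concatenate([h_left, h_right]), lam)
    return parent - leaf_term(g_left, h_left, lam) \
                  - leaf_term(g_right, h_right, lam) - gamma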

The bad news is that it is computationally impossible to "enumerate all the possible tree structures q" and find the one with maximum loss reduction; instead, the tree is grown greedily, one split at a time.

Binary classification with log loss optimization

Let's take the case of binary classification and the log loss objective function:

$l(y, p) = -\big(y \log(p) + (1 - y)\log(1 - p)\big)$

Binary classification with cross-entropy loss function, where $y$ is the real label in $\{0, 1\}$ and $p$ is the probability score.

Note that $p$ (a score or pseudo-probability) is obtained by applying the famous sigmoid function $p = \frac{1}{1 + e^{-x}}$ to the output $x$ of the GBT model.

The output $x$ of the model is the sum across the CART tree learners: $x = \sum_t f_t(\cdot)$.

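For this loss, the gradient and hessian with respect to the model output x have simple closed forms, $g = p - y$ and $h = p(1 - p)$, which is exactly what a gradient-boosting library needs at each iteration (illustrative sketch of the math above):

# Per-instance gradient and hessian of the log loss, as used when
# boosting with a binary cross-entropy objective.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logloss_grad_hess(preds, labels):
    # preds are raw model outputs x (before the sigmoid); labels are y in {0, 1}
    p = sigmoid(preds)
    grad = p - labels
    hess = p * (1.0 - p)
    return grad, hess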

Data analysis

Data presentation

- For our work, the above algorithms are developed using data from kaggle.com [4].

- There are 17 columns and 33,908 rows in our dataset.

- We want to check whether we can predict "churn" from a set of variables, and how well we can do it. There is a high rate of churn; the figure shows that we are dealing with unbalanced data.

Figure 4 - Churn
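A minimal loading-and-inspection sketch (the CSV file name and the target column name are assumptions; adjust them to the files from [4]):

# Load the insurance churn dataset and confirm its shape and class balance.
import pandas as pd

df = pd.read_csv("insurance_churn.csv")          # assumed file name
print(df.shape)                                  # expected: (33908, 17)
print(df["churn"].value_counts(normalize=True))  # shows the class imbalance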

Preprocessing

For better use of the data, preprocessing is necessary. It includes the following steps (a sketch follows the list):

- Rename the columns

- Separate the data into X and y

- Normalize X

- Split the data into training and test sets

- Apply the SMOTE algorithm to address the problem of unbalanced data
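A sketch of these steps (the column names and the 80/20 split ratio are assumptions; SMOTE comes from the imbalanced-learn package):

# Preprocessing pipeline mirroring the list above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE

df = pd.read_csv("insurance_churn.csv")
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]  # rename

X = df.drop(columns=["churn"])                  # separate the data into X ...
y = df["churn"]                                 # ... and y

X = MinMaxScaler().fit_transform(X)             # normalization of X

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE is applied to the training data only, so the test set stays untouched.
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)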

Model evaluation

Random Forest Evaluation


Figure 5 - Confusion Matrix

              precision    recall  f1-score   support

           0       0.95      0.94      0.94      5989
           1       0.58      0.62      0.60       793

    accuracy                           0.90      6782
   macro avg       0.76      0.78      0.77      6782
weighted avg       0.91      0.90      0.90      6782

Figure 6 - Classification report

The classification report shows an accuracy of 0.90, which means that out of one hundred individuals in our test dataset, the model correctly predicts the class membership of 90.

We can observe the results of predictions in the confusion matrix.


We have an AUC of 0.93. The ROC curve shows that the algorithm performs well.
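A sketch of how these evaluation figures can be produced (model settings are illustrative; the variable names continue the preprocessing sketch above):

# Fit the classifier, then report confusion matrix, precision/recall and AUC.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))   # AUC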

Figure 7 - ROC curve (Random Forest)

XGBoost Evaluation


Figure 8 - Confusion Matrix

              precision    recall  f1-score   support

           0       0.95      0.95      0.95      5989
           1       0.59      0.60      0.60       793

    accuracy                           0.91      6782
   macro avg       0.77      0.77      0.77      6782
weighted avg       0.91      0.91      0.91      6782

Figure 9 - Classification report

The classification report shows an accuracy of 0.91, which means that out of one hundred individuals in our test dataset, the model correctly predicts the class membership of 91.

We can observe the results of predictions in the confusion matrix. We have an AUC of 0.92, and the ROC curve shows that the algorithm performs well. The accuracy of the Random Forest algorithm is 0.90, which is lower than that of the XGBoost algorithm at 0.91.

We can therefore conclude that the XGBoost algorithm predicts churn better, and we keep XGBoost for the rest of our work.

Figure 10 - ROC curve (XGBoost)

Literature

1. Chauwin C. Qu'est-ce que le taux d'attrition et comment le réduire ? 23.10.2017.

2. Corporate Ink for CallMiner, June 24, 2020.

3. dimleve.medium.com/xgboost-mathematics-explained

4. kaggle.com/blentalikan/insurancechurnprediction/data

© Koffi Evelyne Flore, 2022
