Scientific article on the topic 'BIG DATA PROJECT - BANK MARKETING CAMPAIGN'


CC BY


Section 1. Marketing

https://doi.org/10.29013/EJEMS-21-2-3-15

Wu Yiheng,

The Summit Country Day School, Cincinnati, USA E-mail: 3203414217@qq.com

BIG DATA PROJECT - BANK MARKETING CAMPAIGN

Abstract. When I first looked at the dataset, I found it different and enjoyable. It is not only a real-life situation I can analyze but also a good chance to combine what I have learned in school with what I learned about machine learning, and to form a path to the final result. There were terms I did not understand or had never even heard of, but working through them and completing four major machine learning models was a new level of achievement for me. I made many illustrations, including graphs and tables, to better show how machine learning works and how the results were obtained.

Keywords: Bank Marketing Campaign, Machine Learning, Artificial Neural Networks, Logistic Regression, Support Vector Machine, Random Forest.

1. Introduction

The power of finance has long been inspirational to me. Understanding the usage and scope of money has proven essential in the modern world. We, as customers or as vendors, need to use money or other forms of currency to transact. Based on this need, banks emerged early in history, along with simple rules on saving and using money. Later, financial services such as deposits and loans emerged. According to the Oxford Dictionary, the modern definition of the word "deposit" is "a sum of money placed or kept in a bank account, usually to gain interest", whereas "loan" means anything that is borrowed and is typically expected to be paid back with interest. Moreover, starting as early as the Middle Ages in Great Britain, depositing valuables in banks was considered the safest way to store goods; over time, the Bank of England's new identity as the "lender of last resort" was acknowledged. The Bank of England not only guarantees the identity of every commercial bank in the UK but also allows anyone to borrow and save money in the bank.

Inflation is a general rise in the price of goods and services over time. It can be affected in different ways by policies introduced by the government, economic programs launched by a tycoon, or the financial habits people become accustomed to. In general, inflation causes money to lose value. Thus, if the consumer price index rises, our money has devalued. The inflation rate in recent years, shown in figure 1, has been between 1% and 3%. In order not to lose money, we turn to products offered by financial institutions. Usually, banks offer a higher rate of return on investments than on money simply kept in our accounts, allowing us to maintain the growth of our financial value by beating inflation. A loan is the opposite of a deposit: it is a special form of debt lent by a corporation.

Typically, banks attract customers by offering an interest rate higher than the inflation rate. One way to achieve that is by offering a time (term) deposit account, a type of deposit account in which the money is locked up for a set period, ranging from one month to a few years (Chen [1]). Unlike an ordinary deposit, a term deposit is a fixed investment that cannot be withdrawn before the term ends (Chen [1]). By investigating the deposits and loans a person holds, we can better understand the distribution of wealth in terms of gender, occupation, and other factors. It is exciting for me to investigate how other factors, including the results of past contacts and the days between marketing campaigns, influence the acceptance of term deposits among the customers contacted.

Figure 1. Projected annual inflation rate in the U.S. from 2010 to 2021. (Statista Research Department)

2. Hypothesis

As one's wealth builds up, he or she generally accumulates more savings than others. Therefore, I hypothesize that people with jobs related to business, science, and other white-collar work are more likely to subscribe to term deposits after the bank's marketing campaign.

3. Data and Methods

3.1. Data

I did some research and adopted a dataset from a Portuguese banking institution, available in the UC Irvine Machine Learning Repository. It has 16 predictor variables, as shown in table 1, and one outcome variable, which is whether the client subscribed to the term deposit.

Table 1. - A list of variables in the dataset

| Variable | Data type | Description |
|---|---|---|
| age | numeric | Age of the client |
| job | categorical | Type of job |
| marital | categorical | Marital status |
| education | categorical | Education level |
| default | binary | Has credit in default? |
| balance | numeric | Average yearly balance, in euros |
| housing | binary | Has a housing loan? |
| loan | binary | Has a personal loan? |
| contact | categorical | Contact communication type |
| day | numeric | Last contact day of the month |
| month | categorical | Last contact month of the year |
| duration | numeric | Last contact duration, in seconds |
| campaign | numeric | Number of contacts performed during this campaign and for this client |
| pdays | numeric | Number of days that passed by after the client was last contacted from a previous campaign |
| previous | numeric | Number of contacts performed before this campaign and for this client |
| poutcome | categorical | The outcome of the previous marketing campaign |

| Output variable | Data type | Description |
|---|---|---|
| y | binary | Has the client subscribed to a term deposit? |

3.2 Exploratory Data analysis

To explore the data, a bar plot of the outcome variable with the count of observations in each category is shown in figure 2. It shows that the dataset is unbalanced, with about 89% of the outcomes being "no" and 11% being "yes". This can be addressed with various techniques during the machine learning phase. Figure 3 is a heatmap of the correlation matrix for the numerical variables in the dataset. The color of the squares in the plot represents the magnitude of the correlation coefficient. Apart from a moderate correlation of 0.58 between "pdays" and "previous", no notable relationship was found between the numerical variables.
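The class shares and a correlation coefficient of the kind summarized in figures 2 and 3 can be computed as in the following minimal sketch. The sample values are hypothetical and only stand in for the real dataset, and the helper names `class_shares` and `pearson` are illustrative; the code uses only the Python standard library rather than the plotting tools used for the figures.

```python
from collections import Counter
from math import sqrt


def class_shares(labels):
    """Return the fraction of observations in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy sample: 8 "no" and 1 "yes", roughly the 89% / 11% split.
y = ["no"] * 8 + ["yes"]
shares = class_shares(y)

# Hypothetical values for two numerical columns.
pdays = [0, 10, 20, 30, 40]
previous = [0, 1, 1, 3, 4]
r = pearson(pdays, previous)
```

A share far from 50% for either class, as here, is the signal that the outcome variable is unbalanced.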

3.3 Data preprocessing

To prepare the data for the machine learning models, text values need to be converted into numbers. This is done by one-hot encoding after combining the less frequent values of high-cardinality columns such as "job" and "month". Figure 4 shows the histograms of "job" and "month" before combining the less frequent values.
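The two preprocessing steps described above, lumping rare categories and one-hot encoding, can be sketched as follows. This is an illustration in plain Python, not the project's own code: the sample "job" values, the `min_count` threshold, and the helper names `lump_rare` and `one_hot` are all assumptions.

```python
from collections import Counter


def lump_rare(values, min_count):
    """Replace categories seen fewer than min_count times with "other"."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "other" for v in values]


def one_hot(values):
    """Return (column_names, rows) with one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical "job" column; min_count=2 is an arbitrary threshold.
jobs = ["admin", "technician", "admin", "student", "technician", "retired"]
lumped = lump_rare(jobs, min_count=2)
columns, encoded = one_hot(lumped)
```

Lumping before encoding keeps the number of generated columns small, which matters for high-cardinality variables.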

Figure 2. Countplot of the outcome variable "y"

Figure 3. Correlation heatmap of numerical features (color scale from -1.00 to 1.00)

| | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| age | 1.00 | 0.08 | -0.02 | -0.00 | -0.01 | -0.01 | -0.00 |
| balance | 0.08 | 1.00 | -0.01 | -0.02 | -0.01 | 0.01 | 0.03 |
| day | -0.02 | -0.01 | 1.00 | -0.02 | 0.16 | -0.09 | -0.06 |
| duration | -0.00 | -0.02 | -0.02 | 1.00 | -0.07 | 0.01 | 0.02 |
| campaign | -0.01 | -0.01 | 0.16 | -0.07 | 1.00 | -0.09 | -0.07 |
| pdays | -0.01 | 0.01 | -0.09 | 0.01 | -0.09 | 1.00 | 0.58 |
| previous | -0.00 | 0.03 | -0.06 | 0.02 | -0.07 | 0.58 | 1.00 |

Figure 4. Distribution plot of variables "job" and "month" before combining less frequent values

Figure 5. Distribution of each feature variable, colored by "y"

Figure 6. Distribution plot of numerical features after scaling

Figure 7. Correlation heatmap of all features after one-hot-encoding and scaling

The histograms of each feature variable, shown in figure 5, were plotted using Seaborn's "histplot" function after combining the less frequent values, which are lumped into "other". In figure 5, the X-axis represents the variable divided into a set of discrete bins, whereas the Y-axis represents the number of observations falling within each bin, shown by the height of the corresponding bar. The observations within each bin are also color-coded by the outcome variable "y": blue bars represent the clients who didn't subscribe to a term deposit and pink bars represent the successful campaign cases.

Another preprocessing step is to put all the numerical data on the same scale. Figure 6 shows the histograms of the numerical columns after scaling. The heatmap in figure 7 shows no strong correlation between the features.
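The text does not say which scaling method was applied. The sketch below assumes standardization (zero mean, unit standard deviation) purely for illustration; the sample values and the helper name `standardize` are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical ages and balances on very different scales.
age = [25, 35, 45, 55]
balance = [100, 1500, 2900, 4300]

age_scaled = standardize(age)
balance_scaled = standardize(balance)
```

After scaling, both columns span the same range even though the raw balances are two orders of magnitude larger than the ages, so neither dominates a model that is sensitive to feature magnitude.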

4. Machine Learning Models

4.1 Artificial Neural Networks

The first model used was the artificial neural network, a machine learning model inspired by the human brain. The model is implemented with Keras from TensorFlow 2. The multilayer perceptron (MLP) is the most common form of neural network, and a sequential model is the easiest way to build an MLP classifier. The Rectified Linear Unit (ReLU) activation function was used for the hidden layers and the sigmoid function for the output layer.

| Layer | Input shape | Output shape |
|---|---|---|
| dense_input: InputLayer | [(None, 39)] | [(None, 39)] |
| Dense (first hidden layer) | (None, 39) | (None, 300) |
| dense_1: Dense | (None, 300) | (None, 200) |
| dense_2: Dense | (None, 200) | (None, 1) |

Figure 8. MLP model summary
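To make the layer shapes in figure 8 concrete, the following sketch runs one forward pass through a network of the same widths (39, 300, 200, 1) with ReLU hidden layers and a sigmoid output. It is written in plain Python instead of Keras, and the weights are random and untrained, so the resulting probability is meaningless; the initialization range and the helper names are assumptions made for illustration.

```python
import math
import random


def dense(inputs, weights, biases, activation):
    """Apply one fully connected layer: activation(W x + b)."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)
sizes = [39, 300, 200, 1]  # layer widths as listed in figure 8
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

# Trainable parameters: weights plus biases of every layer.
n_params = sum(a * b + b for a, b in zip(sizes, sizes[1:]))

x = [rng.uniform(-1, 1) for _ in range(39)]  # one hypothetical scaled client
hidden1 = dense(x, *layers[0], relu)
hidden2 = dense(hidden1, *layers[1], relu)
probability = dense(hidden2, *layers[2], sigmoid)[0]
```

With these widths the network has 72,401 trainable parameters, most of them in the connection between the two hidden layers.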

After we created the sequential model, we compiled it with the stochastic gradient descent (SGD) optimizer and the "binary cross-entropy" loss function, with accuracy as the metric. The model was trained for 30 epochs; in each epoch the whole training set is fed through the MLP once. Figure 8 summarizes the layers of the model, from the input to the output. Figure 9 is an illustrative graph of the neural network. Figure 10 shows the loss and accuracy of the training and validation sets during training, and reflects that the accuracy of the model is improving. In figure 11, we get a closer look at the trend of increasing accuracy and decreasing loss during the process. It is reasonable that training accuracy is higher than testing accuracy while training loss is lower than testing loss, because the model is fitted to the training data and the test set shows how well it generalizes to data it has not seen.
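The loss function and optimizer named above can be illustrated on the smallest possible case: a single sigmoid unit updated by stochastic gradient descent on one example. This is a simplified sketch, not the training loop Keras runs; the example values, the learning rate of 0.1, and the helper names are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, p):
    """Loss for one example with label y_true (0 or 1) and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def sgd_step(weights, bias, x, y_true, learning_rate):
    """One stochastic gradient descent update for a single sigmoid unit."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    error = p - y_true  # gradient of the loss with respect to the pre-activation
    new_weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


def loss_for(weights, bias, x, y_true):
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return binary_cross_entropy(y_true, p)


# Hypothetical example: two scaled features, true label "yes" (1).
x, y_true = [0.5, -1.2], 1
weights, bias = [0.0, 0.0], 0.0

loss_before = loss_for(weights, bias, x, y_true)
for _ in range(30):  # repeated updates on the same example, for illustration
    weights, bias = sgd_step(weights, bias, x, y_true, learning_rate=0.1)
loss_after = loss_for(weights, bias, x, y_true)
```

Each update moves the weights against the gradient of the loss, which is why the loss curve falls as training proceeds.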

4.2 Logistic Regression

Regression is a type of mathematical analysis that relates variables so that we can analyze the trend or pattern of the relationship between them. Logistic regression is a machine learning method that predicts a binary outcome. Unlike linear regression, which predicts a continuous outcome, logistic regression outputs a probability between 0 and 1 that is then used to assign each observation to a class. There should be a dependent (target) variable and a set of independent (feature) variables.

In our ROC curve (receiver operating characteristic curve) for the logistic regression model, the X-axis is the false positive rate, which means the fraction of negatives that have been selected as positive incorrectly. On the Y-axis, we have the True positive rate, which measures the fraction of the initial positives which have been predicted correctly as positive by the classifier.
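The two rates defined above can be computed for every decision threshold to trace the ROC curve, and the area under it gives the AUC. The sketch below uses hypothetical labels and scores, not the model's actual predictions, and the helper names `roc_points` and `auc` are illustrative.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by lowering the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = subscribed) and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]

curve = roc_points(y_true, scores)
area = auc(curve)
```

An area of 0.5 corresponds to random guessing and an area of 1 to perfect separation of the two classes.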

Table 2 shows the odds ratios of the feature variables, obtained by exponentiating the log-odds coefficients provided by the logistic regression classifier. The effect size of the factor "balance" can be expressed by its odds ratio:

$$\mathrm{OR}_{\text{balance}} = \frac{\mathrm{Odds}(\text{subscribed} \mid \text{balance} = x + 1)}{\mathrm{Odds}(\text{subscribed} \mid \text{balance} = x)}$$

When an odds ratio has a value above one, the corresponding feature variable is positively associated with the target variable. For instance, in table 2 the odds ratio for balance is about 2.22. It means that the odds of purchasing a term deposit increase by about 122 percent when the customer's (scaled) balance increases by one unit.
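The conversion between a log-odds coefficient, an odds ratio, and a percentage change in the odds is short enough to check by hand. The sketch below uses the balance odds ratio reported in the paper; the 11% baseline probability and the helper names are assumptions chosen for illustration.

```python
import math


def odds_ratio(coefficient):
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(coefficient)


def percent_change_in_odds(ratio):
    """Percentage change in the odds for a one-unit increase in the feature."""
    return (ratio - 1.0) * 100.0


def updated_probability(p, ratio):
    """Probability after multiplying the odds of p by the given odds ratio."""
    odds = p / (1.0 - p) * ratio
    return odds / (1.0 + odds)


balance_ratio = 2.22041722              # odds ratio reported for "balance"
balance_coef = math.log(balance_ratio)  # the log-odds coefficient it implies
change = percent_change_in_odds(balance_ratio)

# Hypothetical baseline: a client with an 11% chance of subscribing.
p_new = updated_probability(0.11, balance_ratio)
```

Note that the 122 percent increase applies to the odds, not to the probability: for the hypothetical 11% baseline, the probability rises to about 0.22, which is less than a 122 percent increase.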

Figure 9. Artificial Neural Networks visualization (from input layer to output layer)

Figure 10. MLP train and validation set loss and accuracy plot during training (x-axis: epochs 0 to 30; curves: loss, accuracy, val_loss, val_accuracy)

Figure 11. Train and Test set loss and accuracy of MLP (x-axis: epochs)

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 0.96 | 0.80 | 0.98 | 801 |
| 1 | 0.33 | 0.77 | 0.47 | 104 |
| accuracy | | | 0.80 | 905 |
| macro avg | 0.65 | 0.79 | 0.67 | 905 |
| weighted avg | 0.89 | 0.80 | 0.83 | 905 |

Figure 12. Logistic Regression Model classification report and ROC curve

Table 2. - Odds ratios of selected feature variables

| Feature variable | Odds ratio (>= 1.5 or <= 0.5) |
|---|---|
| balance | 2.22041722 |
| duration | 3.13496465e+05 |
| pdays | 6.09432521 |
| previous | 11.58689 |
| month other | 2.32546258 |

| Job-related variable | Odds ratio |
|---|---|
| job adm | 1.08849395 |
| job blu | 8.94624502e-01 |
| job man | 1.06234040 |
| job other | 1.02965993 |
| job ser | 9.70727022e-01 |
| job tec | 9.72869526e-01 |

4.3 Support Vector Machine

The first thing we do in an SVM is to find a hyperplane. A hyperplane is basically a boundary that best separates the two classes of data. A hyperplane in a 3-D graph is a plane, while it is a line in a 2-D graph and, similarly, a dot on a line.
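A hyperplane can be written as w . x + b = 0, and the side a point falls on is given by the sign of w . x + b. The sketch below evaluates a fixed, hand-picked 2-D hyperplane on hypothetical points and measures each point's distance to it. It does not train an SVM: finding the hyperplane with the largest possible margin is an optimization problem that a library solver handles, and all values and helper names here are assumptions.

```python
import math


def decision_value(w, b, x):
    """Signed score w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(w, b, x):
    """Geometric distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


# Hypothetical 2-D hyperplane (a line): x1 + x2 - 3 = 0.
w, b = [1.0, 1.0], -3.0

# Hypothetical points with class labels +1 and -1.
points = [([3.0, 2.0], 1), ([4.0, 4.0], 1), ([1.0, 1.0], -1), ([0.0, 0.5], -1)]

correct = all(decision_value(w, b, x) * label > 0 for x, label in points)
distances = [distance_to_hyperplane(w, b, x) for x, _ in points]
margin = min(distances)  # the closest points play the role of support vectors
```

The points at the smallest distance are the ones that constrain where the boundary can lie; moving any other point slightly would leave the boundary unchanged.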

Figure 13. Support Vector Machine classification report and ROC curve

Our goal is to find a hyperplane that has the largest margin between the two classes of data. Here, we reach the support vectors. They are the points that lie closest to the hyperplane, and they play a major role in determining its position. The AUC (area under the curve) above represents how well the model separates the different classes.

4.4 Random Forest

The random forest is a powerful yet illustrative technique that groups related data together. Each small section we see below in figure 15 is called a node. There are two kinds of nodes in general: decision nodes and leaf nodes. We split each node with a method that helps us limit overfitting and eventually bin outliers and non-linear data.

Decision trees help us with that. The decision tree is a model that processes the data based on the independent variables.

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| 0 | 0.89 | 0.99 | 0.91 | 801 |
| 1 | 0.52 | 0.11 | 0.18 | 104 |
| accuracy | | | 0.89 | 905 |
| macro avg | 0.71 | 0.55 | 0.56 | 905 |
| weighted avg | 0.85 | 0.89 | 0.85 | 905 |

Figure 14. Random Forest Model classification report and ROC curve

Figure 15. Part of the random forest visualization

In this case, the model will train each sample with questions such as: "What is your job?", "What is your marital status?". Thereby, we divide the samples into different categories in each branch. Eventually, when we combine these decision trees together, we have the random forest model. One key term of this model is Gini impurity, which measures how mixed the classes are within a node. It is what we try to reduce when splitting, since a high value points to many wrongly classified samples in a node.
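Gini impurity can be computed directly from the class labels in a node, and a candidate split can be scored by the size-weighted impurity of its two children. The sketch below uses hypothetical nodes and a hypothetical split question; the helper names `gini_impurity` and `split_impurity` are illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    total = len(left) + len(right)
    left_part = (len(left) / total) * gini_impurity(left)
    right_part = (len(right) / total) * gini_impurity(right)
    return left_part + right_part


# Hypothetical node with 6 "no" and 2 "yes" outcomes.
parent = ["no"] * 6 + ["yes"] * 2

# Hypothetical split on a question such as "Is the client married?".
left = ["no"] * 5
right = ["no", "yes", "yes"]

parent_gini = gini_impurity(parent)
after_split = split_impurity(left, right)
```

A pure node has an impurity of 0, so a split is useful when the weighted impurity of the children is lower than that of the parent, as it is in this example.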

5. Conclusion

Across all four models we employed, and the preceding data analysis, we found no direct relationship between the kind of job one has and the decision to buy term deposits, contrary to our hypothesis: the odds ratios of the job-related variables in table 2 all lie close to 1 (between about 0.89 and 1.09). For our neural network model, the curve is generally smooth, which means that the model gets more accurate at a slow but steady pace. Our process and results for logistic regression and support vector machines are quite similar. In both models, we saw a similar ROC curve, with an AUC of 0.89, closer to 1 than to 0.5, which means that the models perform relatively well at distinguishing between the two classes. We get a high precision score in logistic regression, but with a relatively low accuracy of 0.8. On the other hand, though not that precise, Support Vector Machines proved to be more accurate. In the end, since our data is imbalanced between category 0 and category 1, where category 0 has far more data and higher scores according to the f1 value, the macro average (the unweighted mean of the per-class scores) is far lower than the weighted average (the mean of the per-class scores weighted by the number of samples in each class). However, both average values from the Random Forest model are higher than those of the other models. Therefore, random forest is one of the most successful models for this dataset.
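The gap between the macro and weighted averages follows directly from the class imbalance. The sketch below recomputes both averages from the per-class precision and support values printed in the logistic regression classification report; the helper names are illustrative.

```python
def macro_average(values):
    """Unweighted mean of a per-class metric."""
    return sum(values) / len(values)


def weighted_average(values, supports):
    """Mean of a per-class metric weighted by the number of samples per class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


# Per-class precision and support from the logistic regression report.
precision = [0.96, 0.33]  # class 0, class 1
support = [801, 104]

macro = macro_average(precision)
weighted = weighted_average(precision, support)
```

Because class 0 contributes 801 of the 905 test samples, the weighted average stays close to the class 0 score, while the macro average is pulled down by the weaker class 1 score.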

6. Discussion

Our project does have data limitations. The source, based on a bank in Portugal, is somewhat outdated. As the old saying goes, a machine learning model can only predict what the data illustrates. We may get better predictions if we manage to obtain a more refined and recent dataset.

References:

1. Pettinger Tejvan. "Purpose of Banks." Economics Help. URL: http://www.economicshelp.org/blog/glossary/banks/#:~:text=A%20bank%20is%20a%20financial,for%20a%20variety%20of%20loans

2. Kagan Julia. "Loan." Investopedia, 1 Dec. 2020. URL: http://www.investopedia.com/terms/l/loan.asp

3. "What Is Logistic Regression?" Statistics Solutions, 9 Mar. 2020. URL: http://www.statisticssolutions.com/what-is-logistic-regression

4. Statista Research Department. "Projected annual inflation rate in the United States from 2010 to 2021." Statista. URL: http://www.statista.com/statistics/244983/projected-inflation-rate-in-the-united-states/ Accessed 19.02.2021.

5. Statista Research Department. "Projected annual inflation rate in the United States from 2010 to 2021." Statista. URL: https://www.statista.com/statistics/244983/projected-inflation-rate-in-the-united-states/ Accessed 19.02.2021.

6. Gandhi Rohith. "Introduction to Machine Learning Algorithms: Logistic Regression." Hacker Noon, 28 May 2018. URL: http://hackernoon.com/introduction-to-machine-learning-algorithms-logistic-regression-cbdd82d81a36

7. Giraud Aurelie. "Quick Intro to Random Forest." Medium, Towards Data Science, 31 Mar. 2020. URL: http://towardsdatascience.com/quick-intro-to-random-forest-3cb5006868d8

8. Narkhede Sarang. "Understanding AUC - ROC Curve." Medium, Towards Data Science, 14 Jan. 2021. URL: http://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
