
https://doi.org/10.29013/ESR-20-1.2-13-22

Rui Yang, Emory University
E-mail: yr159123@hotmail.com

AMERICANS' WORRY ABOUT FINANCING RETIREMENT: COMPARING THREE PREDICTIVE MODELS

Abstract. This paper aims to build a predictive model of Americans' financial worry over retirement on the basis of demographic factors and subjective financial, physical, and mental health conditions. A cross-sectional, nationally representative data set based on the National Health Interview Survey 2018 was used. Missing values in the data set were first flagged with dummy features and then replaced using mean value imputation. Three machine learning models, random forest, logistic regression, and multilayer perceptron, were built, and their respective performances were compared. All three algorithms reported fairly similar results, with approximately a 0.9 true positive rate (TPR), a 0.3 false positive rate (FPR), and a 0.88 area under the curve (AUC). We also found that financial condition is the most important factor relating to people's financial worry. Accordingly, policy-makers should put more weight on this factor when designing specific policies or deciding an individual's eligibility to receive necessary assistance.

Keywords: retirement, financial worry, random forest, logistic regression, multilayer perceptron

Despite recent rebounds in the economy, Americans today have become increasingly worried about financing their retirement since the Great Recession. A survey conducted by the Pew Research Center in 2012 showed that people's worry about retirement finances has been on the rise recently: compared to 25% in 2009, about 38% of adults expressed that they are "not too" or "not at all" confident that they have enough money to support their retirement plans (Morin & Fry [18]). The same research also showed that the proportion of people worried about financing their retirement varies greatly among different age and income groups, with middle-aged and mid- to low-income individuals being the most worried.

While the subject of financial preparation for retirement has been extensively explored, most research has only focused on how different factors are associated with people's financial anxiety. Research by Owen and Wu in 2007 established the relationship between psychological factors, as well as marital status, and people's degree of concern about the adequacy of their retirement plans [1]. The result of their study showed that, compared to other groups, singles and those who have experienced negative financial shocks worry more about financing retirement. In 2004, Morgan and Eckert found that age, health, and income are significant predictors of financial preparation and anxiety [8].

Almost all of these studies, however, were done on the macro-level. There is a noticeable lack of studies that actually connect factors, such as age or marital status, to build a model that gives predictions on an individual level. This paper aims to fill this gap by developing, validating, and comparing several models based on these factors and, at the same time, finding the most significant predictor relating to people's financial worry after retirement. A more accurate predictive model might help policymakers tailor specific policies on the micro-level to address such worries more effectively.

Method

Data Set Description

The National Health Interview Survey (hereinafter referred to as NHIS) is a nationally representative cross-sectional household study that monitors trends in illness and the progress of current health objectives. The survey was initiated in 1957 by the Centers for Disease Control and Prevention and has been conducted annually since then, with an approximate 70% response rate (National Center for Health Statistics [9]). Data from the survey conducted in 2018 were used in this analysis; that survey had a response rate of 83.9%, with 25,417 completed interviews out of 30,297 eligible adults (NHIS 2018 Survey Description [11]). The data set from the 2018 NHIS (NHIS 2018 Sample Adult File [10]) was used to identify potential associations between U.S. adults' financial worry after retirement and factors including demographic features, health condition, smoking behavior, sleeping habits, and mental condition.

A list of the 28 indicator features used in the model can be found in Table 1.

Table 1. Description of the 28 Features

Group: Other
    SEX        Sex
    REGION     Region
    RACERPI2   Race group
    AGE_P      Age
    R_MARITL   Marital status

Group: Employment
    YRSWRKPA   Number of years on the job
    WRKLYR4    Whether have a job or business at any time in the past 12 months

Group: Health
    HYPEV      Ever been told having hypertension
    CHDEV      Ever been told having coronary heart disease
    CANEV      Ever been told having cancer
    LIVEV      Ever been told having chronic liver condition
    AMIGR      Had severe headache/migraine in the past 3 months
    AINTIL2W   Had stomach problem with vomiting/diarrhea in the past 2 weeks
    DIBEV1     Ever been told having diabetes
    ARTH1      Ever been told having arthritis

Group: Smoking
    SMKSTAT2   Had smoked at least 100 cigarettes

Group: Financial
    ASISTLV    How worried about not being able to maintain the standard of living
    ASINBILL   ... pay monthly bills
    ASIHCST    ... pay rent/mortgage/housing costs
    ASICCMP    ... make credit card payments
    ASICNHC    ... afford medical costs of healthcare

Group: Mental
    ASISLEEP   Hours of sleep on average
    ASISLPFL   Number of times having trouble falling asleep in the past week
    ASISAD     How often did you feel so sad that nothing cheers you up in the past month
    ASINERV    How often did you feel nervous in the past month
    ASIRSTLS   How often did you feel restless in the past month
    ASIHOPLS   How often did you feel hopeless in the past month
    ASIWTHLS   How often did you feel worthless in the past month

Data Pre-processing

Recoding

Currently, the U.S. full retirement age is 66 years and 2 months (U.S. Social Security Administration [19]). For the purpose of this study, sample points with an age greater than 67 years were excluded from the data set, leaving a final sample size of N = 19,090. The outcome variable is coded as "ASIRETR", corresponding to the survey question "How worried are you right now about not having enough money for retirement?" The effective responses consist of four levels, indicating that the respondent is "very," "moderately," "not too," or "not at all" worried about their financial condition after retirement. This variable was recoded into a binary variable in which the first three levels were combined and coded as 1, while the last level was coded as 0. In this way, one indicates that a respondent has financial worry over retirement, while zero indicates that the respondent does not.

The explanatory variable representing race (coded as "RACERPI2") was recoded into three groups: white, black/African American, and other, where the third group is a combination of other races as well as multiple races.

A nominal variable is a categorical variable whose levels are simply labels and thus carry no notion of order. For example, in the variable "REGION", "Northeast" is encoded as 1 and "Midwest" is encoded as 2. Even though we want these two levels to be equally weighted, feeding the feature directly into the model is usually problematic, as most algorithms will mistakenly assume that Midwest is greater than Northeast, and the results produced this way may not be optimal. A way to solve this is one-hot encoding (Raschka [16]). The idea behind this approach is to create a dummy feature for each unique value of the nominal variable. Here, for each region in the "REGION" variable, a new binary feature was created whose values indicate whether a sample belongs to that particular region.
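A minimal sketch of this encoding step (the paper created its dummies in R, as described in the Environment subsection below; this Python equivalent and its toy values are illustrative only):

```python
import pandas as pd

# Hypothetical toy frame standing in for the NHIS data;
# 1 = Northeast, 2 = Midwest, 3 = South, 4 = West.
df = pd.DataFrame({"REGION": [1, 2, 3, 4, 2]})

# One binary dummy column per unique region value.
region_dummies = pd.get_dummies(df["REGION"], prefix="REGION")
df = pd.concat([df.drop(columns="REGION"), region_dummies], axis=1)
print(df.columns.tolist())  # ['REGION_1', 'REGION_2', 'REGION_3', 'REGION_4']
```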

Missing value

It is common in real-world applications that samples contain missing values for various reasons. In the NHIS data set, a missing value is introduced when the respondent either refused to answer the question or did not know the answer. Most machine learning algorithms become problematic when missing values are present in the data set. A convenient yet defective approach is simply to remove data that contain a missing value. However, depending on the number of missing values, we may end up removing too many sample points, which introduces a significant new selection bias, and, at the same time, we risk losing valuable information that the classifier needs to learn the model parameters. An alternative approach is mean value imputation, where each missing value is replaced by the mean of that feature.

As can be seen in Figure 1, the frequency of missing values is relatively low in both classes: most features have less than 5% missing values, with only one exception that contains about 7% missing values. As a result, using mean value imputation is unlikely to have a significant impact on the overall reliability of the data set.

In this research, two steps were taken to treat the missing values. First, a dichotomous dummy feature was created alongside each feature, using one to indicate sample points containing a missing value in the corresponding feature and zero otherwise. Then the mean imputation method was used to fill in the blanks. This two-step method has the advantage of not only keeping all the original information but also making the whole data set usable by nearly all algorithms.
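A sketch of this two-step treatment, here expressed with scikit-learn's SimpleImputer (the paper's own pre-processing was done in R; the toy matrix is an assumption, and add_indicator=True appends the dichotomous missingness dummies):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing entries standing in for the NHIS features.
X = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [np.nan, 6.0]])

# Mean imputation plus one 0/1 missingness indicator per affected column.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
print(imputer.fit_transform(X))
# Columns: imputed feature 1, imputed feature 2, indicator 1, indicator 2.
```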

Figure 1. The distribution of missing values for each class

Standardization

Some machine learning algorithms, such as neural networks, require a specific technique called feature standardization for better training speed and accuracy. Feature standardization transforms different features into comparable scales and ensures all features weigh equally in the training process.

For each feature, the mean $\bar{x}$ and standard deviation $s$ were first computed. Each data point $x_i$ of that feature was then replaced by its standardized value $y_i$, calculated as:

$$y_i = \frac{x_i - \bar{x}}{s}$$

After standardization, every feature has zero mean and unit standard deviation.
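A minimal sketch of this transformation with scikit-learn's StandardScaler (the toy data are assumptions; in practice the scaler should be fit on the training split only and then applied to the test split):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()          # learns each column's mean and std
X_std = scaler.fit_transform(X)    # applies y_i = (x_i - mean) / std
print(X_std.mean(axis=0))          # ~[0. 0.]
print(X_std.std(axis=0))           # ~[1. 1.]
```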


Splitting up the data

For most machine learning algorithms, the data set typically needs to be randomly split into a training set and a test set. The training set is fed into the model to learn parameters, while the test set is held untouched; the test set is then used to give an unbiased measure of how well the trained model generalizes. In this study, the training set contains 70% of the data and the test set the remaining 30%.
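A sketch of the split (the toy data, stratification, and random seed are assumptions; the paper only states the 70/30 proportions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the assembled NHIS feature matrix and worry labels.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.integers(0, 2, size=100)

# 70% training / 30% test, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)  # (70, 5) (30, 5)
```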

Machine Learning Models

Logistic regression

Logistic regression is one of the most well-known classification algorithms and performs very well on linearly separable classes. In logistic regression, each feature has its own weight $w_i$. The net input $z$ is calculated as a linear combination of the inputs and the feature weights:

$$z = \mathbf{w}^{\top}\mathbf{x} = w_0 + w_1 x_1 + \cdots + w_m x_m,$$

where $m$ is the number of features. Given $z$ over the entire real number range, it can then be transformed into the probability that a sample belongs to a certain class through the logistic function:

$$\varphi(z) = \frac{1}{1 + e^{-z}}$$

The goal of logistic regression is to find the optimal weights that maximize the likelihood of correct classification, which is equivalent to minimizing the cost function over the entire data set given below:

$$C(\mathbf{w}) = -\frac{1}{n}\sum_{i=1}^{n}\Big[y^{(i)}\log\varphi\big(z^{(i)}\big) + \big(1 - y^{(i)}\big)\log\big(1 - \varphi\big(z^{(i)}\big)\big)\Big]$$

A typical problem during the training process is overfitting. Overfitting occurs when the model is more complex than the data warrant, and it can be identified when the model performs much better on the training set than on the test set. A possible way to reduce such high variance is to introduce regularization into the model. The concept behind regularization is to add an extra term to the cost function that penalizes extreme parameter weights. A common form of regularization is L2 regularization, which can be written as follows:

$$L_2 = \frac{1}{C}\sum_{i=1}^{m} w_i^2$$

This paper used L2 regularization and 5-fold cross-validation in the model. The regularization parameter C was set to 100, which was found via grid search. The model was built using the scikit-learn package with other options left at their defaults. Raschka's book provides more details about logistic regression and regularization [15].
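A sketch of this setup (the C grid, the scoring metric, and max_iter are assumptions; the paper states only L2 regularization, 5-fold cross-validation, and the selected C = 100):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# L2-penalized logistic regression with C chosen by 5-fold grid search.
grid = GridSearchCV(
    estimator=LogisticRegression(penalty="l2", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100, 1000]},  # assumed grid
    cv=5,
    scoring="roc_auc",  # assumed metric
)
grid.fit(X_train, y_train)   # training split from above
print(grid.best_params_)     # the paper reports C = 100
```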

Artificial neural network

An artificial neural network is a computational model inspired by biological neural networks. Unlike linear models such as logistic regression, neural networks are able to capture non-linear relations within data, which makes the algorithm stand out from linear models when the relationship between the input and output is highly complicated.

A multilayer network is an artificial neural network that consists of one input layer, one or multiple hidden layers, and one output layer. The input layer is the first layer, the output layer is the last layer, and any layers between them are hidden layers. The data are passed into the input layer, processed by the hidden layers, and finally transformed into predicted labels in the output layer.

An activation function in a neural network brings in the much-desired non-linearity that enables the model to capture almost any relationship. The three most common activation functions are the logistic function, the hyperbolic tangent function, and the rectified linear unit (ReLU); the latter is given by:

$$f(x) = \max(0, x)$$

This paper used ReLU as the activation function for the neural network. One advantage of ReLU over the other two functions in this model is its reduced computational cost (Arora et al. [2]). In this model, the network has two hidden layers, with 16 nodes in the first layer and 8 nodes in the second. The network was trained via a technique called backpropagation; more details about this technique and neural networks in general can be found in Raschka's book [17]. As with logistic regression, this paper used L2 regularization in the training process, and the regularization parameter α was set to 0.0572, which was found via grid search.
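A sketch of the described network with scikit-learn's MLPClassifier (the solver defaults, max_iter, and the seed are assumptions; the layer sizes, activation, and alpha follow the paper):

```python
from sklearn.neural_network import MLPClassifier

# Two ReLU hidden layers of 16 and 8 nodes, L2 penalty alpha = 0.0572,
# trained by backpropagation as described above.
mlp = MLPClassifier(
    hidden_layer_sizes=(16, 8),
    activation="relu",
    alpha=0.0572,
    max_iter=1000,     # assumed
    random_state=0,    # assumed
)
mlp.fit(X_train, y_train)
```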

Random forest

Random forest is an ensemble learner in which multiple weak learners (decision trees) are trained independently on random samples of the training data drawn with replacement. A single decision tree usually has a high chance of overfitting the data as it grows deeper; however, this high variance can be reduced by combining many uncorrelated decision trees trained on the same data set. Generally, a random forest outperforms a single decision tree and achieves a better balance between variance and bias. A more detailed description of random forests can be found in Breiman's article [3].

The random forest in this model was built using scikit-learn with bootstrap samples, 100 decision trees, and a minimum of 50 samples per leaf. The cross-entropy function given below was applied as the cost function to maximize the information gain for each decision tree:

$$I_H(t) = -\,p\log_2 p - (1 - p)\log_2(1 - p),$$

where $p$ is the proportion of the samples belonging to class zero at a particular node $t$ (Witten et al. [24]). Unlike some other random forest models that use a majority vote, this model decides the predicted label of each sample point by averaging the probabilistic predictions of the decision trees involved.
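A sketch matching those settings (the seed is an assumption; scikit-learn's RandomForestClassifier averages the trees' probabilistic predictions by default, as noted above):

```python
from sklearn.ensemble import RandomForestClassifier

# 100 bootstrap-sampled trees, entropy splitting criterion, and at least
# 50 samples per leaf to keep individual trees from overfitting.
forest = RandomForestClassifier(
    n_estimators=100,
    criterion="entropy",
    min_samples_leaf=50,
    bootstrap=True,
    random_state=0,  # assumed
)
forest.fit(X_train, y_train)
```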

Environment

The data pre-processing was mainly conducted in R 3.6.1 (R Core Team [14]). The missing value visualization was produced with the package ggplot2 (Wickham [21]), the data cleaning was done with the packages dplyr (Wickham et al. [22]) and tidyr (Wickham & Henry [23]), and dummy features were created with dummies (Brown [4]). The data set read-in, partition, model training, and model validation were done using scikit-learn (Pedregosa et al. [13]), SciPy (Virtanen et al. [20]), NumPy (Oliphant [12]), and pandas (McKinney [7]). Other graphs were produced using Matplotlib (Hunter [5]).

Model Validation

The most common metrics for measuring the performance of binary classification models are the confusion matrix, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC). A confusion matrix is simply a matrix that lays out the counts of true positive, true negative, false positive, and false negative predictions of a classifier. Figure 2 illustrates these terms.

Figure 2. An illustration of a confusion matrix

ROC graphs are useful for comparing the performance of different models. The x-axis of a ROC graph is the false positive rate, and the y-axis is the true positive rate. When predicting a particular test case, most machine learning algorithms return a probability rather than a binary label; the classification is made by setting a decision threshold to dichotomize the result. As the decision threshold shifts from 0 to 100%, the false positive rate and the true positive rate change accordingly, and connecting those points produces the ROC curve. The diagonal of a ROC graph can be interpreted as random guessing, while a curve that falls below the diagonal performs worse than guessing. A curve at the top-left corner, with a true positive rate of 100%, represents a perfect classifier that gives correct predictions under any decision threshold. In general, a classifier performs better the closer its ROC curve is to the top-left corner.

It might be hard to tell which algorithm performs better by looking directly at ROC curves, especially when one curve is not totally enclosed by another. AUC overcomes this by measuring the area under the ROC curve. A theoretically perfect classifier has an AUC of 1, while a classifier that guesses randomly has a value of 0.5; in general, a higher value indicates better classification performance.
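A sketch of how these metrics can be computed for one of the fitted models (using the random forest and the test split from the sketches above; the 0.5 threshold behind the confusion matrix is scikit-learn's default):

```python
from sklearn.metrics import auc, confusion_matrix, roc_curve

# Predicted probability of the positive (worried) class drives the ROC.
y_score = forest.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("AUC:", auc(fpr, tpr))

# Confusion matrix at the default 0.5 decision threshold.
print(confusion_matrix(y_test, forest.predict(X_test)))
```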

Results and Discussion

The confusion matrices of the three algorithms can be found in Figure 3. All three algorithms have about a 90% true positive rate and a 30% false positive rate. In diagnosing worry about financing retirement, we are most concerned with providing potential financial and mental assistance to those who are truly anxious; the models are 90% accurate in identifying those people. At the same time, it is also important to limit the potential waste of resources on those who are incorrectly identified as positive. In contrast to the true positive rate, the false positive rate captures the fraction of negative samples that are incorrectly flagged as positive.


Figure 3. Confusion matrix: random forest (left), logistic regression (middle), multilayer perceptron (right)

Table 2 shows a comparison of the three algorithms used in this study. Each algorithm was run 10 times, and the values shown are the mean and standard deviation of the AUC score. As can be seen from the table as well as Figure 4, all algorithms report fairly similar results, with the multilayer perceptron scoring slightly better than the other two. However, considering the influence of different partitions of the data set as well as other random disturbances in the training process, this difference can safely be ignored. In effect, all three algorithms demonstrate the same performance on this data set.

Table 2. AUC results generated from the three classification algorithms

Algorithm               Mean    Std
Random Forest           0.884   0.0032
Logistic Regression     0.883   0.0034
Multilayer Perceptron   0.886   0.0042

Figure 4. The mean ROC curves for three classifiers trained on the data set

When a model has high training accuracy but low test accuracy, the model is said to have high variance; when both training and test accuracy are low, the model suffers from high bias. Random forest is a fairly robust algorithm that is unlikely to overfit the data as long as the number of decision trees is large enough. To further diagnose variance and bias issues within the other two models, and whether increasing the number of training samples would help, a learning curve was plotted for each algorithm (Figure 5).


Figure 5. Learning Curves of two models: logistic regression (left), multilayer perceptron (right)

For both models, the testing score approaches the training score as the sample size increases from 1,000 to 2,500; both then remain at approximately the same accuracy and stop improving with further enlargement of the sample. As a result, collecting more data would be of little help for further improving classification accuracy, but introducing new features to build a more complex model might be. In addition, as a linear model, regular logistic regression may perform poorly at capturing non-linear relationships within the data. To remedy this, future research could include higher-order combinations of the original features in the model in order to achieve a better trade-off between bias and variance.
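A sketch of how such learning curves can be produced with scikit-learn (the size grid and cross-validation setting are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Cross-validated training/test scores at increasing training-set sizes;
# converging, plateauing curves suggest more data will not help much.
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(penalty="l2", C=100, max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),  # assumed grid
    cv=5,
)
print(train_sizes)
print(train_scores.mean(axis=1), test_scores.mean(axis=1))
```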

When building a predictive model for situations such as worry about financing retirement, most researchers care not only about the performance of the model but also about whether it allows human users to interpret the results. However, most machine learning algorithms do not offer a straightforward explanation of the relationships they learn. In neural networks, for example, an input is passed through many layers of transformations involving thousands or even millions of mathematical operations, which makes the algorithm difficult to interpret. Random forest does better in this respect, as it can calculate feature importances via a technique called mean decrease in impurity (MDI); Louppe et al. [6] describe the technical details of this technique as used in the scikit-learn package. Figure 6 shows the ten most important features and their respective importances in the random forest.

It is worth noting that the importances of all features add up to 1. Referring back to Table 1, it is easy to see that the top five features all belong to the financial group, which accounts for about 0.85 of the total importance. Based on this result, the financial group is the most informative group and plays the most essential role in this model.

Figure 6. Ten most important features in random forest
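A sketch of how the MDI importances are read off the fitted forest (feature_names is a hypothetical list of column names aligned with the feature matrix; it is not defined in the paper):

```python
import numpy as np

# feature_names: hypothetical list of the 28 (plus dummy) column names.
importances = forest.feature_importances_   # MDI values, summing to 1
top10 = np.argsort(importances)[::-1][:10]
for i in top10:
    print(f"{feature_names[i]}: {importances[i]:.3f}")
```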

Therefore, when proposing policies to address financial worry after retirement in a more targeted way, legislators should focus more on improving people's current financial condition or the overall economy rather than tailoring specific policies for different races, ages, or other demographic groups. The results also demonstrate that mental health, even though some of its factors appear among the top ten features, plays only a minor role in people's thinking process and should therefore be emphasized less.

Conclusion

The intention of this study was to build a predictive model with the best possible performance and to investigate the factors most related to Americans' worry about financing retirement. Three different models were built, and all of them achieved similarly strong performance. The study finds that current financial condition is the key group of factors involved.

One limitation of the study is that it only establishes the importance of financial factors in predicting worry but does not quantitatively examine their relationships; a potential direction in this area would be to analyze how they are correlated. In addition, the mean value imputation method used in this paper is a very simple and rough approach: every missing value in a feature is replaced by the same value. This method does not consider potential feature correlations and is likely to reduce data variance. Future studies could use more advanced imputation methods, such as k-nearest neighbors (kNN), which replaces a missing element with the mean of the k nearest neighbors of that particular sample, to obtain better performance.
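A sketch of the suggested alternative using scikit-learn's KNNImputer (the toy data and n_neighbors value are assumptions):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [7.0, 8.0]])

# Each missing entry is replaced by the mean of that feature over the
# k nearest complete samples, rather than the global column mean.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```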

References:

1. Ann L. Owen & Stephen Wu. Financial shocks and worry about the future. Empirical Economics, 33(3), 2007.- P. 515-530. URL: https://doi.org/10.1007/s00181-006-0115-0

2. Arora R., Basu A., Mianjy P. & Mukherjee A. Understanding deep neural networks with rectified linear units. 2016. ArXiv Preprint ArXiv:1611.01491.

3. Breiman L. Random forests. Machine Learning, 45(1), 2001.- P. 5-32.

4. Brown C. Dummies: Create dummy / indicator variables flexibly and efficiently. 2012. URL: https://CRAN.R-project.org/package=dummies

5. Hunter J. D. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 2007.- P. 90-95. URL: https://doi.org/10.1109/MCSE.2007.55


6. Louppe G., Wehenkel L., Sutera A. & Geurts P. Understanding variable importances in forests of randomized trees. In Advances in Neural Information Processing Systems 26, 2013.- P. 431-439.

7. McKinney W. Data Structures for Statistical Computing in Python. In S. van der Walt & J. Millman (Eds.), Proceedings of the 9th Python in Science Conference, 2010.- P. 51-56.

8. Morgan L. A. & Eckert J. K. Retirement Financial Preparation. Journal of Aging & Social Policy, 16(2), 2004.- P. 19-34. URL: https://doi.org/10.1300/J031v16n02_02

9. National Center for Health Statistics. About the National Health Interview Survey. January 16, 2019. URL: https://www.cdc.gov/nchs/nhis/about_nhis.htm

10. NHIS 2018 Sample Adult File. National Center for Health Statistics. (n.d.). Retrieved January 3, 2020, from URL: ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHIS/2018/samadultcsv.zip

11. NHIS 2018 Survey Description. National Center for Health Statistics. 2019. URL: ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NHIS/2018/srvydesc.pdf

12. Oliphant T. NumPy: A guide to NumPy. 2006. URL: http://www.numpy.org

13. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2011.- P. 2825-2830.

14. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. 2019. URL: https://www.R-project.org

15. Raschka S. Python machine learning, 2015 a.- P. 56-68. Packt Publishing Ltd.

16. Raschka S. Python machine learning, 2015 b.- P. 106-108. Packt Publishing Ltd.

17. Raschka S. Python machine learning, 2015 c.- P. 368-373. Packt Publishing Ltd.

18. Rich Morin & Richard Fry. More Americans Worry about Financing Retirement. Pew Research Center. 2012. URL: https://www.pewsocialtrends.org/2012/10/22/more-americans-worry-about-financing-retirement/

19. U.S. Social Security Administration. (n.d.). Full Retirement Age by Year of Birth. Benefits Planner: Retirement. Retrieved January 2, 2020, from URL: https://www.ssa.gov/planners/retire/agereduction.html

20. Virtanen P., Gommers R., Oliphant T. E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J., van der Walt S. J., Brett M., Wilson J., Jarrod Millman K., Mayorov N., Nelson A. R. J., Jones E., Kern R., Larson E., ... & SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. 2019. ArXiv E-Prints, arXiv:1907.10121.

21. Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. 2016. URL: https://ggplot2.tidyverse.org

22. Wickham H., François R., Henry L., & Müller K. dplyr: A Grammar of Data Manipulation. 2019. URL: https://CRAN.R-project.org/package=dplyr

23. Wickham H. & Henry L. tidyr: Tidy Messy Data. 2019. URL: https://CRAN.R-project.org/ package=tidyr

24. Witten I., Frank E., & Hall M. Data Mining: Practical Machine Learning Tools and Techniques. Burlington, MA. 2011.- P. 102-103.
