Research article: 'PREDICTING ADOLESCENT PHYSICAL ACTIVITY: DEVELOPMENT AND VALIDATION OF TWO PREDICTIVE MODELS'

Research article in the field of Computer and Information Sciences

CC BY
Keywords
PHYSICAL ACTIVITY / YOUTH RISK BEHAVIOR SURVEILLANCE SURVEY / PREDICTIVE MODEL / LOGISTIC REGRESSIONS

Abstract of the research article in computer and information sciences. Authors: Jiaao Bao, Dr. Jinan Liu

Physical activity plays an essential role in adolescents' physical and mental development. Regular physical activity can help reduce the risk of developing diseases such as type II diabetes, while physical inactivity increases the risk of cardiovascular disease. Parents and teachers therefore need a robust instrument to evaluate adolescents' physical activity. In this report, response data from 9,045 high school students aged 14 to 17 in the 2017 Youth Risk Behavior Surveillance Survey are analyzed. Several pre-processing techniques, such as missing value exclusion and min-max scaling, are applied to prepare the data set for model building. A list of selected variables, including physical attributes, demographic variables, and sleeping habits, is then used to develop and validate two predictive models of the probability of being physically active. The models are further validated by an overall evaluation of each model, statistical tests of individual predictors, and an assessment of the relative importance of the independent variables. The two models demonstrate good and similar performance, with AUCs of 0.724 and 0.732, respectively. The results indicate that holding more physical education (PE) classes should be the most effective way to improve adolescents' physical activity level.


Section 4. Life sciences

https://doi.org/10.29013/ELBLS-20-4-40-46

Jiaao Bao, George School,
1690 Newtown-Langhorne Rd, Newtown, PA 18940
E-mail: [email protected]

Dr. Jinan Liu, PhD, Director of Outcome Research, Merck & Co,
770 Sumneytown Pike, West Point, PA 19486
E-mail: [email protected]


1. Introduction

Physical activity plays an essential role in people's daily lives and has both short- and long-term health benefits. Regular physical activity can improve health and reduce the risk of developing diseases like type II diabetes, cancer, and cardiovascular disease. For adolescents, regular exercise can help improve cardiorespiratory fitness, build strong bones and muscles, control weight, reduce symptoms of anxiety and depression, and reduce the risk of developing health conditions such as obesity. According to the Centers for Disease Control and Prevention (CDC), the consequences of physical inactivity include energy imbalance and an increased risk of factors for cardiovascular disease. Therefore, for the health of students and children, it is important for schools, parents, and the government to understand the physical condition of adolescents.

The main hypothesis of this study is that the likelihood that a high school student engages in regular physical activity is related to one or more factors such as race, sex, age, weight, sleeping habits, dietary habits, smoking, alcohol use, and drug use. The main purpose of this study is to develop a predictive model to detect adolescents' physical activity condition. Two predictive models, a logistic regression and an artificial neural network, are built, and their respective performance is measured. With the models, schools can collect survey data and score each student's probability of performing regular physical activity. For students with a higher probability of inactivity, appropriate measures can be taken at an early stage to improve their physical condition. The predictive models can thus be used to help foster physically and mentally healthy adolescents.

2. Method

2.1 Data

The Youth Risk Behavior Surveillance System (hereinafter referred to as the YRBS dataset) is an epidemiologic surveillance system established by the CDC to monitor the prevalence of youth behaviors that most influence health [1] among 9th through 12th grade students. It uses a three-stage cluster sample design. YRBS is a cross-sectional study that focuses on priority health-risk behaviors established during youth that result in the most significant mortality, morbidity, disability, and social problems during both youth and adulthood. These include behaviors that result in unintentional and intentional injuries; tobacco use; alcohol and other drug use; sexual behaviors that result in HIV infection, other sexually transmitted diseases (STDs), and unintended pregnancies; dietary behaviors; and physical activity, plus obesity and asthma.

The 2017 YRBS dataset is used to identify potential associations between adolescent physical activity and factors including dietary behavior, drinking behavior, smoking behavior, drug use, and sleeping habits. Observations with missing data points are excluded from the analysis. After cleaning, there are 9,045 observations for students between 14 and 17 years old in the YRBS dataset. A list of selected questions is shown in Table 1.

Table 1. Description of the selected questions

| Item | Question | Function |
|---|---|---|
| 1 | How old are you? | Independent Variable |
| 2 | What is your sex? | Independent Variable |
| 3 | In what grade are you? | Independent Variable |
| 4 | Are you Hispanic or Latino? | Independent Variable |
| 6 | How tall are you without your shoes on? | Independent Variable |
| 7 | How much do you weigh without your shoes on? | Independent Variable |
| 30 | Have you ever tried cigarette smoking, even one or two puffs? | Independent Variable |
| 40 | During your life, on how many days have you had at least one drink of alcohol? | Independent Variable |
| 46 | During your life, how many times have you used marijuana? | Independent Variable |
| 69 | Which of the following are you trying to do about your weight? | Independent Variable |
| 78 | During the past 7 days, on how many days did you eat breakfast? | Independent Variable |
| 79 | During the past 7 days, on how many days were you physically active for a total of at least 60 minutes per day? | Dependent Variable |
| 80 | On an average school day, how many hours do you watch TV? | Independent Variable |
| 81 | On an average school day, how many hours do you play video or computer games or use a computer for something that is not school work? | Independent Variable |
| 82 | In an average week when you are in school, on how many days do you go to physical education (PE) classes? | Independent Variable |
| 84 | During the past 12 months, how many times did you have a concussion from playing a sport or being physically active? | Independent Variable |
| 88 | On an average school night, how many hours of sleep do you get? | Independent Variable |

2.2 Statistical Method

A two-stage process is used in this statistical analysis. In stage I, missing value exclusion, dichotomization, and min-max scaling are applied to prepare the data for training. Then a logistic regression and an artificial neural network model are developed, with physical activity as the dependent variable and the variables from the selected questions as independent variables. In stage II, several validation metrics are calculated for each model to measure and compare their relative performance.

2.2.1 Pre-processing

The data set is pre-processed in this step to improve both training speed and accuracy. As most machine learning algorithms cannot deal with missing values, all data points with missing entries are excluded from training. The dependent variable is then dichotomized: students who were physically active for a total of at least 60 minutes per day during the week prior to the survey are classified as physically active, and the remaining students as physically inactive.

Some machine learning algorithms, such as artificial neural networks, require a specific technique called feature scaling, which transforms different features into comparable scales for better training speed and accuracy. For each feature, its minimum and maximum values are first computed as $x_{\min}$ and $x_{\max}$. Then each data point $x_i$ of that feature is replaced by $y_i$, calculated as:

$$y_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$

Finally, the YRBS dataset is partitioned into two datasets: a training dataset (70%) for model development and a test dataset (30%) for model testing.
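The pre-processing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the use of `None` for a missing answer, the random seed, and the toy rows are all assumptions. Scaling is applied before the split because that is the order in which the text lists the steps.

```python
import random

def drop_missing(rows):
    """Keep only rows with no missing (None) entries."""
    return [row for row in rows if all(value is not None for value in row)]

def min_max_scale(rows):
    """Rescale every column to [0, 1] using y = (x - x_min) / (x_max - x_min)."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has x_max == x_min; map it to 0.0 to avoid dividing by zero.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Toy data: hypothetical (age, hours of sleep) answers, one row with a missing value.
raw = [(14, 8), (15, None), (16, 6), (17, 7), (15, 9), (14, 5),
       (16, 8), (17, 6), (15, 7), (14, 9), (16, 5)]
clean = drop_missing(raw)
scaled = min_max_scale(clean)
train, test = train_test_split(scaled)
```

With the ten complete toy rows, the split yields seven training rows and three test rows, mirroring the 70%/30% partition.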

2.2.2 Logistic Regression

Logistic regression belongs to a category of statistical models called generalized linear models. It allows one to predict a discrete outcome from a set of variables that may be continuous, discrete, dichotomous, or a combination of these. Typically, the dependent variable is dichotomous, and the independent variables are either categorical or continuous. In logistic regression, each feature $x_i$ has its specific weight $w_i$. The predicted probability $y$ of the positive class is related to the weighted sum of the features (the net input) as follows:

$$\ln\frac{y}{1 - y} = w_0 + w_1 x_1 + \dots + w_m x_m$$
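To make the relationship concrete, the sketch below computes the net input and inverts the logit to obtain a probability. The weights and feature values are hypothetical and are not the fitted estimates reported later; the function names are likewise illustrative.

```python
import math

def net_input(weights, features):
    """Weighted sum w_0 + w_1*x_1 + ... + w_m*x_m (weights[0] is the intercept)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def predict_probability(weights, features):
    """Invert the logit: y = 1 / (1 + exp(-net input))."""
    return 1.0 / (1.0 + math.exp(-net_input(weights, features)))

def logit(y):
    """Log-odds ln(y / (1 - y)) of a probability y."""
    return math.log(y / (1.0 - y))

# Hypothetical weights for two min-max scaled features; not fitted values.
weights = [-0.5, 1.2, -0.6]
features = [0.8, 0.3]
y = predict_probability(weights, features)
```

Applying `logit` to the predicted probability recovers the net input, which is exactly what the equation states.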

2.2.3 Artificial Neural Network

An artificial neural network is a computational model loosely inspired by the biological neural networks that constitute animal brains. An artificial neuron receives a signal, processes it, and can then signal the neurons connected to it. The "signal" at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs.

A multilayer network is an artificial neural network that consists of one input layer, several hidden layers, and one output layer. The input layer is the first layer, the output layer is the last layer, and any layers between them are hidden layers. The data are passed into the input layer, processed by the hidden layers, and finally transformed into predicted labels in the output layer. In this study, the model has two hidden layers.
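The forward pass of such a network can be sketched as follows. This is only an illustration of the layer structure: the sigmoid activation, the hidden layer sizes (5 and 3), the random untrained weights, and the input values are assumptions, since the text does not specify them.

```python
import math
import random

def sigmoid(z):
    """Non-linear activation mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: each neuron applies a sigmoid to its weighted input sum."""
    return [
        sigmoid(bias + sum(w * x for w, x in zip(neuron_weights, inputs)))
        for neuron_weights, bias in zip(weights, biases)
    ]

def random_layer(n_inputs, n_neurons, rng):
    """Random initial weights and zero biases for a layer (untrained)."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)] for _ in range(n_neurons)]
    biases = [0.0] * n_neurons
    return weights, biases

def forward(inputs, layers):
    """Pass the inputs through every layer in order and return the final activations."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations

rng = random.Random(0)
# 16 inputs (one per independent variable), two hidden layers, one output neuron.
# The hidden layer sizes (5 and 3) are assumptions chosen only for illustration.
layers = [random_layer(16, 5, rng), random_layer(5, 3, rng), random_layer(3, 1, rng)]
example = [0.5] * 16  # one min-max scaled observation (made-up values)
probability = forward(example, layers)[0]
```

Training would adjust the weights and biases, typically by backpropagation; the sketch covers only how data flow from the input layer through the hidden layers to the output.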

2.3 Model Validation

Consider a two-class prediction problem, where the outcomes are labeled either as positive or negative. There are four possible outcomes from a binary classifier. If the outcome from a prediction is positive and the actual value is also positive, then it is called a true positive (TP); however, if the actual value is negative, then it is said to be a false positive (FP). Conversely, a true negative (TN) occurs when both the prediction outcome and the actual value are negative, and a false negative (FN) occurs when the prediction outcome is negative while the actual value is positive. The true positive rate (TPR) can then be calculated as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

And the false positive rate (FPR) can be calculated as:

$$\mathrm{FPR} = \frac{FP}{TN + FP}$$
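As a small worked example of these definitions, the sketch below counts the four outcomes and derives both rates. The label vectors are made up for illustration, and the function names are not from the paper.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, tn, fn

def rates(actual, predicted):
    """Return (TPR, FPR) computed from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    tpr = tp / (tp + fn)
    fpr = fp / (tn + fp)
    return tpr, fpr

# Made-up labels: 4 actual positives, 6 actual negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tpr, fpr = rates(actual, predicted)
```

Here three of the four positives are caught (TPR = 0.75) and one of the six negatives is misclassified (FPR = 1/6).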

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The best possible prediction method would yield a point in the upper left corner of the ROC space. A random guess would give a point along the diagonal line from the bottom left corner to the top right corner. Points above the diagonal represent better than random classification results, while points below the line represent worse than random results. In general, ROC analysis is one tool to select possibly optimal models and to discard suboptimal ones independently of the class distribution. Sometimes it is hard to identify which algorithm performs better by looking directly at the ROC curves. The Area Under the Curve (AUC) overcomes this drawback by summarizing each ROC curve as a single number, the area under it, making it easier to compare models.
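The construction of the ROC curve and its area can be sketched as below, assuming each observation has a predicted score and that the area is approximated with the trapezoidal rule. The scores and labels are invented for illustration.

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold over every distinct score."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up labels and predicted scores for six observations.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(actual, scores)
area = auc(curve)
```

For this toy data the area is 8/9, because eight of the nine positive-negative pairs are ranked correctly. The diagonal of a random guess, `[(0, 0), (1, 1)]`, gives an area of 0.5.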

3. Results

3.1 Logistic Regression

The results of the logistic regression analysis of high school students being physically active are listed in Table 2. At the 95% confidence level, questions 2, 3, 6, 7, 40, 78, 81, 82, 84, and 88 are significant predictors of the dependent variable.

Table 2. Logistic regression results

| Predictor (Item Number) | Estimate | Standard Error of Estimate | Wald's χ² | P |
|---|---|---|---|---|
| 1 | 0.32 | 0.29 | 1.11 | 0.27 |
| 2 | 0.57 | 0.08 | 7.63 | < 0.001 |
| 3 | -0.55 | 0.21 | -2.55 | 0.01 |
| 4 | 0.088 | 0.066 | 1.34 | 0.18 |
| 6 | 1.38 | 0.29 | 4.78 | < 0.001 |
| 7 | -0.80 | 0.30 | -2.69 | < 0.001 |
| 30 | 0.10 | 0.076 | 1.32 | 0.19 |
| 40 | 0.63 | 0.12 | 5.17 | < 0.001 |
| 46 | -0.13 | 0.11 | -1.21 | 0.23 |
| 69 | 0.02 | 0.077 | 0.27 | 0.78 |
| 78 | 0.72 | 0.077 | 9.33 | < 0.001 |
| 80 | 0.15 | 0.093 | 0.16 | 0.87 |
| 81 | -0.61 | 0.077 | -7.95 | < 0.001 |
| 82 | 1.25 | 0.068 | 18.4 | < 0.001 |
| 84 | 0.76 | 0.18 | 4.19 | < 0.001 |
| 88 | 0.34 | 0.13 | 2.56 | 0.01 |
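The significance test for an individual predictor can be illustrated with a short sketch. It assumes the conventional Wald statistic in its z form (estimate divided by standard error) with a two-sided normal p-value; the paper does not state which form or software was used, so this is an assumption, and the function names are illustrative.

```python
import math

def wald_z(estimate, standard_error):
    """Wald statistic in z form: the estimate divided by its standard error."""
    return estimate / standard_error

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Estimate and standard error reported for item 82 (PE class attendance) in Table 2.
z = wald_z(1.25, 0.068)
p = two_sided_p_value(z)
significant = p < 0.05
```

For item 82 this gives a statistic of about 18.4, in line with the reported value, and a p-value far below 0.05.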

3.2 Artificial Neural Network

The structure of the artificial neural network is shown in Figure 1. The thickness of each line represents the corresponding weight.

Figure 1. Structure of the artificial neural network (input nodes I1 to I16 correspond to questions q1, q2, q3, q4, q6, q7, q30, q40, q46, q69, q78, q80, q81, q82, q84, and q88; the output node O1 corresponds to q79)

To find the relative importance of the independent variables, we use the method described by Garson for identifying the relative importance of independent variables for a single dependent variable in an artificial neural network [2]. The relative importance of a specific independent variable for the dependent variable can be determined by identifying all weighted connections between the nodes of interest. That is, all weights connecting the specific input node that pass through the hidden layer to the dependent variable are identified. This is repeated for all other independent variables until a list of all weights that are specific to each independent variable is obtained [2]. Figure 2 shows the importance of each question using Garson's algorithm.

3.3 Model Validation

Figure 3 displays the ROC curves for the two models, and Table 3 lists their respective AUC scores. Taken together, Figure 3 and Table 3 show that both models achieve rather similar performance, with the artificial neural network slightly better than the logistic regression. Both models also perform better than random guessing.

Figure 2. The importance of each question in the artificial neural network
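Garson's algorithm, which underlies the importance values in Figure 2, can be sketched for the simplest case of one hidden layer and one output neuron. This is the commonly cited form of the method and an assumption about how it was applied here; a network with more hidden layers would need an extended variant. Bias weights are ignored, and the weights in the example are made up.

```python
def garson_importance(input_hidden, hidden_output):
    """Relative importance of each input for a one-hidden-layer, one-output network.

    input_hidden[i][j] is the weight from input i to hidden neuron j;
    hidden_output[j] is the weight from hidden neuron j to the output.
    """
    n_inputs = len(input_hidden)
    n_hidden = len(hidden_output)
    # Contribution of input i through hidden neuron j: |w_ij| * |v_j|.
    contrib = [[abs(input_hidden[i][j]) * abs(hidden_output[j]) for j in range(n_hidden)]
               for i in range(n_inputs)]
    # Normalise within each hidden neuron so its inputs' shares sum to 1.
    column_sums = [sum(contrib[i][j] for i in range(n_inputs)) for j in range(n_hidden)]
    shares = [[contrib[i][j] / column_sums[j] for j in range(n_hidden)]
              for i in range(n_inputs)]
    # Sum each input's shares over hidden neurons, then rescale to sum to 1.
    totals = [sum(row) for row in shares]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

# Made-up weights: two inputs, two hidden neurons, one output.
input_hidden = [[3.0, -1.0], [1.0, 1.0]]
hidden_output = [1.0, -2.0]
importance = garson_importance(input_hidden, hidden_output)
```

In this toy network the first input receives 62.5% of the importance and the second 37.5%. Because absolute values are used, the method ranks inputs by magnitude of influence and says nothing about its direction.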

Table 3. The AUC score for the two models

| Algorithm | AUC Score |
|---|---|
| Logistic Regression | 0.724 |
| Artificial Neural Network | 0.732 |

Figure 3. The ROC curve for the logistic regression and artificial neural network

4. Discussion

The intention of this study is to build a predictive model with the best performance and to investigate the factors most related to adolescents' physical activity. Two models, a logistic regression and an artificial neural network, are built, and both achieve similar performance. Using Garson's algorithm, we find that questions 6, 82, and 84 are most related to adolescents' physical activity level. Table 2 corroborates this result by showing that these questions are also significant predictors of the dependent variable. Combining the results with Table 1, we see that holding more physical education (PE) classes would be the most effective way to increase adolescents' physical activity.

One limitation of the study is that data entries with missing values are excluded from the analysis. This is a time-saving but flawed approach. Depending on the number of such entries, we might remove too many sample points and lose valuable information the model needs to learn the relationships among the independent variables. In future studies, we may use more advanced techniques such as mean value imputation or k-nearest neighbors (kNN). The mean value imputation method replaces missing values with the mean of the entire feature. This is a simple but effective way to make those entries usable by the learning algorithm. Other techniques include k-nearest neighbors, which replaces missing values with the mean of the k nearest neighbors of that particular sample. This technique requires more effort but can generally achieve better performance.
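The two imputation strategies mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: missing values are represented by `None`, neighbors are drawn only from complete rows, distance is Euclidean over the columns the incomplete row does have, and the toy rows are invented. Every column is assumed to have at least one observed value.

```python
def mean_impute(rows):
    """Replace each missing value (None) with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[c] if row[c] is None else row[c] for c in range(n_cols)] for row in rows]

def knn_impute(rows, k=2):
    """Replace each missing value with the mean of that column over the k nearest complete rows.

    Distance is Euclidean over the columns observed in the incomplete row.
    """
    complete = [row for row in rows if None not in row]
    result = []
    for row in rows:
        if None not in row:
            result.append(list(row))
            continue
        known = [c for c, value in enumerate(row) if value is not None]

        def distance(other):
            return sum((row[c] - other[c]) ** 2 for c in known) ** 0.5

        neighbours = sorted(complete, key=distance)[:k]
        filled = [
            sum(n[c] for n in neighbours) / len(neighbours) if value is None else value
            for c, value in enumerate(row)
        ]
        result.append(filled)
    return result

# Toy data: the last row is missing its second value.
rows = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [2.5, None]]
by_mean = mean_impute(rows)
by_knn = knn_impute(rows, k=2)
```

In the toy data, mean imputation fills the gap with 40.0, pulled upward by the distant third row, whereas the two nearest neighbors give 15.0, which is closer to what the similar rows suggest.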

References:

1. National YRBS Data User's Manual, 2009.

2. Garson G. D. Interpreting neural network connection weights. Artificial Intelligence Expert, 6(4), 1991, 46-51.

3. National YRBS Data User's Manual, 2017. URL: https://www.cdc.gov/healthyyouth/data/yrbs/index.htm

4. Peng C. J., Lee K. L., Ingersoll G. M. An Introduction to Logistic Regression Analysis and Reporting. The Journal of Educational Research, 96(1), 3-14.

5. Tabachnick B., Fidell L. Using Multivariate Statistics (4th Ed.). Needham Heights, MA: Allyn & Bacon, 2001.

6. StatSoft. Electronic Statistics Textbook. URL: http://www.statsoft.com/textbook/stathome.html

7. Stokes M., Davis C. S. Categorical Data Analysis Using the SAS System. SAS Institute Inc., 1995.
