Development of a model to predict the risk factors for heart disease

Scientific article in the specialty "Clinical medicine"

CC BY

Section 3. Preventive medicine

Wang Tianxin, Westtown School, PA E-mail: 271690S305@qq.com

DEVELOPMENT OF A MODEL TO PREDICT THE RISK FACTORS FOR HEART DISEASE

Abstract

Objective: This study aims to 1) examine the predictors of heart disease and 2) build a predictive model for heart disease using an artificial neural network and compare its performance with that of a logistic regression model.

Methods: A public database was used for this study. This dataset focuses on the prediction of indicators/diagnosis of heart disease. The features cover demographic information, habits, and historic medical records.

All eligible participants were randomly assigned to one of 2 groups: a training sample and a testing sample. Two models were built using the training sample: an artificial neural network and a logistic regression. We used these two models to predict the risk of heart disease in the testing sample. Receiver operating characteristic (ROC) curves were calculated and compared to assess the discrimination of the two models, and a curve of predicted versus observed probability was plotted to assess their calibration.

Results: About 54.5% (n = 165) of the 303 participants had heart disease; 75% of the 96 women and 45% of the 207 men had heart disease. Men had 82.8% (1-0.172) lower odds of heart disease than women. Patients with chest pain had 136% (2.363-1) higher odds of heart disease than patients without chest pain. The odds of heart disease increased by 2% for each 1-unit increase in the maximum heart rate achieved. Patients with exercise-induced angina had 62.5% (1-0.375) lower odds of heart disease. Patients with exercise-induced ST depression had 41.7% (1-0.583) lower odds of heart disease. The odds of heart disease decreased by 53.9% for each additional major vessel (0-3) colored by fluoroscopy. Patients without thalassemia were less likely to have heart disease.

According to this neural network, the top 5 most important predictors were the number of major vessels (0-3) colored by fluoroscopy (ca), resting electrocardiographic results (restecg), serum cholesterol in mg/dl (chol), maximum heart rate achieved (thalach), and thalassemia (thal).

In the training sample, the area under the ROC curve was 0.94 for logistic regression and 0.99 for the artificial neural network, so the neural network clearly performed better. In the testing sample, however, it was 0.90 for logistic regression and 0.85 for the artificial neural network, so the neural network performed worse.

Conclusions: In this study, we identified several important predictors of heart disease, e.g., sex and maximum heart rate. These findings give providers and patients information that can support timely intervention.

Keywords: predictors of heart disease, the risk factors for heart disease, heart disease, predictive model, heart rate.

1. Introduction

Heart disease is the leading cause of death for both men and women. More than half of the deaths due to heart disease in 2015 were in men. About 630,000 Americans die from heart disease each year, which is 1 in every 4 deaths [1]. Coronary heart disease is the most common type of heart disease, killing about 366,000 people in 2015. In the United States, someone has a heart attack every 40 seconds [2]. Each minute, more than one person in the United States dies from a heart disease-related event. Heart disease is the leading cause of death for people of most racial/ethnic groups in the United States, including African Americans, Hispanics, and whites. For Asian Americans or Pacific Islanders and American Indians or Alaska Natives, heart disease is second only to cancer. Heart disease costs the United States about $200 billion each year. This total includes the cost of health care services, medications, and lost productivity.

High blood pressure, high LDL cholesterol, and smoking are key risk factors for heart disease. About half of Americans (49%) have at least one of these three risk factors [3].

This study aims to 1) examine the predictors of heart disease and 2) build a predictive model for heart disease using an artificial neural network and compare its performance with that of a logistic regression model.

2. Data and Methods:

Data:

Data Set Information:

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date.

The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1, 2, 3, 4) from absence (value 0).

Attribute Information:

Only 14 attributes used:
1. #3 (age)
2. #4 (sex)
3. #9 (cp)
4. #10 (trestbps)
5. #12 (chol)
6. #16 (fbs)
7. #19 (restecg)
8. #32 (thalach)
9. #38 (exang)
10. #40 (oldpeak)
11. #41 (slope)
12. #44 (ca)
13. #51 (thal)
14. #58 (num) (the predicted attribute)

The data could be downloaded at: https://www.kaggle.com/ronitf/heart-disease-uci

Variables:

Table 1.- Variables used in this study

age: age in years
sex: 1 = male; 0 = female
cp: chest pain type
trestbps: resting blood pressure (in mm Hg on admission to the hospital)
chol: serum cholesterol in mg/dl
fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg: resting electrocardiographic results
thalach: maximum heart rate achieved
exang: exercise induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest
slope: the slope of the peak exercise ST segment
ca: number of major vessels (0-3) colored by fluoroscopy
thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
target: 1 or 0
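As a minimal sketch of the presence/absence coding described for the "goal" field, the 0 to 4 value can be collapsed into a binary target. The function name `binarize_goal` and the example values are hypothetical; they are not part of the dataset or of this study's code.

```python
def binarize_goal(goal: int) -> int:
    """Collapse the 0-4 'goal' value into 1 (presence) or 0 (absence)."""
    if goal not in (0, 1, 2, 3, 4):
        raise ValueError(f"goal must be an integer from 0 to 4, got {goal!r}")
    return 1 if goal >= 1 else 0


# Illustrative values only; these are not taken from the dataset.
example_goals = [0, 2, 4, 0, 1]
example_targets = [binarize_goal(g) for g in example_goals]
print(example_targets)
```

The resulting 1/0 value corresponds to the "target" variable listed in Table 1.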

3. Results

About 54.5% (n = 165) of the 303 participants had heart disease; 75% of the 96 women and 45% of the 207 men had heart disease.

Basically, a corrgram is a graphical representation of the cells of a matrix of correlations. The idea is to display the pattern of correlations in terms of their signs and magnitudes using visual thinning and correlation-based variable ordering. Moreover, the cells of the matrix can be shaded or colored to show the correlation value. The positive correlations are shown in blue, while the negative correlations are shown in red; the darker the hue, the greater the magnitude of the correlation.

Figure 1. Matrix of correlations between variables
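The matrix behind a corrgram can be sketched in a few lines. The example below assumes Pearson correlation (the text does not name the coefficient) and uses made-up toy values rather than the study data; the helper names `pearson`, `correlation_matrix`, and `cell_color` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def cell_color(r):
    """Color convention from the text: blue for positive, red for negative."""
    if r > 0:
        return "blue"
    if r < 0:
        return "red"
    return "none"  # assumption: an exactly zero correlation is left uncolored


# Made-up toy values, for illustration only.
toy_columns = {
    "thalach": [150, 160, 170, 140],
    "oldpeak": [2.0, 1.0, 0.5, 2.5],
    "target": [0, 1, 1, 0],
}
matrix = correlation_matrix(toy_columns)
for name, row in matrix.items():
    print(name, {other: (round(r, 2), cell_color(r)) for other, r in row.items()})
```

The hue intensity described in the text would be driven by the absolute value of each cell.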

According to the logistic regression, the significant predictors are sex, chest pain, maximum heart rate achieved (thalach), exercise induced angina (exang), ST depression induced by exercise (oldpeak), number of major vessels (0-3) colored by fluoroscopy (ca), and thalassemia (thal).

Table 2.- Logistic Regression for Heart Disease

Variable       Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)       3.450        2.571     1.342      0.180
age              -0.005        0.023    -0.212      0.832
sex              -1.758        0.469    -3.751      0.000 ***
cp                0.860        0.185     4.638      0.000 ***
trestbps         -0.019        0.010    -1.884      0.060
chol             -0.005        0.004    -1.224      0.221
fbs               0.035        0.529     0.066      0.947
restecg           0.466        0.348     1.339      0.181
thalach           0.023        0.010     2.219      0.026 *
exang            -0.980        0.410    -2.391      0.017 *
oldpeak          -0.540        0.214    -2.526      0.012 *
slope             0.579        0.350     1.656      0.098
ca               -0.773        0.191    -4.051      0.000 ***
thal             -0.900        0.290    -3.104      0.002 **

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
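The columns of Table 2 are related to one another. Assuming the table reports the usual Wald test for each coefficient (the table itself does not say so), the z value is the estimate divided by its standard error, and Pr(>|z|) is the two-sided tail probability of a standard normal distribution. The sketch below illustrates that arithmetic; the function names are hypothetical.

```python
import math


def wald_z(estimate, std_error):
    """Wald statistic: coefficient estimate divided by its standard error."""
    return estimate / std_error


def two_sided_p(z):
    """Two-sided standard normal tail probability, P(|Z| >= |z|)."""
    return math.erfc(abs(z) / math.sqrt(2))


# Estimate and standard error for 'sex', copied from Table 2.
z_sex = wald_z(-1.758, 0.469)
print(round(z_sex, 3))

# z value for 'thalach', copied from Table 2.
p_thalach = two_sided_p(2.219)
print(round(p_thalach, 3))
```

Because the table entries are rounded to three decimals, a z value recomputed this way can differ slightly from the printed one.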

Figure 2. Artificial Neural Network in training sample

Men had 82.8% (1-0.172) lower odds of heart disease than women. Patients with chest pain had 136% (2.363-1) higher odds of heart disease than patients without chest pain. The odds of heart disease increased by 2% for each 1-unit increase in the maximum heart rate achieved. Patients with exercise-induced angina had 62.5% (1-0.375) lower odds of heart disease. Patients with exercise-induced ST depression had 41.7% (1-0.583) lower odds of heart disease. The odds of heart disease decreased by 53.9% for each additional major vessel (0-3) colored by fluoroscopy. Patients without thalassemia were less likely to have heart disease.
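The figures in parentheses, such as 0.172 and 2.363, follow from exponentiating the logistic regression coefficients in Table 2, which gives odds ratios. The short sketch below reproduces that arithmetic; the function names are hypothetical, and the coefficients are copied from Table 2.

```python
import math

# Coefficients of the significant predictors, copied from Table 2.
coefficients = {
    "sex": -1.758,
    "cp": 0.860,
    "thalach": 0.023,
    "exang": -0.980,
    "oldpeak": -0.540,
    "ca": -0.773,
    "thal": -0.900,
}


def odds_ratio(coefficient):
    """Odds ratio for a 1-unit increase in the predictor."""
    return math.exp(coefficient)


def percent_change_in_odds(coefficient):
    """Percentage change in the odds for a 1-unit increase in the predictor."""
    return (math.exp(coefficient) - 1.0) * 100.0


for name, beta in coefficients.items():
    print(f"{name}: OR = {odds_ratio(beta):.3f}, "
          f"change in odds = {percent_change_in_odds(beta):+.1f}%")
```

Since the table coefficients are rounded, a recomputed percentage can differ from the text in the last digit.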

In Figure 2, line thickness represents weight magnitude and line color represents weight sign (black = positive, grey = negative). The network is essentially a black box, so we cannot say much about the fitting, the weights, or the model. Suffice it to say that the training algorithm has converged and the model is therefore ready to be used.

Figure 3. Variable Importance in Artificial Neural Network

Figure 4. ROC in training sample for Logistic Regression (Red) vs Neural Network (Blue)

According to this neural network, the top 5 most important predictors were the number of major vessels (0-3) colored by fluoroscopy (ca), resting electrocardiographic results (restecg), serum cholesterol in mg/dl (chol), maximum heart rate achieved (thalach), and thalassemia (thal).

In the training sample, the area under the ROC curve was 0.94 for logistic regression and 0.99 for the artificial neural network, so the neural network clearly performed better. In the testing sample, however, it was 0.90 for logistic regression and 0.85 for the artificial neural network, so the neural network performed worse.

[Plot omitted; horizontal axis: False positive rate]

Figure 5. ROC in testing sample for Logistic Regression (Red) vs Neural Network (Blue)
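Assuming the ROC values reported above are areas under the ROC curve, one common way to compute such a value is as the fraction of (diseased, non-diseased) pairs in which the diseased patient receives the higher predicted probability, with ties counted as one half. The sketch below uses made-up labels and scores, not the study's predictions, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities, for illustration only.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

Applying the same function to the training and testing predictions of each model would give the kind of four-way comparison reported in the text.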

4. Discussions

Men had 82.8% (1-0.172) lower odds of heart disease than women. Patients with chest pain had 136% (2.363-1) higher odds of heart disease than patients without chest pain. The odds of heart disease increased by 2% for each 1-unit increase in the maximum heart rate achieved. Patients with exercise-induced angina had 62.5% (1-0.375) lower odds of heart disease. Patients with exercise-induced ST depression had 41.7% (1-0.583) lower odds of heart disease. The odds of heart disease decreased by 53.9% for each additional major vessel (0-3) colored by fluoroscopy. Patients without thalassemia were less likely to have heart disease.

This study has limitations. Some known factors that might predict heart disease, such as family history of heart disease, were not available in this study. Furthermore, we did not test external validity for either the logistic regression or the ANN. However, we did perform a comprehensive split-sample validation with both strategies. Future studies could use outside data to test the performance of the two models from this study.
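A split-sample validation of the kind mentioned above can be sketched as a seeded random split. The 70/30 proportion, the seed, and the function name `split_sample` are assumptions made for illustration; the study does not report the proportion it used.

```python
import random


def split_sample(records, train_fraction=0.7, seed=0):
    """Randomly split records into a training sample and a testing sample."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# 303 placeholder participant IDs, matching the sample size in this study.
participant_ids = list(range(303))
train_ids, test_ids = split_sample(participant_ids)
print(len(train_ids), len(test_ids))
```

Both models would then be fitted on the training sample only and evaluated on the testing sample.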

A predictive model would be an extremely useful tool for detecting heart disease. When the variables included in our tool are available, the risk could be easily predicted. Early detection and intervention could then be made available to people at high risk.
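To illustrate how a risk could be computed once the variables are available, the sketch below applies the logistic function to the coefficients in Table 2. The patient values are made up, the function name `predicted_probability` is hypothetical, and each variable is assumed to use the same coding as the data the model was fitted on, which the text does not fully specify.

```python
import math

# Coefficients copied from Table 2.
COEFFICIENTS = {
    "(Intercept)": 3.450, "age": -0.005, "sex": -1.758, "cp": 0.860,
    "trestbps": -0.019, "chol": -0.005, "fbs": 0.035, "restecg": 0.466,
    "thalach": 0.023, "exang": -0.980, "oldpeak": -0.540, "slope": 0.579,
    "ca": -0.773, "thal": -0.900,
}


def predicted_probability(patient):
    """Predicted probability of heart disease from the logistic model."""
    linear = COEFFICIENTS["(Intercept)"]
    for name, beta in COEFFICIENTS.items():
        if name != "(Intercept)":
            linear += beta * patient[name]  # KeyError if a variable is missing
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical patient; values and coding are assumptions for illustration.
example_patient = {
    "age": 55, "sex": 1, "cp": 1, "trestbps": 130, "chol": 240, "fbs": 0,
    "restecg": 1, "thalach": 150, "exang": 0, "oldpeak": 1.0, "slope": 1,
    "ca": 0, "thal": 2,
}
risk = predicted_probability(example_patient)
print(round(risk, 2))
```

This is only a sketch of the calculation; it has not been validated and is not a clinical tool.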

References:

1. Centers for Disease Control and Prevention, National Center for Health Statistics. Multiple Cause of Death 1999-2015 on CDC WONDER Online Database, released December 2016. Data are from the Multiple Cause of Death Files, 1999-2015, as compiled from data provided by the 57 vital statistics jurisdictions through the Vital Statistics Cooperative Program.

2. Heron M. Deaths: Leading causes for 2014. National Vital Statistics Reports. 2016; 65(5).

3. CDC. Million Hearts™: strategies to reduce the prevalence of leading cardiovascular disease risk factors. United States, 2011. MMWR 2011; 60(36): 1248-51.
