COMPARATIVE ANALYSIS OF THREE LEADING PREDICTIVE MODELS FOR LUNG CANCER DETECTION

Anel Abdugulova; Aizhan Altaibek; Gulbanu Abdugulova; Kaldygul Kushimbayeva; Valeriy Makarov

ANEL ABDUGULOVA, AIZHAN ALTAIBEK

International Information Technology University, Manas Str 34/1, Almaty, 050000, Kazakhstan

GULBANU ABDUGULOVA, KALDYGUL KUSHIMBAYEVA

Sanzhar Asfendiyarov Kazakh National Medical University, Tole bi 94, Almaty, 050012, Kazakhstan

VALERIY MAKAROV

Almaty Regional Multidisciplinary Clinic, 69a Rosa Baglanova Str, Almaty, 050010, Kazakhstan

Abstract. Lung cancer is one of the most significant challenges for global public health, which makes effective early detection methods necessary to improve patient outcomes. In recent years, the integration of predictive modeling and machine learning (ML) techniques has emerged as a promising way to improve both the prognosis and the early detection of lung cancer. This article compares three leading ML models, Support Vector Machines (SVM), Random Forest, and K Nearest Neighbors Classification (K-NN), for their proficiency in lung cancer identification.

Through a comprehensive evaluation of efficacy, precision, and robustness, the study seeks to explain the comparative strengths and weaknesses of these models, thereby facilitating informed decision-making regarding their clinical deployment. The investigation encompasses a spectrum of methodologies, ranging from cutting-edge deep learning algorithms to traditional statistical approaches, each offering distinct advantages in terms of computational efficiency and predictive efficacy.

Keywords. Lung cancer, Predictive modeling, Machine learning, Early detection, Support Vector Machines (SVM), Random Forest, K Nearest Neighbors (K-NN), Comparative analysis

INTRODUCTION

Globally, the prevalence of cancer and its aftereffects have been rising quickly in recent years. Although the causes are numerous and intricate, they include changes in the distribution and prevalence of the main cancer risk factors, many of which are connected to socioeconomic development, as well as the aging and growth of the population.[1][2] A number of countries have witnessed sharp declines in the mortality rates of stroke and coronary heart disease relative to cancer as a result of the world's population aging and rising cancer rates.

Figure 1 provides a global estimate of age-standardized cancer incidence rates for 2020. It includes both males and females in the age range of 0-74 and covers all cancer types. The underlying information is accessible online.

Consequently, early detection of lung cancer plays a vital role in improving patient survival outcomes, as timely intervention can substantially enhance treatment efficacy and overall survival rates. Advances in machine learning and predictive modeling techniques show promising potential for aiding the early diagnosis and prognosis of lung cancer.

Figure 1: Worldwide estimated age-standardized incidence rates in 2020

This study focuses only on lung cancer, which is one of the most widespread cancers in the world and has remained one of the most significant medical and socioeconomic issues in recent years [3]. According to statistically significant associations, the main causes are cardiopulmonary syndrome, cigarette smoking, and air pollution [4][5].

This pathology claims the lives of up to 1.8 million people annually. The International Agency for Research on Cancer (IARC) reports that over 2 million men and women are diagnosed with lung cancer annually, with men accounting for two thirds of cases (1,368,524) and women for one third (725,352). According to the IARC statistics on the incidence of lung cancer for 2020 [8], Kazakhstan is ranked among the top ten countries.

Machine learning (ML) has expanded quickly in recent years as a result of advances in algorithmic innovation, processing capacity, and data collection.[6] Machine learning techniques play a crucial role in solving complex problems across multiple industries, including banking, healthcare, image recognition, and natural language processing. To determine which machine learning approach is most appropriate for a particular task, it is imperative to understand the benefits and drawbacks of the different approaches.

This article gives a comprehensive comparative analysis of three machine learning techniques to help practitioners, data scientists, and researchers choose the best tool for their specific applications.[7]

METHODS AND RESEARCH

a. Dataset Description

This research used the "Lung cancer" dataset [13], which contains information about each lung cancer case diagnosed during the trial, including additional parameters recorded for the same individual, such as smoking history and chest pain.

An overall description of the dataset is shown in Figure 2 below:

Dataset : (309, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype
 0   GENDER                 309 non-null    object
 1   AGE                    309 non-null    int64
 2   SMOKING                309 non-null    int64
 3   YELLOW_FINGERS         309 non-null    int64
 4   ANXIETY                309 non-null    int64
 5   PEER_PRESSURE          309 non-null    int64
 6   CHRONIC DISEASE        309 non-null    int64
 7   FATIGUE                309 non-null    int64
 8   ALLERGY                309 non-null    int64
 9   WHEEZING               309 non-null    int64
 10  ALCOHOL CONSUMING      309 non-null    int64
 11  COUGHING               309 non-null    int64
 12  SHORTNESS OF BREATH    309 non-null    int64
 13  SWALLOWING DIFFICULTY  309 non-null    int64
 14  CHEST PAIN             309 non-null    int64
 15  LUNG_CANCER            309 non-null    object
dtypes: int64(14), object(2)
memory usage: 38.8+ KB

Figure 2: Overall description of lung cancer dataset.

Figure 3 below illustrates the number of patients in the dataset diagnosed with lung cancer.

[Plot of patient counts by LUNG_CANCER class]

Figure 3: Statistic of lung cancer cases in dataset.

A comprehensive dataset containing clinical, demographic, and imaging data of cancer patients is required to achieve optimal results and higher accuracy. Thorough preprocessing procedures, such as feature selection, data cleaning, and normalization, are performed on the data to guarantee its quality and suitability for model development. Preprocessing will produce a training set that is appropriate for data training. [9]
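The preprocessing steps named above (data cleaning, encoding, and normalization) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the three records are invented, and the category labels "M"/"F" and "YES"/"NO" are assumptions about how the two text columns are coded. In practice the same steps are usually carried out with pandas and scikit-learn.

```python
# Minimal preprocessing sketch in plain Python (no third-party packages).
# The records are made up for illustration; the labels "M"/"F" and
# "YES"/"NO" are assumptions about the coding of the two text columns.
records = [
    {"GENDER": "M", "AGE": 69, "SMOKING": 1, "LUNG_CANCER": "YES"},
    {"GENDER": "F", "AGE": 59, "SMOKING": 2, "LUNG_CANCER": "NO"},
    {"GENDER": "M", "AGE": 74, "SMOKING": 2, "LUNG_CANCER": "YES"},
]


def clean(rows):
    """Data cleaning: drop rows that contain a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def encode(rows):
    """Turn the two text columns into 0/1 integers."""
    encoded = []
    for r in rows:
        r = dict(r)  # copy so the input rows are not modified
        r["GENDER"] = 1 if r["GENDER"] == "M" else 0
        r["LUNG_CANCER"] = 1 if r["LUNG_CANCER"] == "YES" else 0
        encoded.append(r)
    return encoded


def min_max_normalize(rows, column):
    """Rescale one numeric column to the range [0, 1]."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for a constant column
    return [{**r, column: (r[column] - lo) / span} for r in rows]


prepared = min_max_normalize(encode(clean(records)), "AGE")

# Feature matrix X and target vector y, ready for a classifier.
X = [[r[c] for c in ("GENDER", "AGE", "SMOKING")] for r in prepared]
y = [r["LUNG_CANCER"] for r in prepared]
```

Normalization matters most for distance-based models such as K-NN and SVM, where a wide-range column like AGE would otherwise dominate the binary symptom columns.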

When predicting lung cancer using machine learning models, it's important to consider factors such as data availability, model complexity, interpretability, and performance. Here are three commonly used machine learning models that have shown promise in predicting lung cancer:

1. Support Vector Machines (SVM)

SVM is a powerful supervised learning algorithm that is commonly used for classification tasks. By finding the optimal hyperplane that best separates the classes in the feature space, it works well for both linearly separable and non-linearly separable data. SVMs can process nonlinear data by mapping it into a higher-dimensional space, and they are effective for binary classification. Although their high generalization performance is well known, large datasets may require substantial computational resources.

To build a support vector classifier, the decision boundary (a hyperplane in the feature space) is chosen so that the data points of the different classes are separated by the widest possible margin. For this reason it is known as the maximum margin classifier.[12] SVM has been successfully applied in various medical diagnosis tasks, including lung cancer prediction, by using features extracted from medical imaging data such as CT scans or X-rays.

Applying the SVM model to the lung cancer dataset gives an accuracy of 0.8064516129032258, which is considered a good result.
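For reference, the maximum margin idea can be written as an optimization problem. The notation is an assumption introduced here, not taken from the article: x_i are the feature vectors, y_i in {-1, +1} are the class labels, w is the normal vector of the hyperplane, and b is its offset. This is the standard hard-margin form for linearly separable data; the article does not state which formulation or kernel was actually used.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1, \dots, n
```

The width of the margin between the two classes is 2 / ||w||, so minimizing ||w|| is the same as maximizing the margin.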

# Test score
scoresvmcla = svmcla.score(X_test, Ytest)
print(scoresvmcla)

0.8064516129032258

Overall, SVMs are a powerful and versatile machine learning model that can be effectively used for predicting lung cancer by leveraging their ability to handle high-dimensional data, robustness to overfitting, capability to model non-linear relationships, optimal margin properties, feature selection capabilities, good performance, and interpretability.

2. Random Forest:

By combining several decision trees into an ensemble model, Random Forests mitigate the overfitting of decision trees. [11] This method is excellent at lowering variance and raising accuracy; it is used in fraud detection, bioinformatics, and image classification. A Random Forest classifier, which consists of numerous decision trees, is an illustration of an ensemble learning method. Compared to other approaches, it is more effective and has an easy-to-understand framework. When it comes to various classifier types, the most demanding consideration is their ability to adapt to problem space settings and their independence from the data domain.

As a result, by classifying a data point with several trees and combining their predictions, the Random Forest classifier performs better. In the study "Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers" by Lavanya C, Pooja S, Abhay H Kashyap, and Abdur Rahaman, the features FC, logFC, and P-value were chosen to train the machine learning model. Applying the Random Forest model to the lung cancer dataset gives an accuracy of 0.9032258064516129, which is a very promising result compared to the accuracy previously obtained from the SVM model.

# Test score
scorerfcla = rfcla.score(X_test, Ytest)
print(scorerfcla)

0.9032258064516129

Overall, Random Forest is a versatile and powerful algorithm that can be well-suited for predicting lung cancer due to its ensemble learning nature, ability to handle high-dimensional data, robustness to overfitting, feature importance analysis, capability to capture non-linear relationships, scalability, and ease of tuning.

3. K Nearest Neighbors Classification (K-NN)

K Nearest Neighbors Classification (K-NN) is one of the most common machine learning classifiers and is easy to understand visually. K-NN works well for clustering and recommendation systems, but it can be computationally expensive for large datasets.

From the perspective of pattern recognition, the K-NN algorithm is a non-parametric method used for classification and regression. The input consists of the K closest training samples in the feature space, and the output is a class membership. If K = 1, the sample is simply assigned the class of its single nearest neighbor. [10] K-NN is a type of instance-based learning.

Initialization: define the parameter K
Calculate the distance between the test sample and all the training samples
Sort the distances
Take the K nearest neighbours
Gather the categories of the nearest neighbours
Apply a simple majority vote over the categories

Figure 4: Implementation steps of K-NN algorithm.
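The steps in Figure 4 can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the code used in the study: the function name `knn_predict`, the Euclidean distance metric, and the toy training points are all hypothetical choices made here.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, sample, k=3):
    """Classify `sample` by majority vote among its k nearest training points."""
    # Calculate the distance between the test sample and all training samples
    # (Euclidean distance is an assumption; other metrics are possible).
    distances = [
        (math.dist(x, sample), label) for x, label in zip(train_X, train_y)
    ]
    # Sort the distances, smallest first.
    distances.sort(key=lambda pair: pair[0])
    # Take the k nearest neighbours and gather their categories.
    nearest_labels = [label for _, label in distances[:k]]
    # Apply a simple majority vote over the categories.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-feature training set, invented for illustration.
train_X = [[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [0.2, 0.1]]
train_y = ["NO", "NO", "YES", "YES", "NO"]

prediction = knn_predict(train_X, train_y, [0.95, 0.9], k=3)
print(prediction)
```

Because every prediction scans the whole training set, the cost grows with the number of training samples, which is the computational expense mentioned above. With two classes, an odd K avoids tied votes.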

Applying K Nearest Neighbors Classification (K-NN) to the lung cancer dataset gives an accuracy of 0.8387096774193549, which is an encouraging result compared to the previously considered models.
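The three reported accuracies can be read as simple fractions, which helps to judge how large the differences between the models really are. The short check below converts each reported value to its nearest fraction with a small denominator. The conclusion that the test set held 31 samples (about 10% of the 309 rows) is an inference from the numbers, not something stated in the article.

```python
from fractions import Fraction

# Accuracies exactly as reported in the text.
reported = {
    "SVM": 0.8064516129032258,
    "Random Forest": 0.9032258064516129,
    "K-NN": 0.8387096774193549,
}

# Closest fraction with a denominator of at most 1000 for each accuracy.
as_fractions = {
    name: Fraction(acc).limit_denominator(1000) for name, acc in reported.items()
}

for name, frac in as_fractions.items():
    print(f"{name}: {frac.numerator} correct out of {frac.denominator}")
```

All three values reduce to a denominator of 31: 25/31 for SVM, 28/31 for Random Forest, and 26/31 for K-NN. If the test set did contain 31 samples, the models differ by only one to three test cases, so the ranking should be read with caution.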

RESULTS AND DISCUSSION

Natural language processing, image recognition, healthcare, and finance are just a few of the areas where machine learning methods have proven to be invaluable in resolving complex problems. The specific requirements of a given problem, the available data, and the accessible computational resources should all be considered when choosing a machine learning method.

It is essential to choose the method that best fits the task at hand, because each one has particular characteristics, benefits, and drawbacks. Therefore, careful model selection and evaluation are required to ensure that the chosen method is compatible with the problem's objectives.

[Bar chart "Accuracy results" for SVM, Random forest, and K-NN; vertical axis from 0.74 to 0.92]

Figure 5: Comparison of accuracy results of 3 considered ML models

Based on the accuracy results for lung cancer detection, the Random Forest model achieved notably higher accuracy than the other two models, although they also performed well. This advantage can be attributed to the ensemble nature of the method: the model builds many decision trees, each on a randomly chosen subset of the features, and combines their predictions.

Random Forest is an appealing choice for many real-world applications because it is resistant to noise and outliers, manages high-dimensional datasets effectively and yields estimates of feature relevance. Support Vector Machines and K Nearest Neighbors Classification also yielded encouraging outcomes, affirming their efficacy as proficient models for addressing the challenge of lung cancer prediction.
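A like-for-like comparison of the three models on a single train/test split could look like the sketch below. It is an illustration under stated assumptions rather than the study's actual code: it assumes scikit-learn is installed, uses a synthetic stand-in dataset with the same shape as the survey data (309 rows, 15 features), and the 10% hold-out, the feature scaling, and all hyperparameters are choices made here, since the article does not report them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 309 samples, 15 features, 2 classes.
X, y = make_classification(n_samples=309, n_features=15, random_state=0)

# Hold out 10% of the rows for testing (assumption), keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)

# Hyperparameters are illustrative defaults, not the study's settings.
# SVM and K-NN are distance-based, so their inputs are standardized first.
models = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "K-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the held-out samples.
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

With a test set this small, a single split gives a noisy estimate. Repeating the comparison with cross-validation, for example `sklearn.model_selection.cross_val_score`, would give a more stable ranking of the three models.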

CONCLUSION

Cancer incidence and mortality rates are on the rise globally, driven by factors such as population aging, evolving cancer risk profiles, and socioeconomic progress. Notably, in some nations cancer-related deaths now surpass those attributed to coronary heart disease and stroke. A closer examination of Asian nations reveals notably elevated cancer incidence rates, particularly in Japan and South Korea. Within the spectrum of malignancies, lung cancer persists as a formidable global health challenge, strongly linked to risk determinants such as atmospheric pollution and tobacco consumption.

The realm of machine learning remains dynamic, with ongoing research continually yielding novel algorithms and enhancements to existing methodologies. Ensuring currency with the latest innovations and maintaining adaptability to employ the most fitting technique for each application are pivotal for success within the machine learning domain. As machine learning advances, it furnishes indispensable tools for addressing intricate cancer-related endeavors, spanning from predictive models for risk assessment to gauging treatment efficacy.

The comparative analysis furnished herein serves as an initial guidepost for selecting machine learning methodologies, yet sustained exploration and empirical investigation are indispensable for attaining optimal outcomes across diverse applications.

Drawing from the findings of this comparative study, it is evident that extensive research endeavors have been dedicated to lung cancer prediction, culminating in noteworthy outcomes. However, it is imperative to underscore that all models exhibited commendable performance, and further advancements in this realm hold substantial promise for ameliorating mortality rates and facilitating earlier detection of lung cancer.

Further research is needed into the interplay between cutting-edge cancer detection techniques and computational models, with the aim of improving the accuracy of early-stage lung cancer detection.

REFERENCES

1. Abdel R. Omran. "The epidemiologic transition: A theory of the epidemiology of population change". Milbank Mem Fund Q. 1971; 49: 509-538.

2. Gersten O, Wilmoth J.R. "The cancer transition in Japan since 1951." Demogr Res. 2002; 7: 271-306.

3. Kazakhstanskiy farmatsevticheskiy vestnik. URL: https://pharmnewskz.com/ru/article/rak-legkogo-peredovye-resheniya_18263

4. Vital, T., Panduranga. "Data collection, statistical analysis and clustering studies of cancer dataset from viziayanagaram District, AP, India." ICT and critical infrastructure. In: Proceedings of the 48th Annual Convention of Computer Society of India-Vol II. Springer, Cham, (2014)

5. Douglas, P.K., Harris, S., Yuille, A., Cohen. "Performance comparison of machine learning algorithms and number of independent components used in fMRI decoding of belief versus disbelief." Neuroimage 56(2), 544-553 (2011)

6. George Tzanis, Ioannis Katakis and Ioannis Partalas. "Modern Applications of Machine Learning." Department of Informatics, Aristotle University of Thessaloniki, GR-54124.

7. Caichen Li, Huiting Wang and Yu Jiang. "Advances in lung cancer screening and early detection." Cancer Biol Med. 2022 May 15; 19(5): 591-608.

8. International Agency for Research on Cancer 2023. URL: https://shorturl.at/bdjN2

9. S. B. Kotsiantis, D. Kanellopoulos and P. E. Pintelas. "Data Preprocessing for Supervised Leaning" INTERNATIONAL JOURNAL OF COMPUTER SCIENCE VOLUME 1 NUMBER 1 2006 ISSN 1306-442.

10. Wordpress.com. "A Detailed Introduction to K-Nearest Neighbor (KNN) Algorithm" 2010. URL: https://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/

11. Vrushali Y Kulkarni and Dr Pradeep K Sinh. "Random Forest Classifiers: A Survey and Future Research Directions". International Journal of Advanced Computing, ISSN: 2051-0845, Vol. 36, Issue 1

12. Cortes C and Vapnik V. "Support-vector networks." Mach Learn. 1995;20:273-297.

13. https://www.kaggle.com/code/hasibalmuzdadid/lung-cancer-analysis-accuracy-96-4/input?select=survey+lung+cancer.csv
