Scientific article on the topic 'REGRESSION BASED ON DECISION TREE ALGORITHM'

CC BY

Abstract of the scientific article in economics and business. Author: Malikov Azizbek Bobirovich

A decision tree is a tree whose internal nodes can be taken as tests (on input data patterns) and whose leaf nodes can be taken as categories (of these patterns). An input pattern is filtered down through these tests to reach the right output. Decision tree algorithms can be applied in many different fields. They can replace statistical procedures for finding data, extracting text, finding missing data in a class, and improving search engines, and they also have various applications in medicine. Many decision tree algorithms have been formulated, and they differ in accuracy and cost-effectiveness, so it is important to know which algorithm is best to use. I discuss the advantages and disadvantages of using regression methods to analyze data.

REGRESSION BASED ON DECISION TREE ALGORITHM

Malikov A.B.

Malikov Azizbek Bobirovich - Undergraduate, DEPARTMENT OF INFORMATION TECHNOLOGY, BUKHARA STATE UNIVERSITY, BUKHARA, REPUBLIC OF UZBEKISTAN

Keywords: supervised learning, decision tree, regression analysis.

1. Introduction

Predicting the values of numeric or continuous attributes is known as regression in the statistical literature, and it is an active research area. Predicting real values is also an important topic in machine learning. Many of the skills that humans learn in real life, such as sporting abilities, are continuous. Dynamic control is one such problem that is studied in machine learning. For example, learning to catch a ball moving in three-dimensional space is a dynamic control problem studied in robotics. In such applications, machine learning algorithms are used to control robot motions, where the response to be predicted is a numeric or real-valued distance and direction. In this paper, we review current regression techniques developed in machine learning and statistics. After describing the main focus for the development of new techniques, we review the decision tree method in the next section.

2. Decision Tree Algorithm

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression.

Classification decision trees: in this kind of decision tree, the decision variable is categorical. The above decision tree is an example of a classification decision tree.

Regression decision trees: in this kind of decision tree, the decision variable is continuous.

The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
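To make "simple decision rules" and "piecewise constant" concrete, here is a minimal sketch of how a single split of a regression tree could be chosen. It assumes the common squared-error criterion (the paper does not name one), and the function name `best_split` and the toy data are made up for illustration.

```python
import numpy as np


def best_split(x, y):
    """Return (threshold, sse) for the one split that minimises squared error.

    Each side of the split predicts the mean of its targets, so the
    resulting model is piecewise constant with two pieces.
    """
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # equal feature values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            # Place the threshold halfway between the two neighbouring values
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2.0
            best_sse = sse
    return best_threshold, best_sse


# Toy data (hypothetical): two flat levels with a jump between x=3 and x=10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 5.0, 5.0, 20.0, 20.0, 20.0])
threshold, sse = best_split(x, y)
print(f"if x <= {threshold}: predict {y[x <= threshold].mean()}")
print(f"else:          predict {y[x > threshold].mean()}")
```

A full tree applies the same search recursively to each side of the split until a stopping rule such as a maximum depth is reached.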

For instance, in the example below, decision trees learn from data to approximate a sine curve with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the more closely the model fits the training data.

# Import the necessary modules and libraries
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X, y)
regr_2.fit(X, y)

# Predict
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
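The effect of depth in the example above can also be checked numerically instead of visually. The sketch below rebuilds the same dataset and, for each depth, counts the leaves, counts the distinct predicted values (the "steps" of the piecewise constant curve), and computes the training error. The `random_state=0` argument and the `results` dictionary are additions for reproducibility and are not part of the original example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same noisy sine dataset as in the example above
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_clean = np.sin(X_test).ravel()  # noise-free curve, used only for comparison

results = {}
for depth in (2, 5):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    pred = model.predict(X_test)
    results[depth] = {
        "leaves": model.get_n_leaves(),
        # A tree predicts one constant per leaf, so this never exceeds "leaves"
        "distinct_predictions": len(np.unique(pred)),
        "train_mse": float(np.mean((model.predict(X) - y) ** 2)),
        "mse_vs_clean_sine": float(np.mean((pred - y_clean) ** 2)),
    }

for depth, row in results.items():
    print(depth, row)
```

The deeper tree always has a training error at least as low as the shallow one, because it only refines the same partition. Whether it is also closer to the noise-free curve depends on how much of the noise it has memorised.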

2. Important Terminology related to Tree based Algorithms

Let's look at the basic terminology used with Decision trees:

1. Root Node: It represents the entire population or sample, which is further divided into two or more homogeneous sets.

2. Splitting: The process of dividing a node into two or more sub-nodes.

3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.

4. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.

5. Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It is the opposite of splitting.

6. Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.

7. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
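The terms in this list map directly onto the structure of a fitted tree. As a rough illustration, the sketch below fits a small scikit-learn regression tree on made-up sine data and labels each node as root, decision node, or leaf using the `tree_` attribute. The dataset and the feature name `x` are assumptions chosen only for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical data: 40 points on a sine curve
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)
y = np.sin(X).ravel()

model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
tree = model.tree_

# In scikit-learn's array layout, node 0 is the root and
# a leaf is marked by children_left == children_right == -1.
is_leaf = tree.children_left == -1

for node in range(tree.node_count):
    if is_leaf[node]:
        print(f"node {node}: leaf, predicts {tree.value[node][0][0]:.3f} "
              f"for {tree.n_node_samples[node]} training samples")
    else:
        role = "root node" if node == 0 else "decision node"
        print(f"node {node}: {role}, test x <= {tree.threshold[node]:.3f}, "
              f"children = ({tree.children_left[node]}, {tree.children_right[node]})")

# The same tree as nested if-then-else rules
rules = export_text(model, feature_names=["x"])
print(rules)
```

Each parent node lists its two children, and the sub-tree below any decision node is what the list calls a branch.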

These are the terms commonly used for decision trees. Every algorithm has advantages and disadvantages; the important ones for decision trees are listed below.

2.1. Advantages

1. Easy to understand: Decision tree output is easy to understand, even for people from a non-analytical background. No statistical knowledge is required to read and interpret it. Its graphical representation is intuitive, and users can easily relate it to their hypotheses.

2. Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables or features that have better power to predict the target variable (see the article "Trick to enhance power of regression model" for one such trick). For example, when working on a problem with information available in hundreds of variables, a decision tree helps to identify the most significant ones.

3. Less data cleaning required: It requires less data cleaning than some other modeling techniques. To a fair degree, it is not influenced by outliers and missing values.

4. Data type is not a constraint: It can handle both numerical and categorical variables.

5. Non-parametric method: A decision tree is considered a non-parametric method. This means that decision trees make no assumptions about the distribution of the data or the structure of the classifier.
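The data-exploration advantage (item 2) can be sketched with scikit-learn's `feature_importances_` attribute, which reports how much each input variable contributed to reducing the error in the fitted tree. The synthetic dataset below is an assumption made up for illustration: the target depends strongly on column 0, weakly on column 1, and not at all on columns 2 to 4.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
n_samples = 300
X = rng.rand(n_samples, 5)

# Synthetic target (hypothetical): strong effect of column 0, weak effect of column 1
y = 10.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.randn(n_samples)

model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

importances = model.feature_importances_   # normalised to sum to 1
ranking = np.argsort(importances)[::-1]    # most important column first

for column in ranking:
    print(f"column {column}: importance {importances[column]:.3f}")
```

On real data the ranking is only a starting point for exploration: importances computed this way can be unstable when variables are strongly correlated.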

2.2. Disadvantages

1. Overfitting: Overfitting is one of the main practical difficulties for decision tree models. It is addressed by setting constraints on model parameters and by pruning.

2. Not fit for continuous variables: When working with continuous numerical variables, a decision tree loses information when it bins the variables into categories [5].
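The two remedies for overfitting named in item 1, constraints on model parameters and pruning, can be sketched with scikit-learn as follows. The noisy sine data, the train/validation split, and the specific values `max_depth=3`, `min_samples_leaf=10`, and `ccp_alpha=0.01` are arbitrary assumptions for illustration, not recommendations from the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.3 * rng.randn(200)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

candidates = {
    # No limits: the tree grows until every training point is fitted exactly
    "unconstrained": DecisionTreeRegressor(random_state=0),
    # Constraints on model parameters
    "max_depth=3": DecisionTreeRegressor(max_depth=3, random_state=0),
    "min_samples_leaf=10": DecisionTreeRegressor(min_samples_leaf=10, random_state=0),
    # Cost-complexity pruning: grow fully, then remove weak sub-trees
    "ccp_alpha=0.01": DecisionTreeRegressor(ccp_alpha=0.01, random_state=0),
}

report = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    train_mse = float(np.mean((model.predict(X_train) - y_train) ** 2))
    val_mse = float(np.mean((model.predict(X_val) - y_val) ** 2))
    report[name] = (model.get_n_leaves(), train_mse, val_mse)
    print(f"{name:>20}: leaves={report[name][0]:3d} "
          f"train_mse={train_mse:.4f} val_mse={val_mse:.4f}")
```

A training error of zero together with a clearly higher validation error is the typical signature of overfitting. In practice the constraint values would be chosen by cross-validation rather than fixed by hand.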

3. Regression Analysis

Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes with one independent variable when the other independent variables are held fixed. It predicts continuous, real values such as temperature, age, salary, and price.

We can understand the concept of regression analysis using the example below.

Example: Suppose a marketing company A runs various advertisements every year and earns sales from them. The list below shows the advertising spend of the company over the last 5 years and the corresponding sales:

Advertisement    Sales
$90              $1000
$120             $1300
$150             $1800
$100             $1200
$130             $1380
$200             ??

Now the company wants to spend $200 on advertising in 2019 and wants a prediction of the sales for that year. To solve this type of prediction problem in machine learning, we need regression analysis.
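As a rough illustration of how the missing value could be estimated, the sketch below fits a straight line to the five advertisement/sales pairs and evaluates it at $200. It also fits a decision tree regressor to the same pairs for comparison. Both model choices are assumptions made here for illustration; the example itself does not say which regression method the company should use.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# The five known (advertisement, sales) pairs from the list above
advertisement = np.array([90.0, 120.0, 150.0, 100.0, 130.0])
sales = np.array([1000.0, 1300.0, 1800.0, 1200.0, 1380.0])

# Straight-line fit: sales is approximated by slope * advertisement + intercept
slope, intercept = np.polyfit(advertisement, sales, deg=1)
linear_prediction = slope * 200.0 + intercept
print(f"linear fit: sales = {slope:.2f} * advertisement + {intercept:.2f}")
print(f"linear prediction for $200: {linear_prediction:.0f}")

# Decision tree fit on the same data (fully grown, one training point per leaf)
tree = DecisionTreeRegressor(random_state=0)
tree.fit(advertisement.reshape(-1, 1), sales)
tree_prediction = tree.predict(np.array([[200.0]]))[0]
print(f"tree prediction for $200: {tree_prediction:.0f}")
```

The comparison shows a property worth knowing when using trees for regression: $200 lies outside the range of the training data, so the tree returns the value of its outermost leaf (the sales recorded for $150), while the straight line extrapolates the trend.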

Regression is a supervised learning technique that helps in finding the correlation between variables and enables us to predict a continuous output variable from one or more predictor variables. It is mainly used for prediction, forecasting, time series modeling, and determining cause-and-effect relationships between variables.

In regression, we fit a line or curve to the given data points on a graph of the variables; using this fit, the machine learning model can make predictions about the data. In simple words, regression finds a line or curve on the target-predictor graph such that the vertical distance between the data points and the regression line is minimal. The distance between the data points and the line tells whether the model has captured a strong relationship or not.
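One common way to make "the vertical distance is minimal" precise is ordinary least squares, shown here for a straight line. This is an assumption about which criterion is meant; the symbols $a$ (intercept), $b$ (slope), and $(x_i, y_i)$ for the $n$ data points are introduced only for this illustration.

```latex
\min_{a,\,b} \; \sum_{i=1}^{n} \bigl( y_i - (a + b\,x_i) \bigr)^2,
\qquad
\hat{b} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{a} = \bar{y} - \hat{b}\,\bar{x}
```

For the five advertisement/sales pairs in the example, $\bar{x} = 118$ and $\bar{y} = 1336$, which gives $\hat{b} = 27160 / 2280 \approx 11.91$ and $\hat{a} \approx -69.6$. The smaller the remaining sum of squared distances, the better the line describes the data.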

Some examples of regression are:

• Prediction of rain using temperature and other factors

• Determining market trends

• Prediction of road accidents due to rash driving

3.1. Why do we use Regression Analysis?

As mentioned above, regression analysis helps in the prediction of a continuous variable. There are many real-world scenarios where we need predictions about the future, such as weather conditions, sales, and marketing trends. For such cases we need a technique that makes predictions accurately, and regression analysis, a statistical method used in machine learning and data science, is one such technique. Below are some other reasons for using regression analysis:

• Regression estimates the relationship between the target and the independent variable.

• It is used to find trends in data.

• It helps to predict real/continuous values.

• By performing regression, we can determine the most important factor, the least important factor, and how each factor affects the others.

3.2. Types of Regression

There are various types of regression used in data science and machine learning. Each type has its own importance in different scenarios, but at the core, all regression methods analyze the effect of the independent variables on the dependent variable. Some important types of regression are:

• Linear Regression

• Logistic Regression

• Polynomial Regression

• Support Vector Regression

• Decision Tree Regression

• Random Forest Regression

• Ridge Regression

• Lasso Regression

4. Summary and Conclusion

In this article, we have discussed the decision tree algorithm. It is a supervised learning algorithm that can be used for both classification and regression. The primary goal of a decision tree is to split the dataset into a tree based on a set of rules and conditions. Lastly, we discussed the advantages and disadvantages of using decision trees. There is still a lot more to learn, and this article gives a quick start for exploring other regression and classification algorithms.

References

1. [Electronic Resource]. URL: https://www.tutorialspoint.com/machine_learning_with_python/classification_algorithms_decision_tree.htm/ (date of access: 25.04.2022).

2. Abou Eisha H., Amin T., Chikalov I., Hussain S., Moshkov M. Extensions of Dynamic Programming for Combinatorial Optimization and Data Mining. Intelligent Systems Reference Library. Springer. Vol. 146, 2019.

3. Pea-Lei Tu and Jen-Yao Chung. "A New Decision-Tree Classification Algorithm for Machine Learning", Proc. of the 1992 IEEE Int. Conf. on Tools with AI, Arlington, Nov. 1992.

4. Mitchell T. Machine Learning. McGraw Hill, 1997.

5. [Electronic Resource]. URL: https://www.analyticsvidhya.com/blog/2016/04/tree-based-algorithms-complete-tutorial-scratch-in-python/ (date of access: 25.04.2022).

6. [Electronic Resource]. URL: https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm/ (date of access: 25.04.2022).
