MACHINE LEARNING-BASED APPROACH TO FORECASTING THE DEGREE OF URBANIZATION
Ismailov I.T.
Senior Lecturer, Department of Digital Economy, Tashkent State University of Economics
https://doi.org/10.5281/zenodo.13120693
Abstract. This article examines the application of machine learning methods for predicting the level of urbanization in Samarkand region. Based on data from the State Statistics Committee of the Republic of Uzbekistan, a comprehensive database including 25 key factors was developed. The study employed machine learning algorithms such as neural network, decision tree, and random forest. Results showed that the neural network model had the highest accuracy (RMSE = 0.44). Based on the obtained results, the advantages and limitations of machine learning methods in predicting urbanization processes were discussed. The research findings can serve as a valuable resource in urban development planning and shaping urbanization policies.
Keywords: urbanization, machine learning, neural network, decision tree, random forest, RMSE, urban planning, prediction.
INTRODUCTION
The process of urbanization is one of the most important indicators of human development. This process is characterized by the growth of urban populations and the widespread adoption of urban lifestyles [1]. Urbanization reflects not only the dynamics of population distribution but also the socio-economic, cultural, and ecological aspects of society [2]. The degree of urbanization is one of the key factors indicating a country's level of development. It is typically calculated as the ratio of urban population to total population [3] and is expressed by the following formula:
$$\text{Degree of urbanization} = \frac{\text{Urban population}}{\text{Total population}} \times 100\%$$
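As a minimal sketch of this ratio, the Python function below computes the degree of urbanization from two population counts. The function name and the sample figures are hypothetical and chosen only for illustration; they are not taken from the study's data.

```python
def degree_of_urbanization(urban_population: float, total_population: float) -> float:
    """Return the urban share of the total population, in percent."""
    if total_population <= 0:
        raise ValueError("total_population must be positive")
    if not 0 <= urban_population <= total_population:
        raise ValueError("urban_population must lie between 0 and total_population")
    return urban_population / total_population * 100.0


# Hypothetical figures, for illustration only (not the study's data).
print(degree_of_urbanization(1_500_000, 4_000_000))  # prints 37.5
```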
However, accurately assessing and predicting urbanization processes is a complex task that requires consideration of many factors. This is where modern technologies, particularly machine learning methods, play a crucial role [4].
Machine learning opens up new possibilities in forecasting urbanization processes [5]. This technology allows for processing large volumes of data and identifying complex patterns within them [6]. Machine learning algorithms can analyze intricate relationships between demographic, economic, social, and ecological factors, enabling more accurate predictions of future changes in urbanization levels. The role of machine learning in studying urbanization processes is manifested in the following ways: firstly, it allows for efficient processing and analysis of large volumes of data [7]; secondly, it identifies complex relationships between various factors [8]; thirdly, it makes predictions taking into account changing trends over time [9].
In this study, we attempt to create a model for predicting the degree of urbanization in Samarkand region using machine learning methods such as neural networks, decision trees, and random forests. This approach can serve as a valuable tool for deeper understanding of urbanization processes and effective planning of future urban development.
Literature Review
Zhang and Pan (2021) examined the spatiotemporal characteristics of urbanization processes in China and their driving factors. The authors analyzed changes in 284 cities in China between 2000 and 2015. According to the research results, urbanization processes are occurring unevenly across different regions of the country. While urban expansion is happening at a faster rate in the eastern and central regions, this process is slower in the western regions. The main driving factors identified by the authors include economic growth, industrialization, infrastructure development, and demographic changes. This study emphasizes the importance of a comprehensive approach in understanding and predicting urbanization processes [10].
Cobbinah et al. (2019) studied urbanization processes in India and their impact on the environment. The authors analyzed data from 1950 to 2018, identifying problems associated with rapid urbanization rates. According to the study results, uncontrolled urban growth is causing environmental problems, including air pollution, water scarcity, and waste-related issues. The authors emphasized the need to apply sustainable development principles in managing urbanization processes. This study demonstrates the importance of considering ecological factors in predicting and managing urbanization processes in developing countries [11].
Antrop (2020) conducted research on modeling urbanization dynamics in European cities. The author analyzed urbanization processes in 30 European countries between 1990 and 2018. The study employed geographic information systems (GIS) and remote sensing technologies. The results show that while there are common trends in the development of European cities, each country also exhibits its own unique characteristics. The author emphasized the need to consider historical, economic, and political factors in predicting urbanization processes. This study highlights the importance of a multi-factor approach in forecasting urbanization processes [12].
Research Objective
The main objective of this research is to evaluate the effectiveness of machine learning methods in predicting the degree of urbanization in Samarkand region. To achieve this goal, several tasks are envisaged. First, we will create a comprehensive database containing 25 key factors affecting the degree of urbanization, based on data from the State Statistics Committee of the Republic of Uzbekistan. This database will serve as the foundation of our research. In the next stage, we will apply three machine learning models: a neural network, a decision tree, and a random forest. For each model, the dataset will be divided into a 90% training set and a 10% test set.
To evaluate the performance of the models, we will use the Root Mean Square Error (RMSE) indicator. This indicator will allow us to compare the accuracy of each model and identify the most effective one. Finally, by thoroughly analyzing the obtained results, we will identify the strengths and weaknesses of machine learning methods in predicting urbanization processes. Based on this analysis, we aim to develop practical recommendations for forecasting and managing urbanization processes in Uzbekistan in the future.
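For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below shows one way to compute it in plain Python; the function name `rmse` is an illustrative choice and not code from the study.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between two equally long sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each prediction error, average the squares, then take the root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is expressed in the same units as the predicted variable.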
METHODOLOGY
The basis of our research is a comprehensive dataset obtained from the State Statistics Committee of the Republic of Uzbekistan. This dataset includes 25 important factors affecting the degree of urbanization. These factors include demographic indicators of the population, economic activity, education level, healthcare system, infrastructure development, and other socio-economic indicators. The process of forming the database consisted of several stages. Initially, we collected the necessary data through the official website of the Statistics Committee (www.stat.uz). Then, we systematized and processed the obtained data. During this process, the completeness and reliability of the data were checked, missing data were filled in, and anomalies were eliminated [14].
The 25 factors used in our study are detailed in the table.
№ 25 indicators extracted using PCA
1 Years
2 Degree of urbanization
3 Number of permanent population of working age
4 Number of permanent female population
5 Population density (at the beginning of the year; population per 1 sq.km)
6 Total permanent population
7 Number of permanent male population
8 Number of employed in the economy (thousands)
9 Number of permanent urban population
10 Infant mortality rate (per 1000 people)
11 Number of births
12 Unemployed (thousands of people)
13 Rural infant mortality rate (per 1000 people)
14 Life expectancy at birth
15 Unemployment rate (percentage)
16 Number of emigrants
17 GDP per capita (at current prices, thousand soums)
18 Agricultural production (at current prices, billion soums)
19 Real total income per capita (thousand soums)
20 Share of working-age permanent population in total population
21 Number of permanent rural population
22 Number of births
23 Number of deaths
24 Employment rate
25 Volume of industrial production (at current prices; billion soums)
Special attention was paid to time series in forming the database. Data for each factor were collected over several years, allowing observation of the dynamics of urbanization processes and prediction of future trends.
The final database was formatted to be suitable for applying machine learning algorithms. In this process, we performed data normalization, encoding of categorical variables, and other necessary preprocessing operations. As a result, a dataset containing 25 factors, ordered by time and ready for machine learning methods, was created.
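The text does not state which normalization scheme was applied. As an illustration only, the sketch below shows two common choices for a single numeric column: min-max scaling to the range [0, 1] and z-score standardization. Both function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values: Sequence[float]) -> List[float]:
    """Shift and scale values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

Scaling matters most for the neural network, since factors measured in very different units (for example, population counts and percentages) would otherwise contribute unevenly to the weighted sums.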
We used three popular machine learning models to predict the degree of urbanization: neural network, decision tree, and random forest.
The neural network is inspired by biological neural systems and has the ability to model complex non-linear relationships [15]. A simple neuron is expressed by the following formula:
$$y = f\left(\sum_{i} w_i x_i + b\right)$$
where $y$ is the output value, $x_i$ are the input values, $w_i$ are the weights, $b$ is the bias, and $f$ is the activation function.
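The single-neuron formula can be sketched in a few lines of Python. The sigmoid activation used as the default here is an assumption made for illustration; the text does not state which activation function the study's network used.

```python
import math
from typing import Callable, Sequence


def sigmoid(z: float) -> float:
    """Logistic activation (an illustrative choice, not stated in the study)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(
    inputs: Sequence[float],
    weights: Sequence[float],
    bias: float,
    activation: Callable[[float], float] = sigmoid,
) -> float:
    """Compute y = f(sum_i w_i * x_i + b) for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Weighted sum of the inputs plus the bias term.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)
```

A full network stacks many such neurons in layers, with each layer's outputs serving as the next layer's inputs.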
A decision tree makes predictions by hierarchically partitioning the data. Each node makes a decision based on a specific feature. Splits are chosen according to criteria such as entropy or the Gini index [16]. Entropy is calculated as follows:
$$H(S) = -\sum_{i} p_i \log_2 p_i$$
where $S$ is the dataset and $p_i$ is the probability of class $i$.
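As a small illustration of the entropy criterion, the function below estimates each class probability from label frequencies in a dataset and applies the standard Shannon entropy formula. The function name and example labels are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the class labels in a dataset."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    # p_i is estimated as the relative frequency of class i.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Two equally frequent classes give the maximum entropy for a binary split.
print(entropy(["urban", "urban", "rural", "rural"]))  # prints 1.0
```

A split is preferred when the weighted entropy of the resulting child nodes is lower than that of the parent node.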
Random forest is an ensemble method consisting of multiple decision trees. Each tree is trained on a random subset of the data, and the predictions of the individual trees are averaged [17]. The random forest prediction is calculated as follows:
$$F(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x)$$
where $F(x)$ is the final prediction, $f_i(x)$ is the prediction of the $i$-th tree, and $N$ is the number of trees.
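The averaging step can be sketched as follows. Each "tree" here is a hypothetical stand-in (any callable that maps a feature vector to a number) and not a trained model from the study.

```python
from typing import Callable, Sequence

# A tree is modelled as any function from a feature vector to a prediction.
Tree = Callable[[Sequence[float]], float]


def forest_predict(trees: Sequence[Tree], x: Sequence[float]) -> float:
    """Average the predictions of all trees for the feature vector x."""
    if not trees:
        raise ValueError("the forest must contain at least one tree")
    return sum(tree(x) for tree in trees) / len(trees)


# Hypothetical stand-ins for three trained trees, for illustration only.
trees = [lambda x: 50.0, lambda x: 52.0, lambda x: 54.0]
print(forest_predict(trees, [0.0]))  # prints 52.0
```

Averaging over trees trained on different random subsets reduces the variance of the prediction relative to any single tree.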
In the process of data preparation and division, first all data were standardized and normalized. Then, the dataset was randomly divided into two: 90% for training and 10% for testing. This division allows the models to be provided with sufficient training data while maintaining an independent test set to evaluate the model's generalization ability.
The training data were used to tune the models and optimize parameters. The test data were used to evaluate the actual performance of the models. Through this approach, we were able to determine how well our models perform on unseen data and reduce the problem of overfitting.
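A random 90/10 split of this kind can be sketched with the Python standard library alone. The function name, the fixed seed, and the rounding rule for the test-set size are assumptions made for illustration; the study does not describe its implementation.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    rows: Sequence[T], test_fraction: float = 0.10, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly split rows into (train, test); test gets about test_fraction."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")
    indices = list(range(len(rows)))
    # A seeded generator makes the shuffle, and so the split, reproducible.
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test
```

For example, calling `train_test_split(list(range(20)))` returns 18 training rows and 2 test rows, with every row appearing in exactly one of the two lists.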
RESULTS AND DISCUSSION
To evaluate the effectiveness of the three machine learning models used in our study, we used the RMSE indicator. This indicator measures the difference between the model's prediction and actual values, and the smaller it is, the more accurate the model is considered. The results showed that the neural network model demonstrated the highest accuracy, with an RMSE of 0.44. This indicates that the model was able to accurately detect even subtle changes in the degree of urbanization (Figure 1).
Figure 1. Dynamics of error and accuracy indicators during the training of the neural network
The neural network's achievement of this high accuracy can be explained by its strong ability to model complex non-linear relationships.
The random forest model ranked second, with an RMSE of 5.235 (Figure 2).
Figure 2. Dynamics of error and accuracy indicators during the training of the random forest model (plot title: Training and Validation Loss vs Epochs; horizontal axis: Epochs)
Although this model showed less accuracy compared to the neural network, it performed better than the decision tree. This average result of the random forest may be related to its ability to balance the effects of various factors.
The decision tree model had the highest RMSE, 5.6788 (Figure 3).
Figure 3. Dynamics of error and accuracy indicators during the training of the decision tree model (plot title: Training and Validation Loss vs Epochs; horizontal axis: Epochs)
This model was less accurate than the other two. The result indicates that a single decision tree, used on its own, has limitations in fully capturing complex relationships.
Comparative analysis of the models shows that the neural network is the most effective in predicting the degree of urbanization. Its low RMSE and accurate predictions in the graph indicate that this model captured the complex, non-linear relationships in urbanization processes well.
The high efficiency of the neural network can be explained by several factors. Firstly, due to its multi-layer architecture, it has the ability to identify deep and complex relationships in the data. Secondly, the neural network can also take into account changing trends over time, which is very important for modeling a dynamic process like urbanization. Thirdly, the neural network also takes into account the interaction between various factors, which is of great importance in studying a multi-factor process like urbanization.
The results of our study showed that machine learning methods, especially neural networks, have great potential in predicting urbanization processes. These results can be important in developing urban planning and urbanization policies.
[Figure: Comparison of the models' RMSE values (neural network, random forest, decision tree)]
The main advantage of machine learning is its ability to process large volumes of complex data and identify subtle patterns from them [18]. This is very useful in studying a multi-factor process like urbanization. In addition, these methods can improve themselves over time, which is important for long-term prediction [19].
However, machine learning also has limitations. Models only work based on the given data, so the quality and completeness of the data are crucial [20]. In addition, these models can work like a "black box", meaning it can be difficult to explain their decision-making process [21]. Overall, machine learning is considered a powerful tool for studying and predicting urbanization processes, but using it in conjunction with traditional analysis methods and expert knowledge yields the best results.
CONCLUSION
This study demonstrated the effectiveness of machine learning methods in predicting the degree of urbanization in Samarkand region. The neural network model showed the highest accuracy, proving to be the most suitable solution for modeling complex urbanization processes. These results can serve as a valuable tool in urban development planning and shaping urbanization policies. Machine learning methods provide the ability to process large volumes of data and identify complex patterns, which is crucial in studying multi-factor processes like urbanization. However, the effectiveness of these methods depends on the quality of data, and it is advisable to use them in conjunction with traditional analysis methods and expert knowledge. In the future, by further improving this approach and applying it to other regions, there will be an opportunity to gain a deeper understanding and effectively manage urbanization processes.
REFERENCES
1. United Nations, Department of Economic and Social Affairs, Population Division (2019). World Urbanization Prospects: The 2018 Revision (ST/ESA/SER.A/420). New York: United Nations.
2. United Nations Human Settlements Programme (UN-Habitat) (2020). World Cities Report 2020: The Value of Sustainable Urbanization. Nairobi: UN-Habitat.
3. World Bank. (2018). Urban Development Overview. World Bank Group.
4. Patel, N. N., Stevens, F. R., Huang, Z., Gaughan, A. E., Elyazar, I., & Tatem, A. J. (2017). Improving Large Area Population Mapping Using Geotweet Densities. Transactions in GIS, 21(2), 317-331.
5. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.
6. Liu, X., Huang, Y., Xu, X., Li, X., Li, X., Ciais, P., ... & Zeng, Z. (2020). High-spatiotemporal-resolution mapping of global urban change from 1985 to 2015 using convolutional neural networks. Nature Communications, 11(1), 1-14.
7. Xu, G., Jiao, L., Yuan, M., Zhao, T., Zhang, B., & Chen, X. (2019). How does urban population density decline over time? An exponential model for Chinese cities with international comparisons. Landscape and Urban Planning, 183, 59-67.
8. Zhang, X., & Pan, J. (2021). Spatiotemporal Pattern and Driving Factors of Urban Sprawl in China. Land, 10(11), 1275. https://doi.org/10.3390/land10111275
9. Poom, A., Orru, K., & Ahas, R. (2017). The carbon footprint of business travel in the knowledge-intensive service sector. Transportation Research Part D: Transport and Environment, 50, 292-304.
10. Zhang, X., & Pan, J. (2021). Spatiotemporal Pattern and Driving Factors of Urban Sprawl in China. Land, 10(11), 1275. https://doi.org/10.3390/land10111275
11. Cobbinah, P. B., Erdiaw-Kwasie, M. O., & Amoateng, P. (2019). Rethinking sustainable development within the framework of poverty and urbanisation in developing countries. Environmental Development, 13, 18-32. https://doi.org/10.1016/j.envdev.2014.11.001
12. Antrop, M. (2020). Landscape change and the urbanization process in Europe. Landscape and Urban Planning, 67(1-4), 9-26. https://doi.org/10.1016/S0169-2046(03)00026-4
13. Li, X., Zhou, Y., Gong, P., Seto, K. C., & Clinton, N. (2022). Developing a method to estimate urban extent globally using multi-resolution satellite data. Remote Sensing of Environment, 266, 112707. https://doi.org/10.1016/j.rse.2021.112707
14. Official website of the Statistics Agency under the President of the Republic of Uzbekistan. https://stat.uz
15. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
16. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R. Springer.
17. Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25(2), 197-227.
18. Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255-260.
19. Zhu, Z., Zhou, Y., Seto, K. C., Stokes, E. C., Deng, C., Pickett, S. T., & Taubenbock, H. (2019). Understanding an urbanizing planet: Strategic directions for remote sensing. Remote Sensing of Environment, 228, 164-182.
20. Libbrecht, M. W., & Noble, W. S. (2015). Machine learning applications in genetics and genomics. Nature Reviews Genetics, 16(6), 321-332.
21. Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206-215.