UDC 62
Artemov A.
Senior Data Engineer, Schwarz Media Plattform GmbH
Mülheim, Germany
PROGRAMMING LANGUAGES IN DATA ENGINEERING: OVERVIEW, TRENDS AND PRACTICAL APPLICATION
Abstract
The present era is marked by rapid change in all spheres of life, driven by scientific and technological progress and the adoption of new technologies. In this context, mastering programming skills is a key stage in the development of any novice specialist in the field of data processing. To choose a suitable programming language, it is worth looking at the tasks that data specialists solve in everyday practice. Data analysts are technically skilled experts who use mathematical and statistical methods to process, analyze and extract meaning from data. The field of data analysis spans many areas, including machine learning, deep learning, network analysis, natural language processing and geospatial analysis. Data analysts rely on the power of computers to perform their tasks, and programming is the key tool that allows them to interact with computers and pass the necessary instructions to them.
There are a great many programming languages developed for various purposes. In this article, the author considers several programming languages that are relevant in the field of data analysis as of 2023, along with their features and potential.
The methodology of this article is an extensive analysis of scientific publications, articles and research on programming languages in Data Engineering: their overview, trends and practical application. This analysis provides a picture of the current state of the field and its development trends.
Keywords:
programming languages, data engineering, application of programming languages in data engineering, application experience.
Introduction
The volume of generated data continues to grow at an almost predictable pace. According to Seagate UK, by 2025 the global volume of data will reach an astounding 175 zettabytes. This extraordinary amount of data has become an important resource for businesses, which are not only becoming more dependent on data but are also actively finding new ways to use it in their work.
Fig.1 - The growth rate of the volume of generated data according to Seagate UK
Organizations work with data in a wide range of operations, from analyzing current activities and predicting future trends to modeling customer behavior, preventing risks and developing new products. Data engineering and data processing have become cornerstones of all these activities.
The concept of Data Engineering is an indispensable component of modern practice: it means the creation and maintenance of systems capable of collecting, managing, converting and storing data in a form suitable for use, as well as providing access to this data for data scientists, business analysts and other consumers.
The exceptional importance of Data Engineering lies in the fact that it adds value to data and ensures that the data is accessible to all stakeholders. In essence, Data Engineering makes data more valuable and more easily accessible to its consumers.
The main tasks and functions of a Data Engineer are the design, implementation and support of the infrastructure, systems and processes used for processing and storing structured and unstructured data, which ensures high quality and reliability of data flows. One of the key processes that data engineers deal with is the so-called ETL process, an abbreviation of three English terms (a brief code sketch follows the list):
• Extract.
• Transform.
• Load [2].
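To make the three ETL stages concrete, here is a minimal, hypothetical sketch in Python, the language most of this article focuses on; the file name, table name and transformation rule are assumptions made purely for illustration and do not come from the original text.

import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a CSV source (hypothetical file name).
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: normalize values and drop obviously invalid records.
    cleaned = []
    for row in rows:
        name = row["name"].strip().title()
        amount = float(row["amount"])
        if amount >= 0:
            cleaned.append((name, amount))
    return cleaned

def load(records, db_path="warehouse.db"):
    # Load: write the transformed records into a SQLite table.
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", records)

if __name__ == "__main__":
    load(transform(extract("raw_sales.csv")))

In real projects these stages are usually implemented with dedicated libraries or ETL frameworks, but the division of responsibilities stays the same.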
Table 1
Basic skills required by a Data Engineer
Skill name | General characteristics
Principles of software engineering | Data Engineers successfully apply Agile, DevOps and DataOps practices, design sound architectural solutions and develop service-oriented systems.
Command of distributed systems | Programming and architectural design of distributed systems are important competencies of a Data Engineer.
Expertise in open platforms | Data Engineers work with Apache Spark and Hadoop, and often with Hive, MapReduce, Kafka and other similar solutions.
Knowledge of SQL | SQL remains an integral part of working with databases and is a key element of a Data Engineer's work. Knowledge of various database solutions (SQL and NoSQL), ETL/ELT tools and operating systems (Linux, Ubuntu) is likewise an essential part of their skill set.
Knowledge of programming languages | Python has become the main tool in data processing, while Java, although it has lost some of its initial popularity among specialists, still has its place. Scala, used with Apache Spark and Kafka, also has its audience.
Ability to work with the Pandas library | Pandas is an essential tool for Data Engineers for data processing and management.
Data visualization and dashboards | Data Engineers know how to create informative data visualizations and dashboards.
Analytical skills | Data Engineers should have some understanding of statistical analysis and mathematical principles in order to process data correctly and prepare it for final analysis.
Data modeling | Understanding how to structure tables, normalize and denormalize data in storage, and reason about attributes is key for a Data Engineer.
These skills form the basis of a Data Engineer's successful work and make them sought-after specialists in the world of data processing.
I. Data engineering process
Data engineering is a process that involves the sequential execution of tasks that transform vast amounts of raw data into a practical product meeting the needs of analysts, data scientists, machine learning engineers and other specialists. This process usually consists of the series of steps shown in Figure 2.
Fig.2 - Simplified structure of the data engineering process
During the Data Ingestion process, information is transferred from numerous heterogeneous sources, such as SQL and NoSQL databases, IoT devices, websites, streaming services and so on, to a designated system where it is subsequently adapted for analysis. This data arrives in various formats and can be either structured or unstructured.
At the Data Transformation stage, the initially unsystematic data is transformed to meet the requirements of end users. This stage includes identifying and correcting errors, removing duplicates, normalization and conversion to the required format.
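As an illustration of this stage, the following short Python sketch (using the pandas library; the column names and sample values are assumptions made for the example) removes duplicates, corrects obvious problems and converts the data to the required format:

import pandas as pd

# Hypothetical raw input with duplicates, inconsistent text and a missing value.
raw = pd.DataFrame({
    "customer": ["Alice", "alice ", "Bob", "Bob"],
    "order_date": ["2023-01-05", "2023-01-05", "2023-02-10", "2023-02-10"],
    "amount": ["100.0", "100.0", None, "250.5"],
})

cleaned = (
    raw.assign(customer=raw["customer"].str.strip().str.title())  # normalize the text field
       .drop_duplicates()                                          # remove duplicate rows
       .dropna(subset=["amount"])                                  # drop records with missing amounts
       .astype({"amount": "float64"})                              # convert to the required numeric type
)
cleaned["order_date"] = pd.to_datetime(cleaned["order_date"])      # unify the date format
print(cleaned)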
Data Serving delivers the transformed data to end consumers, whether these are business intelligence platforms, dashboards or a team of data analysts.
Data flow orchestration, in turn, provides full transparency in the data engineering process and ensures the successful completion of all tasks. It coordinates and continuously monitors data processing in order to identify and eliminate problems with data quality and accuracy.
The key tool automating the stages of Data Ingestion, Data Transformation and Data Serving is the data pipeline. The data pipeline combines tools and operations that move information from one system to another for subsequent storage and further processing. The development and maintenance of data pipelines is one of the main tasks of data engineers. In addition, they develop scripts to automate repetitive tasks, known as jobs.
Data pipelines are often used for the following purposes (a short code sketch follows the list):
• moving data between different systems or environments, including transferring information from internal company systems to cloud data warehouses;
• processing data and converting it into a format suitable for analysis, business intelligence and machine learning projects;
• integration of data from various IoT systems and devices;
• copying tables and data between databases [3].
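As a simple illustration of such a pipeline, the following hypothetical Python sketch (using pandas and the standard sqlite3 module; the database files, table and column names are assumptions made for the example) copies a table from an operational database to an analytical one with a light transformation along the way:

import sqlite3
import pandas as pd

# Hypothetical source and target databases; the names are illustrative only.
with sqlite3.connect("operational.db") as source, sqlite3.connect("analytics.db") as target:
    # Ingest: pull the raw table from the operational system.
    orders = pd.read_sql("SELECT * FROM orders", source)

    # Transform: keep only completed orders and add a derived column.
    completed = orders[orders["status"] == "completed"].copy()
    completed["amount_eur"] = completed["amount_cents"] / 100.0

    # Serve: write the prepared table into the analytical database.
    completed.to_sql("orders_completed", target, if_exists="replace", index=False)

A production pipeline would add scheduling, monitoring and error handling on top of these steps, which is exactly what orchestration tools are responsible for.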
II. Programming languages used
Python is currently the most popular programming language, with almost endless possibilities in various fields. Its flexible and dynamic nature makes it an ideal choice for data engineering, analysis and maintenance. In the world of data, Python has become one of the key tools needed to create data pipelines, configure statistical models and analyze them thoroughly. Many companies around the world use Python for data processing to gain competitive advantages and a deeper understanding of their data. Python's role in data engineering covers data processing, including reshaping, aggregation, combining data from various sources, performing small-scale ETL, interacting with APIs, and automation.
There are many reasons for Python's popularity. One of its main advantages is its widespread adoption: Python is one of the three most popular programming languages in the world.
Python is a universal programming language that is easy to use and provides a variety of libraries for accessing databases and data storage technologies, and it has become a popular tool for performing ETL tasks. Many teams prefer Python for data engineering over specialized ETL tools because of its versatility and power.
Python is also widely used by machine learning and artificial intelligence teams. It has become a universal language in this area, which facilitates communication between different teams. Python supports popular technologies, such as Apache Airflow, and integrates with libraries for popular tools, such as Apache Spark. If your company uses such tools, knowledge of Python is extremely important.
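Since Apache Airflow is mentioned above, the following minimal, hypothetical DAG sketch shows what an orchestrated ETL pipeline typically looks like in Python; the DAG id, task names and callables are assumptions made for illustration, and the snippet requires the apache-airflow package to be installed:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data from the source system")

def transform():
    print("cleaning and reshaping the extracted data")

def load():
    print("writing the prepared data to the warehouse")

# A daily pipeline with three dependent tasks: extract -> transform -> load.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task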
When Python is compared with Java in the field of Data Engineering, Python owes its popularity mostly to its ease of use, concise syntax and general convenience. It is an ideal choice for data processing, scientific research, big data analysis and machine learning [4].
Scala is a statically typed programming language that is gaining popularity among data engineers thanks to a unique combination of features that make it well suited to working with large amounts of data. Scala's static type system is impressively versatile, allowing a significant amount of information about the program's behavior to be encoded in types. This improves code correctness, which makes Scala especially useful for handling rarely exercised execution paths. Unlike dynamic languages, Scala is able to detect errors at the compilation stage, which helps to avoid errors that would otherwise be hard to track down.
A common criticism of statically typed object-oriented languages such as Java is their verbosity. Scala, being a functional language, uses type inference, which allows the compiler to determine the types of variables automatically. This makes Scala code more concise and readable without compromising type safety: the compiler infers the types of all variables in a function body as long as the types of the arguments and the return value are specified. This elegant approach to type inference greatly simplifies development and makes Scala an attractive choice for data engineers and developers.
However, it should be noted that Scala still lags behind some other languages in library availability, especially in the field of Data Science. Despite efforts to create comparable tools, such as Spark Notebooks and Apache Zeppelin, popular tools such as IPython Notebook in combination with matplotlib remain unsurpassed for data research [5].
Attention should also be paid to Rust, a relatively young programming language introduced in 2010 that quickly gained popularity and by 2022 had reached seventh place among programming languages. This is largely because it is very fast and quite economical in its use of memory. It is also distinguished by high reliability, made possible by a rich ownership and type system that together guarantee memory safety and thread safety. This programming language is worth using in Data Engineering if:
• the tasks you solve depend heavily on speed and high-performance processing of large data sets;
• you need to work with the Apache Arrow data format and libraries for processing data in memory.
However, with all its advantages, it must be remembered that the Rust compiler is very thorough and strict. Rust may therefore be the best choice for data engineering if speed, performance and reliability are your top priorities, if you are working with Apache Arrow, or if security is of paramount importance [6].
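Apache Arrow itself is a language-independent in-memory columnar format, so it can also be illustrated in Python, the language used for the other sketches in this article. The following brief example uses the pyarrow package, which is not mentioned in the original text; the column names and values are assumptions made for illustration.

import pyarrow as pa

# Build an in-memory columnar (Arrow) table; the data is illustrative only.
table = pa.table({
    "sensor_id": [1, 2, 3],
    "reading": [20.5, 21.0, 19.8],
})

print(table.schema)    # column names and types of the columnar layout
print(table.num_rows)  # number of rows held in memory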
Conclusion
In conclusion, we can say that the choice of a programming language in the field of Data Engineering is of great importance for the successful completion of tasks and achievement of goals.
Python, with its flexibility and rich ecosystem of libraries, remains the leading language for working with data. Its simple syntax and powerful tools make it an ideal choice for data processing and analysis, as well as for creating data pipelines.
Scala, though less common, is a statically typed and functional language that has unique features that make it attractive for solving complex problems with large amounts of data. Its static typing system and type inference capabilities make the code safer and more readable.
Thus, in the world of Data Engineering there is no universal programming language that would fit all tasks. Each of the languages considered has its advantages and disadvantages, and the right choice depends on the specific situation.
References
1. Top programming languages for Data Processing Specialists in 2023. [Electronic resource] Access mode: https://www.datacamp.com/blog/top-programming-languages-for-data-scientists-in-2022 (accessed 14.09.2023).
2. Data Engineer: more information about the developed program. [Electronic resource] Access mode: https://gb.ru/blog/data-engineer/ (accessed 14.09.2023).
3. Data Engineering: conversions, forecasts and tools. [Electronic resource] Access mode: https://habr.com/ru/articles/743308/ (accessed 14.09.2023).
4. The role of Python for data development: 4 critical aspects. [Electronic resource] Access mode: https://hevodata.com/learn/python-for-data-engineering/ (accessed 14.09.2023).
5. Scala for data development: using functional programming capabilities. [Electronic resource] Access mode: https://dev.to/davidelvis/scala-for-data-engineering-harnessing-the-power-of-functional-programming-4h24 (accessed 14.09.2023).
6. Rust for Data Engineering—what's the hype about? [Electronic resource] Access mode: https://www.adventofdata.com/rust-for-data-engineering/ (accessed 14.09.2023).
© Artemov A., 2023