
EFFICIENT STREAM DATA PROCESSING FOR BACKEND MODEL TRAINING

I.A. Kuznetcov, bachelor's degree, Udmurt State University (Russia, Izhevsk)

DOI:10.24412/2500-1000-2024-11-2-223-228

Abstract. This paper examines approaches to stream data processing (SDP) for training machine learning (ML) models on the backend. Architectural solutions, including Lambda and Kappa architectures as well as microservice approaches, are explored with a focus on their advantages and limitations under modern conditions. Tools such as Apache Kafka, Apache Flink, and Apache Spark Streaming are analyzed, emphasizing their applicability to various data processing tasks. Special attention is given to performance optimization methods, including the use of online learning and incremental learning algorithms, data compression, efficient serialization, and resource management. The article presents examples of technology implementation demonstrating their practical value.

Keywords: stream data processing (SDP), architectural solutions, performance optimization, machine learning (ML), microservice architecture.

In the era of digitalization, stream data processing (SDP) plays a crucial role in building intelligent systems and making real-time decisions. Data streams come from a wide range of sources and require timely processing before they can be used in machine learning (ML) models. At the backend level, tasks such as data collection, preprocessing, filtering, and transmission are handled, which makes effective stream management especially important for overall system performance.

However, working with stream data comes with a set of significant challenges. In addition, training ML models requires maintaining a balance between data volume and data relevance, which implies effective resource management and adaptive architectural solutions. The aim of this paper is to explore methods and architectural solutions for optimizing SDP.

Main part. Characteristics of stream data

Stream data is a continuous flow of information arriving in real time from various sources. Unlike static datasets, which are fixed at a specific moment in time and processed under conditions of known volume and structure, stream data is characterized by high velocity, unpredictable volume, and the need for real-time processing. These features impose unique requirements on data processing systems, making their development and operation complex tasks.

One of the most important problems is the high data rate. Information sources, such as Internet of Things (IoT) sensors or video surveillance systems, can generate a huge amount of data in very short periods of time. This requires data processing systems to scale both vertically and horizontally to handle increasing loads. Otherwise, delays may arise, rendering the processing inefficient [1]. The limitations of computational resources exacerbate the problem, as data streams often exceed the capabilities of traditional storage and processing systems. Ingesting all incoming data without prior filtering or compression may lead to memory and processor overloads. For instance, systems processing data from thousands of sensors may run out of main memory for storing the current state of all streams, resulting in failures.
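
As an illustration of bounding memory before ingestion, the following sketch keeps only a fixed-size uniform sample of an unbounded stream (reservoir sampling); the algorithm is a standard one rather than taken from the source, and the event source and sample size are hypothetical.

import random

def reservoir_sample(stream, k=10_000):
    # Keep a uniform random sample of at most k events from an unbounded stream,
    # using O(k) memory regardless of how many events arrive.
    reservoir = []
    for i, event in enumerate(stream):
        if i < k:
            reservoir.append(event)
        else:
            j = random.randint(0, i)   # replace an element with probability k / (i + 1)
            if j < k:
                reservoir[j] = event
    return reservoir

A sample produced this way can feed periodic model updates without holding the full stream in memory.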

The need for real-time data processing introduces additional complexities. For many applications, such as financial systems, monitoring systems, and predictive analytics, processing delays can devalue the data. For example, in a fraud detection system, delayed analysis can lead to unauthorized transactions, while in monitoring systems important events may go unnoticed. Maintaining low latency requires optimizing algorithms and infrastructure, which is not always feasible with limited resources. Data loss is another issue that can arise due to the high velocity of information or equipment failures. In streaming systems, ensuring data delivery from source to processing is a challenging task, and losses can significantly impact outcomes.

Processing delays are also common in streaming systems. They may be caused by network congestion, computational bottlenecks, or storage access issues. In systems where temporal synchronization is crucial, such delays can lead to incorrect results, affecting model training or analytical accuracy. Improper data processing is another issue related to the variability and complexity of data structures. For example, data from different sources may contain errors, duplicates, or invalid values. If such data is not handled correctly, it can lead to incorrect conclusions or cause models to be trained on distorted data, negatively affecting their performance.

These challenges collectively highlight the complexity of working with SDP and the need for specialized architectures and algorithms. Modern solutions must ensure high performance, reliability, and flexibility in data processing to meet the demands of dynamically evolving applications and systems.

Architectural approaches to SDP

SDP is a central task in modern systems that work with real-time data. Effective processing requires architectural approaches that ensure performance, reliability, and scalability. Several architectural models address different aspects of data processing; among the most popular are the Lambda and Kappa architectures (fig. 1).

Fig. 1. Comparison of the Lambda and Kappa architectures [2]

The Lambda architecture is a hybrid approach that divides data processing into two layers: the real-time layer and the batch layer. Batch processing is applied to historical data, allowing for accurate and stable results, while stream processing ensures minimal latency for real-time data. The results are merged through a serving layer, providing access to the processed data. The Kappa architecture focuses exclusively on stream-based technologies, eliminating batch processing. All data is processed in real time, with historical data being reprocessed through event replay. This approach simplifies the system while maintaining its performance and adaptability.

In modern contexts, the Lambda architecture is criticized for its complexity and the need to maintain two parallel pipelines, which increases development and operational costs. With advancements in stream processing technologies, the Kappa architecture has become more favorable due to its simplicity and focus on stream processing.
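
A minimal sketch of the Kappa-style event replay described above, assuming the kafka-python client and illustrative topic, broker, and group names: historical results are rebuilt simply by re-reading the retained event log with a fresh consumer group.

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                        # illustrative topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",    # start from the oldest retained record
    enable_auto_commit=False,        # do not advance committed offsets during replay
    group_id="replay-job-1",         # a fresh group id forces a full re-read
)

for record in consumer:
    reprocess(record.value)          # reprocess() stands in for the streaming job's logic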

Another popular architectural approach to SDP is the microservice architecture. This approach decomposes data processing tasks into small, independent modules (microservices), each solving a specific data-related task and interacting with other components via application programming interfaces (APIs). Containerization tools (Docker) and orchestration platforms (Kubernetes) play a significant role in managing such systems.

The choice of an architectural approach depends on the specific tasks and system requirements. The Lambda architecture is suitable for complex systems requiring a combination of precise analytics and rapid response. The Kappa architecture is simpler and more efficient for tasks that require only real-time data processing. The microservice architecture offers modularity and flexibility, making it a universal solution for a wide range of tasks. Each architecture can be adapted to specific conditions, ensuring an optimal balance between performance and complexity.

Tools and technologies for SDP

In an era of increasing data volumes and the growing demand for real-time processing, the development and use of efficient tools for SDP have become a priority. Modern solutions enable real-time data analysis, minimizing delays and ensuring high system performance. Among the most used tools are Apache Kafka, Apache Flink, and Apache Spark Streaming. These technologies are applied in various scenarios, including analytical systems, monitoring, data integration, and training ML models.

Apache Kafka is a distributed messaging platform designed for transmitting events between system components. Its main advantage lies in its high performance and scalability. Kafka can process millions of messages per second while maintaining low latency. Thanks to built-in data replication, it ensures a high level of reliability and fault tolerance. However, one limitation of Apache Kafka is that it is oriented toward data transportation and does not provide built-in tools for analysis. Performing analytics requires integration with other tools, increasing the complexity of configuration and management.
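
A minimal producer sketch, assuming the kafka-python client; the broker address, topic name, and message fields are illustrative. It shows the transport role Kafka plays: events are serialized, compressed, and replicated, while any analysis happens in downstream consumers.

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",                      # wait for replication to in-sync replicas
    compression_type="gzip",         # compress batches on the wire
)

producer.send("sensor-readings", {"sensor_id": 42, "temperature": 21.7})
producer.flush()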

Apache Flink provides capabilities for both stream and batch data processing. This platform is notable for its minimal latency in event processing and its support for complex computations, including state management and ML. One of Flink's primary advantages is its versatility and ability to handle real-time analytics tasks efficiently. However, using Flink involves high computational resource requirements, making it less accessible for smaller-scale applications. Moreover, configuring and integrating Flink with existing infrastructure can be complex and resource-intensive.
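
A minimal sketch of a keyed, stateful Flink job using the PyFlink DataStream API; the bounded in-memory source stands in for a real connector (e.g., Kafka), and all names are illustrative.

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Toy bounded source standing in for a real stream connector.
readings = env.from_collection([("sensor-1", 21.7), ("sensor-2", 19.3), ("sensor-1", 22.4)])

# Keyed, stateful aggregation: running maximum temperature per sensor.
(readings
    .key_by(lambda r: r[0])
    .reduce(lambda a, b: (a[0], max(a[1], b[1])))
    .print())

env.execute("max-temperature-per-sensor")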

Apache Spark Streaming, as an extension of the Apache Spark platform, offers a solution for SDP through the concept of micro-batches. This approach enables seamless integration of stream and batch analytics, making Spark Streaming especially useful for tasks requiring ML methods or complex analytical tools. However, the micro-batch processing model results in small delays, which can be important for applications with strict real-time requirements. Furthermore, it demands significant computational resources to handle large data volumes, potentially limiting its use in resource-constrained environments.
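
The micro-batch model can be sketched with Structured Streaming, the current streaming API of Apache Spark; the built-in rate source and the five-second trigger below are illustrative stand-ins for a real feed and a production batch interval.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("microbatch-demo").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows for testing.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Aggregate each micro-batch into 10-second event-time windows.
counts = events.groupBy(F.window("timestamp", "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="5 seconds")   # micro-batch interval
         .start())
query.awaitTermination()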

A comparison of these technologies based on performance, scalability, and ease of integration highlights the unique advantages of each tool (table 1).

Table 1. Comparative analysis of tools and technologies for SDP [3, 4]

Characteristic | Apache Kafka | Apache Flink | Apache Spark Streaming

Efficiency | High efficiency in data transmission due to its distributed message processing architecture. | High efficiency in processing large data volumes, supported by integration with advanced analytics tools. | Moderate efficiency due to the use of micro-batches, suitable for stream processing.

Scalability | Very high scalability; supports horizontal scaling through cluster-based architecture. | High scalability with dynamic resource balancing for complex analytical tasks. | High scalability, but requires significant computational resources as the load increases.

Reliability | High reliability ensured through mechanisms like data replication and log-based storage. | Reliability is supported by checkpoints and a robust recovery system in case of failures. | Reliable system with built-in error handling and recovery mechanisms.

Ease of integration | Easy integration with other systems due to a large number of ready-to-use connectors. | Integration is possible but requires more complex configuration, especially for non-standard use cases. | Seamless integration with other Apache products, including Hadoop and HDFS.

Analytics support | Limited; focused mainly on fast message transmission, analytics require external tools. | Extensive analytics capabilities with a rich set of built-in stream processing features. | Advanced analytics support, especially for real-time data streams.

Latency | Minimal latency, making it an optimal choice for critical message delivery applications. | Minimal latency, suitable for tasks requiring high data accuracy. | Moderate latency, which might be a disadvantage for applications with strict response time requirements.

In the author's opinion, the choice of the appropriate tool depends on the specifics of the task. Apache Kafka excels at data transportation tasks, ensuring reliability and scalability. Apache Flink offers advanced stream analytics capabilities with minimal latency, suitable for complex computations. Apache Spark Streaming is effective for tasks that combine stream and batch analytics. The combination of these technologies enables the creation of complex and high-performance systems that meet the diverse requirements of data processing scenarios.

Algorithms and approaches to SDP optimization

SDP is a highly complex task that requires efficient algorithms and strategies to ensure minimal latency, reliability, and scalability. Implementing ML algorithms and optimizing data processing workflows are crucial aspects of building high-performance systems.

The online learning methodology is one of the most applicable approaches to ML tasks on stream data. This method allows models to update in real time as new data arrives, eliminating the need to retrain the model from scratch on the entire dataset. This is especially relevant in scenarios where data arrives at high speed and the model must adapt to changing conditions, such as in recommendation systems or predictive analytics. The main advantage of online learning is its integration with streaming platforms like Apache Flink, which simplifies the implementation of complex algorithms. However, one challenge of this approach is selecting model parameters that remain optimal under constantly changing data conditions.
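
A minimal sketch of online learning with scikit-learn's partial_fit, which is one common way to implement the per-batch updates described above (the source does not name a specific library); event_batches is a hypothetical iterator over (features, labels) chunks of the stream.

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])                 # all class labels must be declared up front

for X_batch, y_batch in event_batches:     # hypothetical stream of (features, labels) chunks
    model.partial_fit(X_batch, y_batch, classes=classes)
    # After each call the model is immediately usable for predictions,
    # without retraining on the full historical dataset.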

The incremental learning approach advances the principles of online learning by gradually updating the model through the integration of new data into the existing structure. This enables consideration of historical data within the context of new arrivals, which is particularly important for tasks requiring seasonal or long-term trend analysis, such as time series forecasting. This method minimizes storage and processing costs while reducing the computational load [5].
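
A per-event variant of the same idea, assuming the third-party river library (not mentioned in the source): each arriving observation updates both the feature scaler and the classifier in place, so historical data never has to be revisited.

from river import linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
accuracy = metrics.Accuracy()

for x, y in labelled_events:          # hypothetical stream of (feature-dict, label) pairs
    y_pred = model.predict_one(x)     # predict first ("test-then-train" evaluation)
    accuracy.update(y, y_pred)
    model.learn_one(x, y)             # then update the model state in place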

Optimizing real-time data processing is essential to reduce latency and increase system throughput. One approach involves compressing data as it is received. Technologies like gzip or Snappy reduce the volume of data transferred between system nodes, lowering network bandwidth and storage requirements. However, the compression and decompression processes demand additional computational resources, necessitating a balance between data volume reduction and processing overhead.
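
A small illustration of this trade-off using only the Python standard library; the record contents are invented for the example.

import gzip
import json

record = {"sensor_id": 42, "ts": 1731600000, "values": [21.7] * 512}

raw = json.dumps(record).encode("utf-8")
compressed = gzip.compress(raw)              # smaller payload on the wire and in storage
restored = json.loads(gzip.decompress(compressed))

# The size reduction is paid for with extra CPU time on both ends.
print(len(raw), len(compressed))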

Another aspect of optimization is efficient data serialization. Formats such as Avro or Protocol Buffers provide compact data representation and fast serialization and deserialization. Both formats are schema-based, making them suitable for data whose structure changes over time. For example, in high-load systems where complex data structures need to be transmitted, using Avro can reduce transmission delays and ensure data consistency. Additionally, their integration with popular streaming platforms simplifies their application.
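
A schema-based serialization sketch, assuming the fastavro package (one of several Avro implementations; the source names only the format): the schema yields a compact binary encoding and a contract that both producer and consumer can validate against.

import io
from fastavro import parse_schema, schemaless_reader, schemaless_writer

schema = parse_schema({
    "name": "SensorReading",
    "type": "record",
    "fields": [
        {"name": "sensor_id", "type": "int"},
        {"name": "temperature", "type": "float"},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, schema, {"sensor_id": 42, "temperature": 21.7})
payload = buf.getvalue()                                  # compact bytes sent between nodes

reading = schemaless_reader(io.BytesIO(payload), schema)  # decoded back into a dict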

Resource management, including vertical and horizontal scaling, is integral to SDP optimization. Vertical scaling, which involves enhancing the performance of a single node by adding processors or memory, is effective for tasks requiring high performance from one node. However, this approach is limited by hardware capabilities and can be costly [6]. Horizontal scaling, on the other hand, involves adding new nodes to handle increasing data volumes. This approach offers greater flexibility and fault tolerance, making it the preferred solution for most modern distributed systems.
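
As one concrete mechanism for horizontal scaling in streaming systems, the sketch below (again assuming kafka-python and illustrative names) relies on Kafka consumer groups: launching the same script on additional nodes makes the broker redistribute the topic's partitions across the new consumers, so throughput grows with the number of nodes.

from kafka import KafkaConsumer

# Every instance of this script joins the same group; Kafka assigns each instance
# a disjoint subset of the topic's partitions.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    group_id="feature-extractor",
)

for record in consumer:
    process(record.value)            # process() stands in for the per-node workload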

Thus, the implementation of ML algorithms and the optimization of data processing workflows form the foundation of high-performance SDP systems. These methods strike a balance between performance, reliability, and resource efficiency, making them essential components of modern real-time data processing architectures.

In recent years, there has been active implementation of SDP optimization algorithms and methods across various industries. These technologies enable the efficient analysis of large volumes of real-time information, enhancing performance and service quality. JPMorgan Chase actively employs streaming data processing technologies for transaction monitoring and fraud prevention. The bank has implemented a platform based on Apache Kafka and Apache Flink, enabling real-time processing of millions of transactions. These technologies help the bank promptly identify suspicious activities and prevent financial losses. Streaming data processing has also improved risk management and enhanced the efficiency of capital management systems [7].

Walmart uses Apache Kafka to process stream data related to customer purchases and behavior. This system allows real-time data collection and analysis, helping personalize customer recommendations and improve inventory management. These technologies enable Walmart to optimize supply chains, reduce costs, and enhance customer satisfaction by offering products that best match their needs [8]. Thus, the adoption of SDP optimization algorithms and methods across various industries demonstrates significant benefits, including improved efficiency, cost reduction, and enhanced service quality.

Conclusion

Efficient SDP plays a vital role in modern information systems, ensuring timely and reliable transmission, filtering, and analysis of real-time information. Stream data is characterized by a high rate of arrival and variable structure, which requires the use of specialized architectural approaches such as the Lambda and Kappa architectures, as well as microservice organization of systems. Technologies like Apache Kafka, Apache Flink, and Apache Spark Streaming have proven effective in addressing streaming challenges, providing tools for low latency, scalability, and data integration. These capabilities make them indispensable for important applications, including financial systems, IoT monitoring, and retail.

Algorithms such as online learning and incremental learning, alongside optimization methods like data compression and efficient serialization, ensure the adaptability of ML models and reduce data processing costs. Real-world examples demonstrate the practical benefits of these technologies in fraud prevention and enhancing user experience. Modern SDP methods lay the foundation for intelligent systems capable of handling increasing data volumes, improving business efficiency, and supporting the growth of the digital economy.

References

1. Hsu K. Big data analysis and optimization and platform components // Journal of King Saud University-Science. - 2022. - Vol. 34. - № 4. - P. 101945.

2. Zhao M., Agarwal N., Basant A., Gedik B., Pan S., Ozdal M., Pol P. Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product // Proceedings of the 49th annual international symposium on computer architecture. - 2022. - P. 1042-1057.

3. Shojaee Rad Z., Ghobaei-Arani M. Data pipeline approaches in serverless computing: a taxonomy, review, and research trends // Journal of Big Data. - 2024. - Vol. 11. - № 1. - P. 1-42.

4. Aluev A. Scalable web applications: a cost-effectiveness study using microservice architecture // Cold Science. - 2024. - № 8. - P. 32-38.

5. Haryani D. Enhancing Mobile App User Experience: A Deep Learning Approach for System Design and Optimization. - 2024.

6. Sidorov D. Cross-browser compatibility issues and solutions in web development // ISJ Theoretical & Applied Science. - 2024. - Vol. 139. - № 11. - P. 18-21.

7. Kumar P., Gowda D. Y., Prakash A. M. Machine Learning in Cybersecurity: A Comprehensive Survey of Data Breach Detection, Cyber-Attack Prevention, and Fraud Detection // Pioneering Smart Healthcare 5.0 with IoT, Federated Learning, and Cloud Security. - 2024. - P. 175-197.

8. Shastry K. A., Manjunatha B. A. Intelligent Analytics in Big Data and Cloud: Big Data; Analytics; Cloud // Intelligent Analytics for Industry 4.0 Applications. - CRC Press, 2023. - P. 85-112.

