
УДК 004.896

Samigulina Z.I.

PhD

Kazakh-British Technical University (Almaty, Kazakhstan)

Baikadamova S.S.

Kazakh-British Technical University (Almaty, Kazakhstan)

THE INFLUENCE OF DATA SAMPLING ON SOLVING THE PROBLEM OF PATTERN RECOGNITION FOR DIAGNOSTICS OF INDUSTRIAL EQUIPMENT

Abstract: with the sophisticated technology that modern industrial organizations are equipped with, state prediction and diagnostics are essential duties. The current research aims to develop a more accurate modified artificial intelligence system for industrial equipment diagnostics in the oil and gas industry. Researching faulty signals and processing methods utilized by equipment in the oil and gas industry, as well as assessing the advantages and disadvantages of different signal extraction strategies, is the first step in the process. The second is the application of artificial intelligence to decision-making and equipment defect detection. This method is widely used by the oil and gas sectors to lower equipment failure rates. The recommended diagnostic system helps organizations reduce the financial risks associated with equipment defects by increasing production dependability, enabling maintenance planning, predicting probable failures, and expediting equipment repairs.

The article is devoted to the study of the influence of data sampling on the classifier's predictive ability in diagnosing industrial equipment. Various types of data samples were considered: simple random sample, cluster sample, and systematic sample. Based on the listed data samples, classifiers were built using particle swarm optimization and ensemble models (bagging and voting types). The best results were achieved using the systematically sampled dataset and an ensemble modeling strategy with voting, which combines forecasts from neural net, gradient boosted trees, and naive Bayes models: accuracy 93.6%, classification error 8%, recall 94.32%, precision 93.87%.

Keywords: industrial equipment diagnostics, data sampling, simple random sample, cluster sample, systematic sample, particle swarm optimization, ensemble methods.

Introduction.

These days, automation control systems are expensive, high-tech devices with complex designs. The reliable operation of the complete automation system and the facility's high-quality production are ensured by the timely diagnosis and repair of such equipment. The hourly downtime of an oil refinery is measured in millions of tenge. The development of equipment diagnostic techniques is crucial in order to minimize the number of situations that lead to facility downtime. Furthermore, by employing intelligent diagnostic procedures, human operators can delegate some monotonous tasks to a decision support system and free up more time for other tasks.

High-quality and timely diagnostics of a complex of technological tools are desperately needed to ensure the safety of the industrial process and the workforce. Artificial immune systems (AIS) represent a viable approach with applications across numerous fields of research: they are a bioinspired AI approach based on theoretical immunology. Currently, equipment status assessment and defect detection are performed via the AIS method. Modern production automation depends on complex, costly equipment that takes a long time to install and maintain. World-class manufacturers of technical equipment, such as Schneider Electric, Siemens, Honeywell, and others, use their own models for equipment diagnostics, keeping in mind the basic diagnostic features of their products.

Literature review.

The stable and safe operation of mechanical equipment is becoming increasingly significant in modern industry, which aims to reduce unnecessary routine shutdowns, maintenance costs, and even casualties [1]. Thus, more and more attention has been paid to fault diagnosis of machinery. Meanwhile, with the rapid development of the Internet of Things, sensing technology, and big data, a new revolution is quietly sprouting up in this field, a major feature of which is the ever-increasing mass of data [2].

In general, the majority of industrial equipment lacks built-in self-diagnostics and prognostics capabilities. In addition, the nominal characteristics, along with the failure modes, change over time due to wear, maintenance, and the repair/replacement of parts and components. This reality calls for alternative approaches that can minimize the need for analysis of specific machine failure modes [3]. Fault diagnosis approaches can be classified into three categories: model-based [4, 5], signal-based [6-8], and data-driven approaches [9-11]. The model-based approach focuses on establishing mathematical models of complex industrial systems. These models can be constructed by various identification methods, physical principles, etc. The signal-based approach uses detected signals to diagnose possible abnormalities and faults by comparing them with prior information about normal industrial systems [12]. Usually, difficulty arises in building accurate mathematical models and obtaining accurate signal patterns for complex industrial and process systems. The data-driven fault diagnosis approach requires a large amount of historical data rather than models or signal patterns [13]. Therefore, data-driven methods are suitable for complex industrial systems.

Over the past few years, there has been substantial advancement in applying AI algorithms for diagnostics. The oil and gas business can now save time and money thanks to the adoption of artificial intelligence techniques. As Andrey Ostroukh, Leonid Berner, and Maria Karelina state in their work [14], machine learning, through the adoption of appropriate neural algorithms, has advanced this field and plays a crucial role in diagnosing machinery and correctly forecasting results. Condition monitoring and fault diagnostic systems are crucial for lowering the likelihood that such equipment may malfunction. In work [15], Stefania Santini and Francesco Flammini (Mälardalen University, Sweden, and University of Naples Federico II, Italy) conducted defect diagnostics in rotating machinery using artificial intelligence, signal processing, and permutation entropy.

The collection of signals is only the initial phase of condition monitoring and maintenance. In order to identify fault formation, it is also important to analyze the gathered monitoring data and extract features.

Finally, artificial intelligence models and methodologies are utilized to detect and forecast defects based on the retrieved information. Traditional, contemporary, and intelligent diagnosis techniques all fall under the category of signal processing technologies [16].

The obtained results require further study at the level of management decisions on the state of the equipment for current production [17]. These days, such solutions are frequently linked to a project-oriented approach to managing an industrial organization. In this context, various techniques for developing the classifier were explored according to their pros and cons in the theoretical analysis provided. For example, the ensemble voting method provides numerous benefits within industrial equipment diagnostic systems [18-19]: increased precision, since combining multiple models through collective voting typically yields more stable and precise predictions than individual models alone. The trade-off is greater complexity: combining various models and developing synthesis strategies may increase the computational complexity and resource demands of the diagnostic system [20-21].

The problem statement of the research is formulated as follows: it is necessary to study the influence of data sampling on the effectiveness of predictive models based on artificial intelligence for diagnosing industrial equipment. The goal of the study is to develop an ensemble of models that solves the pattern recognition problem using different data samples to improve the accuracy of the industrial equipment diagnostic process.

Materials and research methods.

The initial dataset of the research is full of unsystematic and unclear data and values, and it consists of an enormous amount of information. In this section, all the data preparation methods and further classifier development techniques are described.

Sampling methods.

A set of data plays a huge role in solving problems associated with real industrial production. During the operation of control systems, microprocessor technology generates a huge amount of production data: the average time for a controller to scan sensor information is 20 ms, so if a programmable logic controller on a production line serves up to 200 points, the daily technological process observation database becomes enormous. Thus, reducing the dimensionality of the source data and correct data sampling is an urgent task.

In order to observe which kind of sampling is better and more efficient for particle swarm optimization and machine learning methods, it is necessary to choose the correct data sampling types. All of those considered here are probability sampling methods, because they are the best solution for quantitative research. Let us go through each of them briefly.

Simple random sample.

There are many sampling methods, for instance the simple random sample. Simple random sampling is a type of sampling where all the dataset information is used and every member of the data has an equal chance of being selected for further manipulations. The sampling frame must include the entire dataset. To perform this type of sampling, tools such as random number generators or other techniques that rely entirely on chance are usually used.

In Python there is a specific library (random) for performing such sampling methods. Algorithm 1 for simple random sampling is as follows:

Algorithm 1. Data sampling method: simple random sample.

Step 1. Read the csv file with the number of equipment characteristics.

Step 2. Import the Python library for sampling techniques with "import random".

Step 3. Declare the range of the initial dataset.

Step 4. Specify the random sample size as 100.

Step 5. Perform simple random sampling with the "random.sample(x, y)" command.

Step 6. Get the result of the random sampling.
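To make Algorithm 1 concrete, here is a minimal Python sketch using the standard random module. The file name equipment_data.csv and the fixed seed are illustrative assumptions, not details taken from the study.

```python
import csv
import random

# Step 1: read the csv file with the equipment characteristics.
with open("equipment_data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

random.seed(42)                 # optional: makes the draw reproducible
sample_size = 100               # Step 4: random sample size of 100

# Step 5: every row has an equal chance of being selected.
simple_random_sample = random.sample(rows, sample_size)

print(len(simple_random_sample))  # Step 6: 100 randomly chosen rows
```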

A simple illustration of the random sampling method is shown below (Figure 1).

Figure 1. Simple random sampling illustration.

It is clear from Figure 1 that the random sampling technique guarantees that every member of the population or dataset can be chosen for the sampling group without any systematic or statistical rules.

Cluster sample.

Another frequently used data sampling method is cluster sampling. Cluster sampling is similar to stratified sampling in that it divides the whole dataset into subgroups. However, in cluster sampling each subgroup must have characteristics similar to the entire sample. Also, this method randomly selects entire subgroups rather than individuals. Algorithm 2 for cluster sampling is described below:

Algorithm 2. Data sampling method: cluster sampling.

Step 1. Read the csv file with the number of equipment characteristics.

Step 2. Import the Python library for sampling techniques with "import random".

Step 3. Declare the range of the initial dataset.

Step 4. Define the number of clusters and cluster size.

Step 5. Perform randomly selecting of some clusters.

Step 6. Create a list to store the cluster samples.

Step 7. Sample all elements within the selected clusters.

Step 8. Get the result of the cluster sampling.
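A minimal Python sketch of Algorithm 2 is given below: the row range is split into equal-sized clusters, whole clusters are selected at random, and every row inside the selected clusters is kept. The file name, cluster size, and number of selected clusters are illustrative assumptions.

```python
import csv
import random

# Step 1: read the csv file with the equipment characteristics.
with open("equipment_data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

cluster_size = 50               # Step 4: define cluster size and count
clusters = [rows[i:i + cluster_size]
            for i in range(0, len(rows), cluster_size)]

n_selected = 4                  # Step 5: randomly select some clusters
selected = random.sample(clusters, n_selected)

# Steps 6-7: store and sample all elements within the selected clusters.
cluster_sample = [row for cluster in selected for row in cluster]
print(len(cluster_sample))      # Step 8: 4 clusters x 50 rows = 200 rows
```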

Figure 2 presents the structure of the cluster data sampling method.

Figure 2. Cluster sampling illustration (clusters and sampling group).

This data sampling method could be used in industrial equipment diagnostic systems.

Systematic sample.

Systematic sampling is quite similar to simple random sampling but is usually a little easier to perform. Each member of the population (in our terms, each failure of the equipment) is numbered, and then individuals are selected at regular intervals rather than by a random generator. Algorithm 3 describes the systematic sampling method.

Algorithm 3. Data sampling method: systematic sample.

Step 1. Read the csv file with the number of equipment characteristics and machine values.

Step 2. Import the Python library for sampling techniques with "import random".

Step 3. Declare the range of the initial dataset.

Step 4. Specify the random sample size as 100.

Step 5. Calculate the sampling interval.

Step 6. Perform Systematic Sampling.

Step 7. Get the result of the systematic sampling.
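A minimal Python sketch of Algorithm 3 follows: the sampling interval k is computed from the dataset and sample sizes, and every k-th row is taken starting from a random offset. The file name and sizes are illustrative assumptions.

```python
import csv
import random

# Step 1: read the csv file with the equipment characteristics.
with open("equipment_data.csv", newline="") as f:
    rows = list(csv.DictReader(f))

sample_size = 100                       # Step 4: sample size of 100
k = len(rows) // sample_size            # Step 5: sampling interval

start = random.randrange(k)             # random offset within the first interval
systematic_sample = rows[start::k][:sample_size]  # Step 6: every k-th row

print(len(systematic_sample))           # Step 7: the systematic sample
```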

The systematic sampling method is more effective in this study because of its interval-dividing technique, which makes it possible to analyze each equipment value and not drop any important data. The systematic sampling method is illustrated in Figure 3.

Figure 3. Systematic sampling illustration (sampling every 3rd member of the population).

The advantages of this method are its simplicity and quick implementation.

Pattern recognition methods.

Particle swarm optimization (PSO) and ensemble methods were applied as pattern recognition methods. These methods are explained in the following paragraphs.

Particle swarm optimization.

Particle swarm optimization (PSO) is one of the bioinspired algorithms; it searches the solution space for the best possible solution in a straightforward manner. It differs from other optimization techniques in that it does not depend on the gradient or any differential form of the objective and simply requires the objective function; there are also few hyperparameters. A particle swarm optimization operates in this manner: it begins with a number of random locations on the plane (referred to as particles) and lets them search for the minimum point in random directions. At each step, every particle should look around the lowest position it has ever found, as well as the lowest point the entire swarm of particles has ever found. After a specific number of iterations, the lowest point that this swarm of particles has ever investigated is regarded as the minimum point of the function. For a better practical understanding, the pseudocode of the PSO technique is described below [22].

Pseudocode of the PSO algorithm:

Input: a_nm - position of the particle, v_nm - velocity of the particle, i - number of the iteration.

Output: the optimal solution for the particle position.

For each particle n
    For each dimension m
        Initialize position a_nm randomly within the permissible interval
        Initialize velocity v_nm randomly within the permissible interval
    End_for
End_for
Iteration i = 1
Do
    For each particle n
        Calculate the suitability value
        If the suitability value is better than TheBest_nm in history
            Set the current suitability value as TheBest_nm
        End_if
    End_for
    Choose the particle with the best suitability value as TheBest_m
    For each particle n
        For each dimension m
            Calculate the velocity with the equation:
                v_nm(t+1) = v_nm(t) + phi1 * r1 * (TheBest_nm - a_nm(t)) + phi2 * r2 * (TheBest_m - a_nm(t))
            Update the particle position with the equation:
                a_nm(t+1) = a_nm(t) + v_nm(t+1)
        End_for
    End_for
    i = i + 1
While max iteration or min error criteria are not attained.

Here phi1 and phi2 are acceleration coefficients and r1, r2 are random numbers in [0, 1].
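For a concrete illustration, here is a compact NumPy sketch of the pseudocode above, minimizing a toy sum-of-squares objective. The inertia weight w, acceleration coefficients c1 and c2, swarm size, and search interval are common textbook defaults chosen for illustration; they are not settings reported in this study.

```python
import numpy as np

def pso(objective, dim=2, n_particles=30, n_iter=100, w=0.7, c1=1.5, c2=1.5):
    # Initialize positions a_nm and velocities v_nm randomly (illustrative bounds).
    rng = np.random.default_rng(0)
    x = rng.uniform(-5, 5, (n_particles, dim))
    v = rng.uniform(-1, 1, (n_particles, dim))
    best_x = x.copy()                                   # TheBest_nm: personal bests
    best_f = np.apply_along_axis(objective, 1, x)
    g = best_x[best_f.argmin()].copy()                  # TheBest_m: global best
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))      # random factors in [0, 1)
        v = w * v + c1 * r1 * (best_x - x) + c2 * r2 * (g - x)  # velocity update
        x = x + v                                       # position update
        f = np.apply_along_axis(objective, 1, x)
        improved = f < best_f                           # refresh personal bests
        best_x[improved], best_f[improved] = x[improved], f[improved]
        g = best_x[best_f.argmin()].copy()              # refresh global best
    return g, best_f.min()

# The swarm converges toward the origin, the minimum of sum(p**2).
print(pso(lambda p: np.sum(p ** 2)))
```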


Particle Swarm Optimization is a metaheuristic optimization algorithm inspired by the social behavior of flocks of birds and schools of fish. It is often used to solve optimization problems, such as those encountered in research on diagnostics of industrial equipment. In this context, let us present some advantages and disadvantages of using PSO.

Advantages of the method [23]:

1. Global Optimization: PSO is good at finding global optima in complex search spaces. In the context of diagnostic methods, PSO can effectively explore the space of possible diagnostic models or parameters to find the optimal solution.

2. Simple Implementation: PSO is relatively easy to implement and requires minimal parameter tuning compared to other optimization algorithms. This is beneficial in research environments where time and resources are limited.

3. Convergence Speed: PSO often converges to a solution relatively quickly, especially for simple or moderately complex optimization problems. This is beneficial when working with large datasets or performing repeated experiments in device diagnostic studies.

4. Robustness: PSO tends to be robust to noise and is less likely to fall into local optima than some other optimization algorithms. This is important in research on diagnostics of industrial systems, where noisy sensor data and complex interactions between components can make optimization problems difficult.

5. Scalability: PSO can be easily parallelized, allowing efficient optimization on high-performance computing platforms and distributed systems. This is advantageous when dealing with large industrial facilities or performing experiments that require large amounts of computation.

Disadvantages of the method:

1. Premature Convergence: PSO can prematurely converge to suboptimal solutions, especially in highly multimodal optimization environments or misleading optimization landscapes. This can be problematic in device diagnostic studies, where missing the global optimum can result in inaccurate or unreliable diagnostic models.

2. Limited Search: PSO can have difficulty effectively exploring the search space, especially when dealing with high-dimensional or non-convex optimization problems. This limitation could impact the ability to find optimal diagnostic models or parameters in complex industrial systems.

3. Parameter Sensitivity: although PSO requires fewer parameters to be tuned than other optimization algorithms, its performance is influenced by the selection of parameters such as inertia weights and acceleration factors. Finding optimal parameter settings may require experimentation and can be computationally intensive in research environments.

4. Noisy Optimization: PSO may not perform well for optimization problems with noisy or uncertain objective functions. In research on diagnostics of industrial equipment, sensor data can be noisy or incomplete, and this limitation can affect the reliability of diagnostic models obtained with PSO.

5. Lack of Guarantees: the method does not guarantee convergence to the global optimum, especially in non-convex or discontinuous optimization landscapes. This lack of guarantees can be problematic in critical industrial applications where accurate diagnostics are essential for safety and efficiency.

Overall, PSO can be a useful tool when considering diagnostic methods for industrial equipment, offering advantages such as global optimization capabilities, ease of implementation, and speed of convergence.

However, researchers should be aware of its limitations, especially regarding premature convergence, limited exploration, and sensitivity to parameters, and carefully consider whether PSO is suitable for a particular optimization problem.

Ensemble (bagging and vote types).

Ensemble techniques are a significant family of methods in computer science and machine learning that combine numerous base models to produce a better, more robust predictive model. These strategies frequently outperform individual models by utilizing the diversity of the base models and combining their predictions.

Ensemble approaches often rely on a set of base models, known as weak learners or base classifiers/regressors. These base models can be of the same type (homogeneous ensembles) or different types (heterogeneous ensembles), including decision trees, neural networks, support vector machines, and any other machine learning algorithm [24].

1. Bagging (Bootstrap Aggregating) is the process of training multiple base models independently on distinct subsets of the training data (sampled with replacement) and then aggregating their predictions. Random Forest is a common bagging-based ensemble approach that uses decision trees as its base model. However, in this research neural network methods are used for bagging, which is more efficient; a code sketch of this setup follows the advantages and disadvantages listed below.

Advantages of ensemble bagging method [25]:

- Variance reduction: By training numerous base models on distinct subsets of the training data (sampled with replacement), bagging helps minimize the final model's variance. Each base model learns slightly different parts of the data, resulting in a more robust and stable ensemble.

- Improved generalization: bagging improves the generalization performance of the ensemble model. It decreases the risk of overfitting by aggregating predictions from many models trained on different subsets of the data, capturing more generalizable patterns.

- Robustness to Noise: Because bagging integrates predictions from numerous models, it is more resistant to noisy or outlier data. Outliers may have less of an impact on the final prediction due to the averaging or voting procedure.

- Parallelizability: bagging base-model training is easily parallelizable because each model is trained independently. This makes bagging appropriate for distributed computing systems and can result in considerable speedups in model training.

Disadvantages of the Ensemble bagging [26-27]:

- Increased Computational Cost: training multiple base models in bagging can be computationally expensive, especially if the base model is complicated or a large number of models is included in the ensemble. This can limit bagging's scalability on large datasets or in resource-constrained contexts.

- Loss of Interpretability: Because the ensemble incorporates predictions from numerous models, the final model's interpretability may be lower than that of the individual base models.

Understanding the underlying decision-making process of the ensemble may become increasingly difficult, especially with sophisticated bagging schemes or with a high number of base models.
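To make the bagging setup described above concrete, here is a scikit-learn sketch that bags neural-network base models, as the text describes. The synthetic dataset, network size, and ensemble size are placeholder assumptions rather than the study's actual configuration, and the estimator parameter name assumes scikit-learn 1.2 or newer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the sampled equipment data (R = 10x200).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

bagged_nn = BaggingClassifier(
    estimator=MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                            random_state=0),
    n_estimators=10,          # 10 nets, each trained on a bootstrap resample
    random_state=0,
)
print(cross_val_score(bagged_nn, X, y, cv=5).mean())  # averaged accuracy
```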

2. Voting: in classification tasks, each base model predicts a class label, and the final prediction is selected by majority vote. In this research, the following techniques were used for the voting method: Naive Bayes, Neural Net, and Gradient Boosted Trees; a sketch is given below.
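The following sketch combines the three learners named above in a hard-voting ensemble using scikit-learn; the synthetic data and all hyperparameter values are illustrative placeholders, not the study's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the sampled equipment data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

vote = VotingClassifier(
    estimators=[
        ("nn", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                             random_state=0)),
        ("gbt", GradientBoostingClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="hard",            # majority vote on the predicted class labels
)
print(cross_val_score(vote, X, y, cv=5).mean())
```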

In this article, our aim is to collect a dataset of equipment failures and apply various sampling techniques to obtain the desired results. Artificial intelligence and statistical methods are currently used to solve a variety of practical challenges.

Main provisions.

After the data sampling, optimization methods need to be applied to 4 different datasets: the initial data without any changes, the simple random sampled data, the cluster sampled data, and the systematic sampled data, respectively. Then it is more convenient to choose the most suitable sampling method for the further experiment of developing the classifier.

Initial dataset explanation.

The database is taken from the Kaggle equipment diagnostics data repository. It contains machine failures and process characteristics [https://opendatacommons.org/licenses/dbcl/1-0/].

The dataset is a csv file with the equipment characteristics and whether or not a failure happened, with the range R = 10x8091 (80,910 data attributes). It has the following columns (headers):

Number - unique data identification,

Product ID - consisting of a letter L, M, or H for low (50% of all products), medium (30%) and high (20%) as product quality variants and a variant-specific serial number,

Type - type of the equipment L, M or H (described above),

Air temperature - generated using a random walk process later normalized to a standard deviation of 2 K around 300 K,

Process temperature - generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K,

Rotational speed - calculated from a power of 2860 W, overlaid with normally distributed noise,

Torque - torque values are normally distributed around 40 Nm with an SD of 10 Nm and no negative values,

Machine failure - indicates, whether the machine has failed in this particular datapoint for any of the following failure modes,

HDF - heat dissipation failure: heat dissipation causes a process failure if the difference between air and process temperature is below 8.6 K and the tool's rotational speed is below 1380 rpm,

PWF - power failure: the product of torque and rotational speed (in rad/s) equals the power required for the process. If this power is below 3500 W or above 9000 W, the process fails. These two rules are illustrated in a short sketch after Table 1. The initial dataset of the research is shown in Table 1.

Table 1. Fragment of the equipment diagnostic database before applying data sampling:

| N | Product ID | Type | Air temperature | Process temperature | Rotational speed | Torque | Machine failure | HDF | PWF |
|---|------------|------|-----------------|---------------------|------------------|--------|-----------------|-----|-----|
| 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | 0 |
| 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 0 | 0 | 0 |
| 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 0 | 0 | 0 |
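The HDF and PWF rules quoted above can be checked directly on such a table. Below is a small pandas sketch; the file name and the exact column labels are assumptions based on the headers shown in Table 1.

```python
import math

import pandas as pd

# Load the equipment characteristics (file name is an assumption).
df = pd.read_csv("equipment_data.csv")

omega = df["Rotational speed"] * 2 * math.pi / 60      # rpm -> rad/s
power = df["Torque"] * omega                           # mechanical power, W

# HDF: temperature difference below 8.6 K and speed below 1380 rpm.
df["HDF"] = ((df["Process temperature"] - df["Air temperature"] < 8.6)
             & (df["Rotational speed"] < 1380)).astype(int)

# PWF: required power outside the 3500-9000 W operating window.
df["PWF"] = ((power < 3500) | (power > 9000)).astype(int)

print(df[["HDF", "PWF"]].sum())                        # failure counts per mode
```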

In our dataset there are 1000 rows with various failure modes. It would be more appropriate to divide this set of failures and conduct research on every classified part of them.

The 3D scatter plot of the initial database is presented in Figure 4.

Figure 4. The 3D scatter illustration of the initial dataset.

In Figure 4 it can be observed that the sensor values are similar and many of them are located close to each other, because the initial dataset is huge and contains 10,000 values for each piece of equipment under various conditions. In order to make an appropriate experiment, it is necessary to sample this dataset and split it into short but effective subsets.

Simple random sampled dataset.

Table 2 presents the random sampled data. In this table, "N" is the randomized serial number of the equipment; the rows are listed in randomly sampled order.

Table 2. Simple random sampling dataset.

| N | Product ID | Type | Air temperature | Process temperature | Rotational speed | Torque | Machine failure | HDF | PWF |
|---|------------|------|-----------------|---------------------|------------------|--------|-----------------|-----|-----|
| 894 | M15753 | M | 295.7 | 306.2 | 1423 | 42.5 | 0 | 0 | 0 |
| 1673 | L48852 | L | 298.1 | 307.8 | 1432 | 49.8 | 0 | 0 | 0 |
| 2344 | L49523 | L | 299.1 | 308.3 | 1305 | 61.4 | 0 | 0 | 0 |

Simple random sampling was implemented by the Python algorithm explained in the previous sections. Using random sampling, a new dataset with range R = 10x200 was generated.

The 3D scatter plot of the random sampled data is shown in Figure 5 below.


Figure 5. The 3D scatter illustration of the simple random sampled dataset.

In the figure above it can be seen that the data values are located separately from each other, which means that the random sampling was implemented correctly.

Cluster sampling dataset.

Cluster sampling was implemented in Python by the algorithm explained in the previous sections. Table 3 shows the result of the cluster sampling method:

Table 3. Cluster sampling dataset.

| N | Product ID | Type | Air temperature | Process temperature | Rotational speed | Torque | Machine failure | HDF | PWF |
|---|------------|------|-----------------|---------------------|------------------|--------|-----------------|-----|-----|
| 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | 0 |
| 7 | L47186 | L | 298.1 | 308.6 | 1558 | 42.4 | 0 | 0 | 0 |
| 8 | L47187 | L | 298.1 | 308.6 | 1527 | 40.2 | 0 | 0 | 0 |

The table above shows that the values are selected differently than in random sampling, because cluster sampling first divides the data into cluster groups and then randomizes only the selected clusters. The cluster sampling method also generated a new dataset with range R = 10x200. That range is more specific and convenient for use in further optimization techniques. The 3D scatter representation of the cluster sampled dataset is given in Figure 6.

Figure 6. The 3D scatter illustration of the cluster sampled dataset.

Figure 6 shows that the sensor values are located separately but are also huddled together in some places, as they were sampled by clusters. It should be noted that the selected cluster groups are randomized before the final result is obtained, which is why the values in Figure 6 are separated as in random sampling.

Systematic sampling dataset.


Systematic sampling was also implemented in Python with the algorithm explained above. Table 4 presents the systematic sampled data.

Table 4. Systematic sampling dataset.

| N | Product ID | Type | Air temperature | Process temperature | Rotational speed | Torque | Machine failure | HDF | PWF |
|---|------------|------|-----------------|---------------------|------------------|--------|-----------------|-----|-----|
| 8 | L47187 | L | 298.1 | 308.6 | 1527 | 40.2 | 0 | 0 | 0 |
| 18 | M14877 | M | 298.7 | 309.2 | 1410 | 45.6 | 0 | 0 | 0 |
| 29 | L47208 | L | 299.1 | 309.4 | 1439 | 44.2 | 0 | 0 | 0 |

By applying systematic sampling to the initial dataset, a new dataset with the range R = 10x200 was also obtained. In systematic sampling, the initial dataset must be divided by a specific interval. The 3D scatter plot of the systematic sampled dataset is shown in Figure 7.

Figure 7. The 3D scatter illustration of the systematic sampled dataset.

Systematic sampling is close to random sampling in its logic, but the sampling is implemented by a specific number or character, so there is a higher probability that values from every condition will be selected.

Results and discussion.

After implementing the sampling techniques on the initial dataset, classifiers based on PSO and ensemble methods were developed; their results are provided in this section.

Development of a classifier based on the PSO algorithm.

After applying those three methods of sampling, it is necessary to apply particle swarm optimization to each subgroup of the dataset. By comparing them with each other, it is possible to investigate which sampling method gives better results with PSO techniques. Figure 8 represents the implementation of the PSO process:

Figure 8. PSO modeling process.

In equipment diagnostic systems it is crucial to determine the influence factors of the dataset, because not all equipment values cause a failure. In PSO modeling, a weighting algorithm is used to determine the important factors, which are also presented in Figure 8.

Development of a classifier based on an ensemble bagging method.

Figure 10 represents the bagging modeling process, where, in addition to bagging, a cross-validation function was used to connect with the dataset and its labels.

Figure 10. Ensemble bagging modeling process.

In the training of the method, 100 to 200 training cycles are used to observe how fast the method can learn over time. The results of all applied methods are shown in the next section.

Modelling results and metrics.

First, Table 5 presents the results of the PSO and ensemble bagging methods for each sampled dataset.

Table 5. The results of the PSO and Ensemble bagging methods implemented with initial and sampled datasets.

| Dataset | Method type | Accuracy, % | Classification error, % | Recall, % | Precision, % |
|---------|-------------|-------------|-------------------------|-----------|--------------|
| Initial | PSO | 77.63 | 34 | 76 | 81 |
| Initial | Ensemble (bagging) | 89.72 | 28 | 80 | 83 |
| Random sampled | PSO | 81.88 | 23 | 83.06 | 83.76 |
| Random sampled | Ensemble (bagging) | 90.54 | 13 | 89 | 91 |
| Cluster sampled | PSO | 85.73 | 19 | 89.42 | 82.57 |
| Cluster sampled | Ensemble (bagging) | 91.62 | 8 | 92.07 | 89.5 |
| Systematic sampled | PSO | 89.17 | 21 | 85 | 89 |
| Systematic sampled | Ensemble (bagging) | 93.808 | 11 | 88 | 91 |

The ROC (Receiver Operating Characteristic) comparison is also presented, because it is an efficient way to compare classification methods. The area under the ROC curve (AUC-ROC) is used to quantify a classifier's overall performance. A higher AUC value (closer to 1) suggests that the model distinguishes between positive and negative cases more accurately.
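As a sketch of how such curves can be produced, the following scikit-learn snippet fits a bagging classifier on synthetic stand-in data and computes its ROC curve and AUC; none of the names or values come from the study itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a sampled equipment dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = BaggingClassifier(random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]      # probability of the failure class

fpr, tpr, thresholds = roc_curve(y_te, scores)   # points of the ROC curve
print(roc_auc_score(y_te, scores))          # closer to 1 = better separation
```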

Figure 11. ROC thresholds of (a) the initial dataset and (b) the random sampled dataset, by the ensemble bagging model.

Figure 12. ROC thresholds of (c) the cluster sampled dataset and (d) the systematic sampled dataset, by the bagging model.

Figures 11 and 12 show that the bagging models for (b) the random sampled and (d) the systematic sampled data are more efficient, and their curves tend toward one faster during learning.

Comparing different models: in machine failure prediction research, numerous models are frequently constructed and tested to determine the most successful strategy. The ROC comparison method allows researchers to objectively examine and compare the performance of various models. Researchers can decide which model has the best predictive ability by studying its ROC curves and AUC values.

Ensemble VOTE.

Ensemble voting combines the strengths of various distinct models, which may mitigate the faults of any single model. Ensemble approaches, which aggregate forecasts from multiple models, frequently produce more accurate predictions than any particular model alone. In our research, Neural Net, Gradient Boosted Trees, and Naïve Bayes were used as components of the ensemble vote method. Figure 13 shows the modeling process of the ensemble vote technique.

Figure 13. Ensemble Vote modeling process.

On top of the predictions made by the base learners in its subprocess, this operator applies a majority vote (for classification) or an average (for regression). The ensemble voting results with the initial and sampled datasets are shown in Table 6.

Table 6. The results of the Ensemble vote method implemented with initial and sampled datasets.

| Dataset | Method type | Accuracy, % | Classification error, % | Recall, % | Precision, % |
|---------|-------------|-------------|-------------------------|-----------|--------------|
| Initial | Ensemble (vote) | 86.94 | 19 | 89.07 | 85.78 |
| Random sampled | Ensemble (vote) | 92.20 | 14 | 90.06 | 91.66 |
| Cluster sampled | Ensemble (vote) | 90.44 | 11 | 91.2 | 91.66 |
| Systematic sampled | Ensemble (vote) | 93.6 | 8 | 94.32 | 93.87 |

From the table above it is clear that the ensemble vote methods are more effective than the others; moreover, the combination of the vote method with the systematic sampling technique gives the best result for the machine failure prediction experiment.

Furthermore, for a better comparison of the experimental results, ROC comparison curves were extracted from the implemented vote model. The ROC (Receiver Operating Characteristic) comparison is an important technique for evaluating the performance of machine learning models, especially in tasks such as machine failure prediction. It provides a thorough examination of the trade-offs between the true positive rate (sensitivity) and the false positive rate (1-specificity) at various thresholds. The ROC comparison of the best result (the vote model with the systematically sampled dataset) is shown in Figure 14.

Figure 14. ROC comparison of vote model with systematic sampled dataset.

In this research, Naïve Bayes and Gradient Boosted Trees were implemented in the ensemble vote modeling. The results of the ROC comparison are shown below, where the Fast Large Margin, Deep Learning, and Random Forest methods are also added for comparison (Figure 15).

Figure 15. ROC comparison of various types of classification.

As illustrated, the ROC curve is a graphical representation of a classification model's performance under different threshold settings. The ROC curves for the models below rise sharply from the lower-left corner, showing that they achieve high true positive rates (sensitivity) while maintaining relatively low false positive rates (1-specificity) across various threshold settings. This shows that Naïve Bayes and Gradient Boosted Trees are highly predictive and successfully distinguish between positive and negative examples.

Finally, in this research, various machine-learning algorithms for predicting equipment failure in industrial production were studied. Throughout our research, a variety of predictive modeling strategies were investigated, including particle swarm optimization (PSO), ensemble bagging, and ensemble voting. Each technique was assessed based on its capacity to distinguish between normal functioning and failure occurrences, with an emphasis on maximizing predictive accuracy and robustness. Our experiment shows that, while PSO and ensemble bagging showed promise in capturing underlying patterns in the data and generating individual predictive models, the ensemble vote modeling approach emerged as the most effective and reliable method for machine failure prediction in our setting.

The ensemble vote modeling strategy, which combines the predictions of Neural Net, Gradient Boosted Trees, and Naïve Bayes models via a voting process, outperformed the other methods tested.

Conclusion.

Industrial automation systems are characterized by a large amount of production data generated in real time. For example, the scanning cycle of a programmable logic controller averages 20 ms; if the control loop contains 200 points, the automation system reads 12,000 data values per minute. Most of the generated data is archived and is not used to predict the condition of equipment because of the large dimensionality of the data. Thus, the development of new and improved classification models using different data samples is relevant. Industrial equipment has its own operating specifics, and the task of data sampling is to preserve the properties and dynamics of the control object as much as possible. The scientific novelty of this research lies in the development of an improved classifier based on systematically sampled data and the construction of an ensemble of models including Neural Net, Gradient Boosted Trees, and Naïve Bayes.

The following operations were completed during the research:

- The initial database was processed using three different data samples: simple random, cluster, and systematic sampling,

- The best data sampling method was selected, in combination with which the classifier achieves the best modeling results,

- The particle swarm method and ensemble models were chosen as classifiers. In the process of studying the properties of the algorithms, their advantages and disadvantages, the particle swarm method was chosen as the most suitable for working with specific production data. Several types of ensemble construction based on Bagging and Voting models were considered. It was proven that the database after systematic sampling, together with the voting-type ensemble, showed the best results. Building an ensemble makes it possible to compensate for the shortcomings of one algorithm with the advantages of the next one in the ensemble. A comparative analysis of the application of these methods based on metrics was carried out. It has been proven that a classifier based on systematic data sampling and a voting-type ensemble is the most effective for diagnosing industrial equipment.

Ensemble voting substantially reduced the limitations of individual models, such as overfitting and model variability, while improving prediction accuracy and generalization capability. Finally, the resulting classifier provides a robust and dependable framework for preventive maintenance methods by leveraging the collective wisdom of multiple models, resulting in improved operational efficiency, reduced downtime, and increased productivity in industrial settings.

REFERENCES:

1. J. Jiao, M. Zhao, J. Lin and C. Ding, IEEE Trans. Ind. Electron., 66, 92-98, (2019). https://doi.org/10.1109/TIE.2019.2902817;
2. Y. Lei, F. Jia, J. Lin, S. Xing and S. X. Ding, IEEE Trans. Ind. Electron., 78, (2019). http://dx.doi.org/10.1109/TIE.2016.2519325;
3. Dimitar P. Filev, Ratna Babu Chinnam, Finn Tseng and Pundarikaksha Baruah, IEEE Transactions on Industrial Informatics, 6, 4, (2010). https://doi.org/10.1109/TII.2010.2060732;
4. V. Venkatasubramanian, R. Rengaswamy, K. Yin and S. N. Kavuri, Computers & Chemical Engineering, Part I: Quantitative model-based methods, 27, no. 9, 293-311, (2003);
5. I. Hwang, S. Kim, Y. Kim and C. E. Seah, IEEE Transactions on Control Systems Technology, 18, 3, 636-653, (2010);
6. Y. Lei, J. Lin, Z. He and M. J. Zuo, Measurement, 35, 1, 108-126, (2014). http://dx.doi.org/10.1016/j.measurement.2013.11.012;
7. P. Henriquez, J. B. Alonso, M. A. Ferrer and C. M. Travieso, IEEE Transactions on Systems, Man, and Cybernetics: Systems, 44, 5, 642-652, (2014);
8. R. Yan, R. X. Gao and X. Chen, Signal Processing, 96, 1-5, (2014);
9. V. Venkatasubramanian, R. Rengaswamy, S. N. Kavuri and K. Yin, Computers & Chemical Engineering, Part III: Process history based methods, 27, no. 3, 327-346, (2003);
10. S. Yin, S. X. Ding, X. Xie and H. Luo, IEEE Transactions on Industrial Electronics, 61, 11, 6418-6428, (2014);
11. S. X. Ding, Journal of Process Control, 24, no. 2, 431-449, (2014);
12. Z. Gao, C. Cecati and S. X. Ding, IEEE Transactions on Industrial Electronics, Part I: Fault diagnosis with model-based and signal-based approaches, 62, no. 6, 3757-3767, (2015);
13. Z. Gao, C. Cecati and S. X. Ding, IEEE Transactions on Industrial Electronics, 62, 6, 3768-3774, (2015);
14. Andrew A. Evstifeev and Margarita A. Zaeva, 190, 241-245, Elsevier B.V., (2021);
15. Dimitris Mourtzis, John Angelopoulos, and Nikos Panopoulos, 54, 166-171, Elsevier B.V., (2020);
16. Saeed Rajabi, Mehdi Saman Azari, Stefania Santini, and Francesco Flammini, Expert Systems with Applications, 206, (2022);
17. Teerawat Thepmanee, Sawai Pongswatd, Farzin Asadi, and Prapart Ukakimaparn. Implementation of control and SCADA system: Energy Reports, no. 8, 934-941, (2022);
18. Dietterich, T. G. (2000). Ensemble methods in machine learning. Multiple Classifier Systems, 1857, 1-15;
19. Zhou, Z. H. (2012). Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC;
20. Kuncheva, L. I. (2004). Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons;
21. Rokach, L. (2010). Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2), 1-39;
22. Kadi Mohamed, Amine Naim, Akkouche Naim, Akkouche Sary, Awad Sary and Awad Show, (2019). http://dx.doi.org/10.1016/j.heliyon.2019.e02146;
23. Imran Rahman Pandian, Vasant Balbir, Singh Mahinder and Abdullah-Al-Wadud, Alexandria Engineering Journal, (2016). http://dx.doi.org/10.1016/j.aej.2015.11.002;
24. Ponni Ponnusamy and Prabha Dhandayudam, Journal of Electrical Engineering and Technology, (2023);
25. Ali Aldrees, Hamad Hassan Awan, Arbab Faisal and Abdeliazim Mustafa Mohamed, Process Safety and Environmental Protection, (2022);
26. Sinem Bozkurt and Kemal Keskin, (2022);
27. Sriparna Saha and Asif Ekbal. Combining multiple classifiers using vote based classifier ensemble technique for named entity recognition. Data & Knowledge Engineering, (2018).
