DOI: 10.24412/2413-2527-2022-432-36-40
Russian version of the article © V. A. Prourzin, O. V. Prourzin is published
in Intelligent Technologies in Transport, 2022, No. 1 (29), Pp. 34-38. DOI: 10.24412/2413-2527-2022-129-34-38.
Algorithms for Big Data Analysis of Reliability of Recoverable Multichannel Systems
PhD V. A. Prourzin, Institute for Problems in Mechanical Engineering of the Russian Academy of Sciences, Saint Petersburg, Russia, [email protected]
PhD O. V. Prourzin, Emperor Alexander I St. Petersburg State Transport University, Saint Petersburg, Russia, [email protected]
Abstract. Computer systems for monitoring the technical condition of transport accumulate, among other things, big data on the reliability of individual devices and elements. This makes it possible to calculate system reliability indicators without resorting to costly testing. Methods for analyzing big reliability data of recoverable multichannel systems are considered here. The big data contain values of operating time to failure and of recovery time of system elements obtained by monitoring the operation of similar systems in service. The distribution laws of failures and restorations of the system elements are unknown and can be arbitrary. Algorithms for assessing the reliability indicators of recoverable systems are considered that take into account the diversity, unreliability and variability of the data. In the case of monotonic systems with independent recovery of elements, estimation of the availability factor and the mean time between failures of the system is reduced to estimating the mean time to failure and the mean recovery time of each element of the system for arbitrary distribution laws.
Keywords: computer monitoring, big data, multichannel system, robustness, availability factor, mean time between failures, mean time to recovery.
Introduction
From the point of view of reliability theory, a complex system is a set of interconnected technical devices that interact during operation. A large number of scientific publications are devoted to mathematical models of the reliability of complex, multichannel, and cluster systems (see, for example, [1-7]).
An important issue is the assessment of the reliability characteristics of a complex system as a whole: mean time between failures and the availability factor. There is a well-known logical-probabilistic approach [4, 6], based on the representation of system failures and restorations as random binary events. System failures and recoveries depend on a number of primary binary random events (failure and recovery of elements). To date, a sufficiently effective apparatus has been developed for solving problems of this kind. This requires information about the distribution laws of the operating time to failure and the recovery time of each element.
Traditionally, the reliability indicators of elements are assessed through tests, which are costly and time-consuming. On the other hand, computer monitoring of the operation of existing technical objects makes it possible to collect a huge database of reliability indicators, in particular, data on operating time to failure and recovery time. The approaches and methods of working with such huge databases constitute the content of computer technologies for working with big data [8]. The variety, validity, and variability of big data are at the heart of big data analytics. The main difficulty in analyzing operational reliability data is, firstly, that the values are obtained under different loads and different distribution laws of failures and restorations and, secondly, that unreliable and anomalous data are present.
The purpose of this work is to develop computer methods for assessing the main reliability indicators of recoverable multichannel systems, namely the availability factor, the mean time between failures, and the mean system recovery time. The indicators are estimated from computer monitoring data on failures and restorations of similar products in operation. The variety of real distribution laws of failures and restorations of elements, the variety of operational loads, and the presence of unreliable data are taken into account.
Statement of the problem
A model of a system consisting of n nodes (elements) is considered. The nodes form a monotonic structural diagram of system operability, for example, a circuit with a series-parallel connection. The failures and restorations of each node are independent and form an alternating renewal process with some distribution functions.
Let, as a result of monitoring the operation of the system itself or of analogs of the nodes under consideration, a set of data on the operating time to failure and the duration of restoration of each element of the system be obtained. For the j-th element, the values of the operating time to failure tji, j = 1, ..., n; i = 1, ..., Nj, and the values of the repair duration sji, j = 1, ..., n; i = 1, ..., Mj, are given. The laws of distributions of failures and restorations, as well as data on operational loads, are unknown.
Further, we will proceed from the fact that the operating conditions of the systems under consideration are regulated and, in general, they can be considered close. This allows us to assert that the data on the reliability of technical products obtained from various sources will, on average, be homogeneous in terms of operating conditions. Data associated with non-standard operating conditions and other abnormal data should be
identified and excluded during rejection when analyzing the entire sample.
The problem is to estimate the system availability factor K, the mean time between failures Tc of the system, and the mean time to restore the system TR.
Methods for solving the problem
Rejection of anomalous data. The main problem facing the data scientist is cleaning and normalizing these data. In the case under consideration, this problem consists, first of all, in rejecting samples generated by too small or too large loads.
Let us consider the problem of estimating the mean operating time to failure T of a certain element of the system based on a sample of values of its operating time to failure: {t1, t2, ..., tN}. In the presence of unreliable data and «drift» of the distribution laws that generate the data, the estimate of the sample position parameter (mean value) produced by the arithmetic mean is unstable. To solve this problem, procedures for rejecting anomalous data and methods of robust estimation of the position parameter are used [9, 10].
The simplest classical algorithm for rejecting a sample value t suspected of being an outlier is the 3-sigma rule. A sample element t is considered anomalous if the following inequality holds: |t − t̄| > 3s, where t̄ = (t1 + t2 + ... + tN)/N is the sample mean and s = √(Σ(ti − t̄)²/(N − 1)) is the sample standard deviation.
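A minimal sketch of this rejection rule (the sample values and the helper name are illustrative, not from the article):

```python
import statistics


def three_sigma_filter(sample):
    """Keep only values within 3 standard deviations of the sample mean."""
    mean = statistics.mean(sample)
    std = statistics.stdev(sample)  # sample standard deviation (N - 1 in the denominator)
    return [t for t in sample if abs(t - mean) <= 3 * std]


# Illustrative operating times to failure, in hours, with one gross outlier.
times = [8500, 9100, 8800, 9300, 8700, 9000, 8600, 9200, 8900, 8400, 9050, 60000]
print(three_sigma_filter(times))  # the 60 000 h value is rejected, the rest are kept
```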
Comparatively new approaches to rejecting anomalous data are based on Tukey's exploratory data analysis, namely the Tukey boxplot and its modifications [10]. The lower tL and upper tU rejection thresholds in the Tukey boxplot are set as follows:
tL = max{t(1); LQ − 2·IQR},
tU = min{t(N); UQ + 3·IQR}.
Here t(1) and t(N) are the extreme order statistics of the sample (the k-th order statistic is the k-th value of the initial sample sorted in ascending order), IQR = UQ − LQ is the sample interquartile range, and LQ = t([N/4]), UQ = t([N − N/4]) are the sample lower and upper quartiles. The rejection rule is: the value t is anomalous if t < tL or t > tU.
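A sketch of this fencing rule with the thresholds written as above; the simple quartile indexing below is one possible reading of t([N/4]) and t([N − N/4]), and the data are illustrative:

```python
def tukey_filter(sample):
    """Reject values outside the Tukey-type fences t_L and t_U described above."""
    t = sorted(sample)                 # t[0] = t_(1), t[-1] = t_(N)
    n = len(t)
    lq = t[max(n // 4 - 1, 0)]         # lower quartile, roughly t_([N/4])
    uq = t[n - n // 4 - 1]             # upper quartile, roughly t_([N - N/4])
    iqr = uq - lq                      # interquartile range
    t_low = max(t[0], lq - 2 * iqr)
    t_up = min(t[-1], uq + 3 * iqr)
    return [x for x in sample if t_low <= x <= t_up]


times = [8500, 9100, 8800, 9300, 8700, 9000, 8600, 9200, 8900, 8400, 9050, 60000]
print(tukey_filter(times))  # the 60 000 h value falls above the upper fence and is dropped
```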
Robust position parameter estimates. In the statistical analysis of big data, robust estimation methods are used to ensure the stability of the position parameter estimate [10]. Robustness is a property of a statistical procedure to be resistant to uncontrolled deviations from the accepted data distribution models.
The two-stage robust estimation procedure is as follows. At the first stage, outliers are rejected using the three sigma rule or Tukey's boxplot. At the second stage, the position parameter is estimated by calculating the sample mean for the remaining sample elements.
There are also methods for estimating the position parameter that are resistant to the presence of outliers, namely robust methods of mathematical statistics. The simplest robust estimate of the position parameter is the sample median:
T̂ = med{ti} = t(k+1),             N = 2k + 1;
T̂ = med{ti} = (t(k) + t(k+1))/2,  N = 2k.          (1)
A well-known approach to constructing robust estimates was proposed by Huber; it is based on the minimax principle of constructing the best solution in the worst situation. The Huber estimate of the position parameter is
T̂ = ((n2 − n1)·k + Σ{|ti − T̂| ≤ k} ti) / (N − n1 − n2),          (2)
where the sum is taken over the observations ti satisfying |ti − T̂| ≤ k; k is the permissible deviation from the center of the distribution (for example, one can take k = 1.5s); n1 is the number of observations from the sample lying in the interval (−∞, T̂ − k); n2 is the number of observations lying in the interval (T̂ + k, +∞).
When calculating by formula (2), the usual arithmetic mean or the median (1) can be used as the initial estimate T̂. Then, at each iteration, the sample is divided into the three parts indicated above and the estimate is recalculated by formula (2) until the procedure converges.
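A sketch of this iterative procedure, using the median (1) as the starting point and k = 1.5s; the fixed-point form of formula (2), the tolerance, and the data are illustrative assumptions:

```python
import statistics


def huber_location(sample, tol=1e-6, max_iter=100):
    """Iterative Huber-type estimate of the position parameter, formula (2)."""
    k = 1.5 * statistics.stdev(sample)        # permissible deviation from the centre
    estimate = statistics.median(sample)      # initial estimate, formula (1)
    for _ in range(max_iter):
        n1 = sum(1 for t in sample if t < estimate - k)   # observations in (-inf, T - k)
        n2 = sum(1 for t in sample if t > estimate + k)   # observations in (T + k, +inf)
        inner = [t for t in sample if abs(t - estimate) <= k]
        new_estimate = ((n2 - n1) * k + sum(inner)) / (len(sample) - n1 - n2)
        if abs(new_estimate - estimate) < tol:
            break
        estimate = new_estimate
    return estimate


times = [8500, 9100, 8800, 9300, 8700, 9000, 8600, 9200, 8900, 8400, 9050, 60000]
print(huber_location(times))  # far less distorted by the 60 000 h outlier than the arithmetic mean (about 13 129 h)
```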
Using the described methods gives us stable unbiased estimates of the mean time Tj to failure of the j-th element of the system. Estimates of the average recovery time Sj of the j-th element of the system are obtained in a similar way. Based on these values, an estimate of the availability factor of the j-th element is constructed, which does not depend on the type of the laws of distribution of failures and restorations [6]:
Kj = Tj / (Tj + Sj).          (3)
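For example, a per-element estimate can be obtained as follows (a sketch; the sample median stands in here for any of the robust position estimates above, and the data are illustrative):

```python
import statistics


def element_availability(failure_times, repair_times):
    """Formula (3): K_j = T_j / (T_j + S_j) from robust estimates of the element means."""
    t_j = statistics.median(failure_times)   # robust estimate of the mean time to failure T_j
    s_j = statistics.median(repair_times)    # robust estimate of the mean recovery time S_j
    return t_j / (t_j + s_j)


print(element_availability([8500, 9100, 8800, 60000], [700, 760, 30, 720]))  # about 0.927
```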
Logical structural function of the system performance.
When analyzing the reliability of complex systems, it is convenient to use structural diagrams of system operability [4, 6, 7]. For example, if a system failure occurs when at least one element fails, then such a diagram is a series connection of elements. If the system is operational when at least one element is operational, then we have the case of a parallel connection of elements (loaded reserve). More complex diagrams can also be considered, including serial and parallel subsystems and a bridge connection (Figure 1).
Fig. 1. An example of a structural diagram of the operability of a monotonic system
A system failure is a random event described by a binary (Boolean) variable X, which takes one of two values: 0 (failure) or 1 (operation). This event depends on n simple independent events described by binary variables xi (operation or failure of the i-th element). A logical structural function of the system performance is introduced, which specifies the dependence of the state of the system X on the states of its elements:
X = φ(x1, x2, ..., xn).
For example, for a circuit of n series-connected elements (Figure 2), the structure function is the product of all binary variables: X = x1·x2···xn.
Fig. 2. A structural diagram of the operability of n series-connected elements
For a circuit of n parallel-connected elements (Figure 3), the structure function is
X = 1 - (1-x1)(1-x2)...(1-xn).
Fig. 3. A structural diagram of the operability of n parallel-connected elements
In what follows, only systems whose structural functions have the monotonicity property are considered [7].
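For illustration, the series and parallel structure functions can be written directly from these definitions (a sketch; the element states are the binary variables xi):

```python
from math import prod


def phi_series(x):
    """Series connection: the system operates only if every element operates."""
    return prod(x)


def phi_parallel(x):
    """Parallel connection (loaded reserve): the system operates if at least one element operates."""
    return 1 - prod(1 - xi for xi in x)


states = [1, 0, 1]                                # element 2 has failed
print(phi_series(states), phi_parallel(states))   # prints: 0 1
```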
The following important results are known [7] for monotonic systems with independent failures and restorations of elements. Suppose that for each element of the system the mean time to failure Tj, the mean recovery time Sj, and the corresponding availability factor Kj are known. Then:
1. The availability factor K of a monotonic system is equal to the value of the structural function of the availability factors of the system elements
K = φ(K1, K2, ..., Kn).          (4)
2. Mean time between failures Tc of a system with individual independent recovery of elements is calculated by the formula
Tc = K/Λc = φ(K1, K2, ..., Kn)/Λc,          (5)

where Λc is the reduced rate of system failures:

Λc = Σ{j=1..n} [φ(K1, ..., 1j, ..., Kn) − φ(K1, ..., 0j, ..., Kn)] / (Tj + Sj),

and 1j (0j) means that the j-th argument of φ is set to 1 (0).
3. The average system recovery time is determined by the following expression:
TR = Tc·(1 − K)/K = Tc·[1 − φ(K1, K2, ..., Kn)] / φ(K1, K2, ..., Kn).          (6)
All the above expressions do not depend on the form of the laws of distribution of failures and restorations of elements.
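The results (4)-(6) translate into a short routine that works for any monotone structure function φ passed in as a callable; the reduced failure rate is computed in the pivotal form reconstructed above, so this is a sketch under that assumption, not a definitive implementation:

```python
def system_indicators(phi, T, S):
    """Return (K, Tc, TR) of a monotonic system from element means T_j and S_j."""
    K_el = [t / (t + s) for t, s in zip(T, S)]       # formula (3) for each element
    K = phi(K_el)                                    # formula (4): system availability factor
    lam = 0.0                                        # reduced rate of system failures
    for j in range(len(K_el)):
        up = K_el[:j] + [1.0] + K_el[j + 1:]         # j-th element forced operable
        down = K_el[:j] + [0.0] + K_el[j + 1:]       # j-th element forced failed
        lam += (phi(up) - phi(down)) / (T[j] + S[j])
    Tc = K / lam                                     # formula (5): mean time between failures
    TR = Tc * (1 - K) / K                            # formula (6): mean recovery time
    return K, Tc, TR
```

For the three-computer cluster of the example below, this routine reproduces K ≈ 0.99956, Tc ≈ 547 581 h and TR = 240 h.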
Example. Let a cluster computing system consist of three identical computers working in parallel, n = 3. The structural function of the system is as follows:
φ(x1, x2, x3) = 1 − (1 − x1)(1 − x2)(1 − x3).
Let the mean time to failure of one computer be equal to one year, T0 = 8 760 hours, and the mean recovery time after a failure be equal to a calendar month, S0 = 720 hours.
Using formulas (3)-(6), we obtain the following. The availability factor of each computer is

K0 = 8 760 / (8 760 + 720) = 0.9240506329.

The system availability factor is

K = 1 − (1 − K0)³ = 0.9995619008.

The reduced rate of system failures is

Λc = [3 / (T0 + S0)]·[1 − (1 − (1 − K0)²)] = 1.8254 × 10⁻⁶ h⁻¹.

The mean system time between failures is

Tc = K/Λc = 547 581 h.

The mean system recovery time is TR = 240 h.
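The numbers above can be checked with a few lines (a sketch reproducing formulas (3)-(6) for this particular cluster):

```python
T0, S0, n = 8760.0, 720.0, 3               # element means and the number of computers

K0 = T0 / (T0 + S0)                        # formula (3): ~0.9240506
K = 1 - (1 - K0) ** n                      # formula (4): ~0.9995619
lam = n * (1 - K0) ** 2 / (T0 + S0)        # reduced failure rate: ~1.8254e-6 1/h
Tc = K / lam                               # formula (5): ~547 581 h
TR = Tc * (1 - K) / K                      # formula (6): 240 h
print(K0, K, lam, Tc, TR)
```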
Algorithm for solving the problem
1. Determination of the composition of the elements included in the system, construction of the operability diagram and the logical structural function of the system operability.
2. Extraction of operating time values and recovery times of these elements from the big data of monitoring system elements. Drawing up initial samples {tji} and {sji}.
3. Obtaining robust estimates of the position parameters for each sample: mean time to failure Tj and mean recovery time Sj. For this, either a two-stage estimation procedure or robust methods (1), (2) are used.
4. Calculation of the availability factor Kj of each element according to the formula (3).
5. Calculation of the system availability factor K according to the formula (4).
6. Calculation of the mean time between failures Tc of the system according to the formula (5).
7. Calculation of the average recovery time TR of the system according to the formula (6). A compact end-to-end sketch of these steps is given below.
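A compact end-to-end sketch of the algorithm for a three-channel structure; the monitoring samples, the choice of the 3-sigma two-stage estimator, and all names are illustrative assumptions:

```python
import statistics
from math import prod


def robust_mean(sample):
    """Steps 2-3: reject 3-sigma outliers, then average the remaining values."""
    m, s = statistics.mean(sample), statistics.stdev(sample)
    return statistics.mean([t for t in sample if abs(t - m) <= 3 * s])


def phi(x):
    """Step 1: structure function of the system; here three channels in loaded reserve."""
    return 1 - prod(1 - xi for xi in x)


# Step 2: per-element monitoring samples of operating time to failure and repair time, in hours.
failure_samples = [[8100, 9400, 8700, 9900, 8300],
                   [7900, 9100, 9600, 8800, 8500],
                   [9200, 8600, 9000, 8400, 9700]]
repair_samples = [[700, 760, 680, 740, 720],
                  [650, 700, 780, 710, 690],
                  [730, 690, 750, 700, 720]]

T = [robust_mean(sample) for sample in failure_samples]   # step 3: mean time to failure per element
S = [robust_mean(sample) for sample in repair_samples]    # step 3: mean recovery time per element
K_el = [t / (t + s) for t, s in zip(T, S)]                # step 4: formula (3)
K = phi(K_el)                                             # step 5: formula (4)
lam = sum((phi(K_el[:j] + [1.0] + K_el[j + 1:]) -
           phi(K_el[:j] + [0.0] + K_el[j + 1:])) / (T[j] + S[j])
          for j in range(len(K_el)))                      # reduced rate of system failures
Tc = K / lam                                              # step 6: formula (5)
TR = Tc * (1 - K) / K                                     # step 7: formula (6)
print(f"K = {K:.6f}, Tc = {Tc:.0f} h, TR = {TR:.1f} h")
```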
Conclusion
Computer methods for monitoring the technical state of complex systems provide us with data on the reliability of these systems during operation. These data constitute a huge amount of information. The analysis and processing of such volumes of information make up the content of big data science. Here, methods for assessing the main indicators of the reliability of recoverable systems are considered under the conditions of a variety of real
laws of distribution of failures and restorations of elements, a variety of operational loads and the presence of unreliable data. Algorithms for assessing the availability factor, mean operating time between failures and mean time to restore the system based on real data from the operation of system elements are presented. It is shown that in this case it is not required to evaluate the laws of distribution of failures and restorations of elements.
References
1. Shooman M. L. Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design. New York, John Wiley & Sons, 2002, 552 p.
2. Cherkesov G. N. Nadezhnost apparatno-programmnykh kompleksov: Uchebnoe posobie [Reliability of hardware and software systems: Study guide]. Saint Petersburg, Piter Publishing House, 2005, 479 p. (In Russian)
3. Gurov S. V. Analiz nadezhnosti tekhnicheskikh sistem s proizvolnymi zakonami raspredeleniy otkazov i vosstanovleniy [Analysis of The Reliability of Technical Systems with Arbitrary Laws of Distribution of Failures and Restorations], Kachestvo i nadezhnost izdeliy: sbornik statey [Quality and Reliability of Products: Collection of Articles], 1992. No. 2 (18), Pp. 3-37. (In Russian)
4. Prourzin V. A. Techno-Economic Risk in Designing Complex Systems: Algorithms for Analysis and Optimization, Automation and Remote Control, 2003, Vol. 64, No. 7, Pp. 1054-1062. DOI: 10.1023/A:1024773916089.
5. Prourzin V. A. The Dynamic Reliability Model under Variable Loads and Accelerated Tests, Journal of Machinery Manufacture and Reliability, 2020, Vol. 49, No. 5, Pp. 395-400. DOI: 10.3103/S1052618820050118.
6. Ryabinin I. A., Cherkesov G. N. Logiko-veroyatnostnye metody issledovaniya nadezhnosti strukturno-slozhnykh sistem [Logical-probabilistic methods for studying the reliability of structurally complex systems]. Moscow, Radio and Communications Publishers, 1981, 264 p. (In Russian)
7. Beichelt F., Franken P. Nadezhnost i tekhnicheskoe obsluzhivanie. Matematicheskiy podkhod [Reliability and maintenance. Mathematical approach]. Moscow, Radio and Communications Publishers, 1988, 392 p. (In Russian)
8. Leskovec J., Rajaraman A., Ullman J. D. Analiz bolshikh naborov dannykh [Mining of Massive Datasets]. Moscow, DMK Press, 2016, 498 p. (In Russian)
9. Barnett V., Lewis T. Outliers in Statistical Data. Third Edition. Chichester, John Wiley & Sons, 1994, 601 p.
10. Shevlyakov G. L., Vilchevski N. O. Robustness in Data Analysis: Criteria and Methods. Utrecht, VSP Publishers, 2002, 318 p.