DECREASE IN THE RISK OF ERRONEOUS CLASSIFICATION OF MULTIVARIATE STATISTICAL DATA DESCRIBING THE TECHNICAL CONDITION OF THE EQUIPMENT OF POWER SUPPLY SYSTEMS
Farhadzadeh E.M., Farzaliyev Y.Z., Muradaliyev A.Z.
Azerbaijan Scientific-Research and Design-Prospecting Institute of Energetic, AZ1012, Ave. H.Zardabi-94, e-mail: fem1939@rambler.ru
ABSTRACT
An objective estimate of the parameters of individual reliability is an indispensable condition for reducing the operational expenses for maintenance and repair of the equipment and devices of electric power systems. A method for decreasing the risk of erroneous classification of multivariate statistical data is offered. The method is based on simulation modeling and the theory of statistical hypothesis testing.
I. INTRODUCTION
Estimating the parameters of individual reliability of the equipment of power supply systems involves classifying the final population of multivariate statistical data on operation, tests and restoration of deterioration according to set versions of attributes (VA) [1].
VA reflect features of the design, the conditions of operation, and features of the occurrence of failures and of carrying out repairs of the equipment. The expediency of classification by each VA of the population is established by comparing the statistical function of distribution (s.f.d.) of the final population of statistical data, $F^*(X)$, with the s.f.d. of a sample of $n$ random variables from this population for the $i$-th version of attribute $V$, $F^*_{V,i}(X)$, where $V = 1, \dots, k$; $k$ is the number of attributes of the random variable $X$ (for example, the duration of emergency repair); $i = 1, \dots, r_k$; and $r_k$ is the number of versions of the $k$-th attribute. If the s.f.d. $F^*(X)$ and $F^*_{V,i}(X)$ differ non-randomly, in other words, if the sample $\{X\}_n$, where $n$ is the number of random variables in the sample, is not representative, then classification of the data when estimating the parameters of individual reliability is expedient, and vice versa. It should be noted that, unlike a sample from a general population (analogue: an infinite set of random variables with uniform distribution in the interval [0,1]), whose representativeness is set by some significance level $\alpha$, a sample from the final population of multivariate data for a set VA is, as a matter of fact, not random, and it can only appear representative. In particular, a sample can appear representative if the set VA is not significant for the data considered.
II. RECOMMENDED METHOD
The comparison of $F^*(X)$ and $F_V^*(X)$ is based on statistical modeling (by means of the computer program RAND) of $n$ pseudo-random numbers $\xi$ with uniform distribution in the interval [0,1], where $n$ equals the number of random variables in the sample.
An indispensable condition here is that the s.f.d. $F_V^*(\xi)$ be consistent with the uniform law of distribution $F_s(\xi)$, in other words, that the distinction between $F_s(\xi)$ and $F_V^*(\xi)$ be random. Obviously, the fact that the random numbers $\xi$ follow the uniform law does not at all imply that the s.f.d. $F_V^*(\xi)$ is consistent with the uniform law at the set significance level $\alpha$. Using, in the modeling of the statistical analogue of $F_V^*(X)$, an s.f.d. $F_V^*(\xi)$ that differs essentially from $F_s(\xi)$ leads to an erroneous increase in the greatest divergence of the distribution of this analogue from $F^*(X)$ and thereby to a growth in the probability of an erroneous decision when classifying the data.
When solving the problem of estimating the expediency of classifying multivariate data, the representativeness of the sample $\{\xi\}_n$ was checked by Kolmogorov's criterion [2]. According to this criterion, the sample $\{\xi\}_n$ is unrepresentative if
$$D_n > d_{n,(1-\alpha)} \tag{1}$$
where:
$$D_n = \max(D_n^+, D_n^-) \tag{2}$$
$$D_n^+ = \max_{1 \le i \le n}\{D_i^+\} \tag{3}$$
$$D_i^+ = \frac{i}{n} - \xi_i \tag{4}$$
$$D_n^- = \max_{1 \le i \le n}\{D_i^-\} \tag{5}$$
$$D_i^- = \xi_i - \frac{i-1}{n} \tag{6}$$
$\xi_i$ is the $i$-th value of the sample arranged in ascending order, and $d_{n,(1-\alpha)}$ is the critical value of the statistic $D_n$ provided that $F_s(\xi)$ and $F_V^*(\xi)$ differ randomly.
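Formulas (1) to (6) can be illustrated with a minimal Python sketch. The function name `kolmogorov_statistic` and the three-element sample are illustrative assumptions and do not come from the original calculations; the sample is sorted first because the $i$-th value is taken in ascending order.

```python
def kolmogorov_statistic(sample):
    """Return (D_n^+, D_n^-, D_n) for a sample checked against U[0, 1]."""
    xs = sorted(sample)          # xi_i is the i-th value in ascending order
    n = len(xs)
    # enumerate() counts from 0, so the 1-based index i equals k + 1
    d_plus = max((k + 1) / n - x for k, x in enumerate(xs))   # formulas (3), (4)
    d_minus = max(x - k / n for k, x in enumerate(xs))        # formulas (5), (6)
    return d_plus, d_minus, max(d_plus, d_minus)              # formula (2)


# Hypothetical sample, used only to exercise the function
d_plus, d_minus, d_n = kolmogorov_statistic([0.10, 0.35, 0.80])
print(round(d_plus, 4), round(d_minus, 4), round(d_n, 4))
```

The resulting value of `d_n` would then be compared with a tabulated critical value, which this sketch does not supply.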
In [3] it is noted that estimating $D_n$ by the formula
$$D_n = \max_{1 \le i \le n}\{|D_i^+|\} \tag{7}$$
leads to incorrect decisions on the relation between $F_s(\xi)$ and $F_V^*(\xi)$.
A similar remark can also be found in [4]. The reason for this discrepancy is not stated. When $n$ is not known in advance, a decrease in calculation time is achieved, according to [3], by applying the accurate approximation of Stephens, which reduces the tabulated critical values $d_{n,(1-\alpha)}$, depending on $n$ and $\alpha$, to a dependence on $\alpha$ only. The sample $\{\xi\}_n$ is unrepresentative if
$$A \cdot D_n > C_{1-\alpha} \tag{8}$$
where:
$$A = \sqrt{n} + 0.12 + \frac{0.11}{\sqrt{n}} \tag{9}$$
For example, for $n=4$ the value $A=2.175$; for $\alpha=0.1$ the critical value is $C_{1-\alpha}=1.224$, and for $\alpha=0.05$ it is $C_{1-\alpha}=1.358$.
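The arithmetic of formulas (8) and (9) can be checked with a short Python sketch. The function names are illustrative assumptions; the critical values are the two quoted in the text, and for $n=4$ the factor should come out as the value $A=2.175$ given above.

```python
import math


def stephens_factor(n):
    """Factor A of formula (9): sqrt(n) + 0.12 + 0.11 / sqrt(n)."""
    root = math.sqrt(n)
    return root + 0.12 + 0.11 / root


def is_unrepresentative(d_n, n, c_crit):
    """Criterion (8): the sample is rejected when A * D_n exceeds C_(1-alpha)."""
    return stephens_factor(n) * d_n > c_crit


# Critical values quoted in the text for alpha = 0.1 and alpha = 0.05
C_CRIT = {0.1: 1.224, 0.05: 1.358}

print(round(stephens_factor(4), 3))
```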
Application of the method of solving the «inverse problem», when it is known in advance that the sample $\{\xi\}_n$ is unrepresentative, has shown that, for the values $\alpha=0.05$ and $\alpha=0.1$ most often used in practice, criteria (1) and (8) establish the non-random character of the divergence between $F_s(\xi)$ and $F_V^*(\xi)$ at small $n$ only in those cases where it raises no doubt. To confirm this statement, consider the following example. Let random numbers $y$ have a uniform distribution $F_s(y)$ in the interval [0.5; 1]. A random sample $\{y\}_n$ with $n=4$ is given: {0.86346; 0.50672; 0.91424; 0.67210}. Let us check the assumption that this sample is representative of the uniform law of distribution of the random variable $\xi$ in the interval [0,1].
The results of the calculations are given in Table 1.
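Table 1
Example of estimating the representativeness of a sample
| $i$ | $F_s(y_i)$ | $i/n$ | $D_i^+$ | $D_i^-$ | Note |
|---|---|---|---|---|---|
| 1 | 0.507 | 0.25 | -0.257 | +0.506 | $D_n^+ = 0.086$; $D_n^- = 0.506$ |
| 2 | 0.672 | 0.5 | -0.172 | +0.422 | $D_n = 0.506$; $D_n < d_{4,0.9} = 0.565$ |
| 3 | 0.863 | 0.75 | -0.113 | +0.363 | $A \cdot D_n = 1.101$ |
| 4 | 0.914 | 1.00 | +0.086 | +0.164 | $A \cdot D_n < C_{0.9} = 1.224$ |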
As follows from Table 1, the sample $\{y\}_4$ does not contradict the assumption of representativeness with respect to $F_s(\xi)$ at $\alpha=0.1$.
These features, and some assumptions about the reasons for their occurrence [5], made it necessary to pass from the analysis of absolute values of the greatest divergence between the distributions $F_s(\xi)$ and $F_V^*(\xi)$ to the analysis of its actual, signed values ($St_n$). Here, by «the greatest divergence between $F_s(\xi)$ and $F_V^*(\xi)$» we understand the vertical distance between $F_s(\xi)$ and $F_V^*(\xi)$ that is greatest in modulus over $i = 1, \dots, n$.
The values of $St_n$ were calculated according to the algorithm whose enlarged block diagram is shown in Figure 1.
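[Block diagram. Block 1: modeling of $\xi_i$, $i = 1, \dots, n$. Block 2: formation of the s.f.d. $F_V^*(\xi)$. Block 3: calculation of $St_{n,i} = F_s(\xi_i) - F_V^*(\xi_i)$, $i = 1, \dots, n$.]
Fig. 1. Block diagram of the algorithm for calculating the greatest divergence between the distributions $F_s(\xi)$ and $F_V^*(\xi)$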
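The steps of Fig. 1 can be sketched in Python as a small Monte Carlo experiment. This is only a sketch under stated assumptions: the divergence at the $i$-th ordered value is taken as $\xi_i - i/n$ (the uniform law minus the step of the s.f.d.), the standard `random` module stands in for the RAND program, and the function names, the seed and the number of realizations are all hypothetical choices.

```python
import random


def greatest_signed_divergence(sample):
    """Signed divergence xi_i - i/n that is greatest in modulus (assumed convention)."""
    xs = sorted(sample)
    n = len(xs)
    diffs = [x - (k + 1) / n for k, x in enumerate(xs)]
    return max(diffs, key=abs)   # keep the sign of the largest |difference|


def simulate_stn(n, n_realizations, seed=0):
    """Blocks 1-3 of Fig. 1 repeated n_realizations times; returns sorted values."""
    rng = random.Random(seed)
    values = [greatest_signed_divergence([rng.random() for _ in range(n)])
              for _ in range(n_realizations)]
    values.sort()
    return values


def empirical_quantile(sorted_values, p):
    """Simple order-statistic estimate of the p-quantile."""
    index = min(len(sorted_values) - 1, int(p * len(sorted_values)))
    return sorted_values[index]


stn_values = simulate_stn(n=4, n_realizations=20000, seed=1)
print(empirical_quantile(stn_values, 0.05), empirical_quantile(stn_values, 0.95))
```

Keeping the sorted realizations makes it straightforward to read off any quantile of the simulated distribution of $St_n$ for a chosen $n$.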
Applying formulas of the type
$$St_n = \max_{1 \le i \le n}\left\{\xi_i - \frac{i}{n}\right\} \tag{10}$$
in computer calculations leads to erroneous results. For example, according to Table 1 the maximal value among the four realizations of $D_i^+$ is $D^+ = 0.086$, whereas the greatest vertical divergence between $F_s(\xi)$ and $F_V^*(\xi)$ equals $D^+ = -0.257$.
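The pitfall described above is easy to reproduce. The sketch below uses the $D_i^+$ column of Table 1 and contrasts a plain maximum with a maximum taken by modulus; the variable names are illustrative.

```python
# D_i^+ column of Table 1
d_plus_values = [-0.257, -0.172, -0.113, 0.086]

# A plain maximum picks the largest value, ignoring larger negative deviations
largest_value = max(d_plus_values)

# Taking the maximum by modulus keeps the sign of the greatest divergence
largest_by_modulus = max(d_plus_values, key=abs)

print(largest_value, largest_by_modulus)
```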
The results of processing the realizations of $St_n$ are presented in Table 2 and allow the following conclusions:
1. For $n>2$, the quantiles of the distribution $F^*(St_n)=\alpha$ are equal in magnitude and opposite in sign (the difference in sign is caused by the difference between formulas (4) and (10)) to the quantiles of the distribution $F(D_n)=2\alpha$ {see Table 16 in [2]};
2. The distribution $F^*(St_n)$ is asymmetrical. For illustration, Fig. 2 shows the s.f.d. $F^*(St_n)$ for some $n$. The assumption that the distribution $F(St_n)$ is symmetrical can explain the discrepancy between the probabilities of the practically equal quantiles of the distributions $F^*(St_n)$ and $F(D_n)$;
3. The smaller $n$ is, the larger in magnitude the negative values of $St_n$ are, since $St_n = (\xi_n - 1)$. In the experimental data, the least value of $St_n$ for $n=2$ was $St_n=-0.992$ and the greatest was $St_n=+0.489$, with suprema of the modulus equal to 1 and 0.5, respectively.
G.Tsitsiashvili - IMAGE RECOGNITION BY MULTIDIMENSIONAL INTERVALS 02 (29)
(Vol.8) 2013, June
Table 2
Some results of estimating the s.f.d. $F^*(St_n)$
$n$ \ $F^*(St_n)$ 0.025 0.05 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.95 0.975
2 -0.842 -0.775 -0.684 -0.551 -0.473 -0.149 -0.363 -0.304 -0.239 -0.060 0.184 0.285 0.343
3 -0.709 -0.635 -0.566 -0.471 -0.400 -0.335 -0.296 -0.252 -0.200 -0.145 0.231 0.299 0.372
4 -0.623 -0.567 -0.494 -0.414 -0.355 -0.302 -0.253 -0.217 -0.173 0.155 0.240 0.319 0.377
5 -0.567 -0.511 -0.449 -0.370 -0.318 -0.274 -0.232 -0.190 -0.147 0.164 0.246 0.309 0.360
6 -0.523 -0.469 -0.411 -0.338 -0.292 -0.252 -0.215 -0.173 -0.127 0.171 0.244 0.303 0.358
7 -0.481 -0.438 -0.384 -0.318 -0.274 -0.235 -0.201 -0.162 -0.113 0.165 0.235 0.290 0.342
11 -0.389 -0.353 -0.309 -0.255 -0.219 -0.189 -0.110 -0.129 -0.097 0.160 0.216 0.260 0.302
16 -0.33 -0.295 -0.258 -0.215 -0.184 -0.158 -0.134 -0.103 0.107 0.150 0.194 0.232 0.264
22 -0.280 -0.253 -0.221 -0.183 -0.157 -0.135 -0.113 -0.083 0.105 0.137 0.176 0.210 0.235
29 -0.246 -0.219 -0.193 -0.160 -0.138 -0.119 -0.099 -0.068 0.098 0.126 0.158 0.186 0.212
40 -0.208 -0.187 -0.164 -0.136 -0.119 -0.102 -0.084 -0.050 0.089 0.112 0.140 0.164 0.185
60 -0.173 -0.156 -0.137 -0.114 -0.097 -0.083 -0.069 0.054 0.077 0.096 0.118 0.138 0.155
90 -0.142 -0.127 -0.111 -0.092 -0.079 -0.068 -0.055 0.051 0.067 0.081 0.100 0.116 0.130
120 -0.122 -0.110 -0.096 -0.080 -0.068 -0.059 -0.047 0.047 0.060 0.072 0.089 0.102 0.114
150 -0.110 -0.099 -0.086 -0.071 -0.062 -0.053 -0.042 0.041 0.053 0.065 0.079 0.092 0.104
Fig. 2. S.f.d. $F^*(St_n)$ for some $n$
4. In the distribution $F^*(St_n)$, the lower ($\underline{St}_n$) and upper ($\overline{St}_n$) boundary values are distinguished for a significance level $\alpha$, i.e.
$$F^*(\underline{St}_n) = \alpha/2, \qquad F^*(\overline{St}_n) = 1 - \alpha/2 \tag{11}$$
5. It is established that if $F^*(St_n) < 0.25$ or $F^*(St_n) > 0.75$, i.e. if $\alpha < 0.5$,
$$\overline{St}_n = -\left(\frac{1}{n} + \underline{St}_n\right) \tag{12}$$
For example, for $n=4$ and $\alpha=0.10$, according to the distribution $F^*(St_n)$ (see Table 2), $\underline{St}_4 = -0.567$ and $\overline{St}_4 = +0.319$. At the same time, by formula (12), $\overline{St}_4 = -(0.25 - 0.567) = 0.317$. If $n=29$ and $\alpha=0.2$, then $\underline{St}_n = -0.193$ and $\overline{St}_n = 0.158$. By formula (12), $\overline{St}_n = -(0.034 - 0.193) = 0.159$.
Fig. 3 shows histograms of the distribution of negative and positive values of $St_n$ for $n=4$ and $n=29$.
Fig. 3. Histograms of the distribution of the greatest divergence between the distributions $F_s(\xi)$ and $F_V^*(\xi)$
As follows from Fig. 3, negative values of $St_n$ essentially exceed positive values both in relative number and in range of variation. It follows from item 3 that this is not accidental and does not indicate unrepresentative samples. As $n$ grows, the ratio of the numbers of negative and positive values of $St_n$ decreases and tends to unity. For $n=2$ negative values of $St_n$ make up 87.5%, for $n=29$ they make up 61%, and for $n=150$, 55%. Thus, even at $n=150$ the quantiles of the distribution $F^*(St_n)$ at $\alpha=0.05$ and $\alpha=0.95$ are not equal in magnitude: [-0.099; +0.092]. The histograms also explain the shape of the distributions $F^*(St_n)$ shown in Fig. 2.
Fig. 4 shows how the boundary values of the statistic $St_n$ change for some values of the s.f.d. $F^*(St_n)$. The criterion for checking the representativeness of a sample at significance level $\alpha$ then takes the form:
$$\underline{St}_n \le St_n \le \overline{St}_n \tag{13}$$
Fig. 4. Laws of change of the boundary values of the greatest divergence between the distributions $F_s(\xi)$ and $F_V^*(\xi)$
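Formula (12) and the interval criterion (13) can be combined in a few lines of Python. The function names are illustrative assumptions, the lower boundary is the Table 2 value for $n=4$ at $F^*(St_n)=0.05$, and the two test values of $St_n$ are hypothetical.

```python
def upper_boundary(n, lower):
    """Formula (12): upper boundary from the lower one, -(1/n + lower)."""
    return -(1.0 / n + lower)


def within_interval(st_n, lower, upper):
    """Criterion (13): the sample is accepted when lower <= St_n <= upper."""
    return lower <= st_n <= upper


lower_4 = -0.567                      # Table 2, n = 4, F*(St_n) = 0.05
upper_4 = upper_boundary(4, lower_4)  # compare with the tabulated +0.319

print(round(upper_4, 3))
print(within_interval(0.2, lower_4, upper_4), within_interval(-0.6, lower_4, upper_4))
```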
Let us denote positive values of $St_n$ by $St_n^+$ and negative values by $St_n^-$. In view of item 1 and equation (12), the sample $\{\xi\}_n$ can be accepted as representative at a significance level $\alpha<0.5$ if
$$|St_n^-| < d_{n,(1-2\alpha)}, \qquad St_n^+ + \frac{1}{n} < d_{n,(1-2\alpha)} \tag{14}$$
Accordingly, the boundary values of criterion (13) for a significance level $\alpha$ can be presented as
$$|\underline{St}_n| = d_{n,(1-2\alpha)}, \qquad \overline{St}_n + \frac{1}{n} = d_{n,(1-2\alpha)} \tag{15}$$
Here attention should be paid to the discrepancy between the significance levels of $St_n$ and $d_{n,(1-2\alpha)}$.
Returning to the data of Table 1, it is easy to see that the interval criterion (13), which takes into account the sign of the greatest divergence $St_n$, is also unable to establish the unrepresentative character of the sample $\{y\}_n$.
It is known that the risk of an erroneous decision when classifying data can be decreased by taking into account not only errors of type I but also errors of type II [4].
The simplest solution of this problem would be to compare $St_n$ between $F_s(\xi)$ and $F_V^*(\xi)$ with the boundary values of the interval $[\underline{St}_n; \overline{St}_n]$ corresponding to the significance level $\alpha=0.5$. This is the limiting case of the values of $\alpha$, when $St_n=0$. The type II error is then $\beta=(1-\alpha)$, i.e. it also equals 0.5. If $\alpha$ is taken less than 0.5, the type II error $\beta$ increases. In real conditions:
- the configurations of $F_s(\xi)$ and $F_V^*(\xi)$ are different, i.e. $St_n \ne 0$;
- for the same value of $St_n$ the sum $(\alpha+\beta)$ is less than or equal to unity;
- as $St_n$ increases, $(\alpha+\beta)$ decreases, reaches a minimum (at $St_{n,opt}$) and then increases;
- if $St_n < St_{n,opt}$, then $\alpha>\beta$; if $St_n > St_{n,opt}$, then $\alpha<\beta$;
- the difference between $\alpha$ and $\beta$ increases as the divergence between $St_n$ and $St_{n,opt}$ grows.
Comparing the realizations of $St_n$ with the boundary values $\underline{St}_n$ and $\overline{St}_n$ calculated, respectively, for $F^*(\underline{St}_n) = 0.25$ and $F^*(\overline{St}_n) = 0.75$ makes it unnecessary to calculate the s.f.d. that defines the type II error $\beta$, which can be counted among the advantages of this approach. Its shortcomings are the need to double the number of modeled realizations of the distribution $F_V^*(\xi)$, an unjustified decrease in the spread of $St_n$, and its heuristic character.
The algorithm for calculating the s.f.d. that describes the greatest deviation between $F_s(\xi)$ and $F_V^*(\xi)$, provided that $F_V^*(\xi)$ is unrepresentative, consists of the following sequence of calculations:
1. The next sample of $n$ random numbers (out of the required $N$ realizations) is modeled;
2. The s.f.d. $F_V^*(\xi)$ is formed;
3. The greatest divergence between $F_s(\xi)$ and $F_V^*(\xi)$ is determined. Denote this value by $St_{n,e}$, where the index «e» corresponds to the empirical character of the sample.
Having determined the statistical characteristics of this sample {$F_V^*(\xi)$ and $St_{n,e}$}, we proceed to forming the s.f.d. $F^*(St_n^*)$ from realizations of the greatest divergence between the distribution function $F_s(\xi)$ and a set of $N$ s.f.d. $F_V^*(y)$ modeled from the s.f.d. $F_V^*(\xi)$. For this purpose:
4. From the s.f.d. $F_V^*(\xi)$ the distribution
$$F_V(y) = \begin{cases} 0, & y < y_1 \\ \dfrac{i-1}{n+1} + \dfrac{y - y_i}{(y_{i+1} - y_i)(n+1)}, & y_1 \le y \le y_{n+1} \\ 1, & y > y_{n+1} \end{cases} \tag{16}$$
is formed;
5. Using the standard program RAND, a random number $\xi$ with uniform distribution in the interval [0,1] is modeled;
6. From distribution (16), the random number $y$ corresponding to the probability $\xi$ is calculated by the formula
$$y = y_i + (y_{i+1} - y_i)\,[\xi (n+1) - (i-1)] \tag{17}$$
with $i = 1, \dots, (n+1)$;
7. Items 5 and 6 are repeated $n$ times;
8. From the sample $\{y\}_n$ the s.f.d. $F_V^*(y)$ is constructed;
9. The greatest divergence between $F_s(\xi)$ and $F_V^*(y)$ is determined. Denote it by $St_n^*$;
10. Items 5 to 9 are repeated $N$ times;
11. The average value of the random variable $St_n^*$ is determined. Denote it by $M^*(St_n^*)$;
12. From the $N$ values of $St_n^*$ the s.f.d. $F^*(St_n^*)$ is formed.
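Items 4 to 12 can be sketched in Python as follows. This is a sketch under loudly stated assumptions rather than the authors' program: the knots of distribution (16) are assumed to be the sorted sample extended by the two ends of the interval (`lo`, `hi`), so that $n$ sample points give $n+1$ equiprobable intervals; the divergence is taken as $y_i - i/n$; and all function names, the seed and the number of realizations are hypothetical.

```python
import random


def inverse_piecewise_cdf(knots, xi):
    """Formula (17)-style inversion over equiprobable intervals between knots."""
    m = len(knots) - 1                 # number of intervals (n + 1 by assumption)
    k = min(int(xi * m), m - 1)        # zero-based interval index, i.e. i - 1
    return knots[k] + (knots[k + 1] - knots[k]) * (xi * m - k)


def greatest_signed_divergence(sample):
    """Signed divergence y_i - i/n that is greatest in modulus (assumed convention)."""
    xs = sorted(sample)
    n = len(xs)
    return max((x - (k + 1) / n for k, x in enumerate(xs)), key=abs)


def simulate_alternative(sample, lo, hi, n_realizations, seed=0):
    """Items 5-12: resample from the sample-based distribution N times."""
    rng = random.Random(seed)
    knots = [lo] + sorted(sample) + [hi]   # assumed extension by interval ends
    n = len(sample)
    values = []
    for _ in range(n_realizations):
        ys = [inverse_piecewise_cdf(knots, rng.random()) for _ in range(n)]
        values.append(greatest_signed_divergence(ys))
    values.sort()                          # sorted values give the s.f.d.
    return sum(values) / len(values), values


mean_st, st_values = simulate_alternative(
    [0.86346, 0.50672, 0.91424, 0.67210], lo=0.5, hi=1.0,
    n_realizations=5000, seed=1)
print(round(mean_st, 3))
```

Because the knot placement is an assumption, the mean produced by this sketch need not coincide with the value reported in the paper.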
If we assume that the distribution $F^*(St_n^*)$ corresponds to the normal law, and that the average value $M^*(St_n^*)$ equals $St_{n,e}$ and corresponds to $F^*(St_n^*) = \beta = 0.5$, then for all realizations $St_{n,e}$ whose probability satisfies $0.1<\alpha<0.5$ preference should be given to assumption H2. However, the assumption of a normal law for the distribution $F^*(St_n^*)$ does not correspond to reality. As an example, Fig. 5 shows the histogram of the distribution of realizations of $St_n^*$ for the s.f.d. $F_V^*(y)$ given in Table 1.
Fig. 5. Histogram of realizations of $St_n^*$
Let us introduce two assumptions: H1: the sample $\{y\}_n$ reflects the law of distribution $F_s(\xi)$; H2: the sample $\{y\}_n$ does not reflect the law of distribution $F_s(\xi)$.
The recommended decision-making algorithm depends on the relation between the average values of the realizations of $St_n$ and $St_n^*$. In this connection, denote the distribution describing the risk of an erroneous decision as a function of $St_n$ by $Sh_1(St_n)$, and as a function of $St_n^*$ by $Sh_2(St_n)$. For $M^*(St_n) < M^*(St_n^*)$:
$$Sh_1(St_n) = 1 - F^*(St_n), \qquad Sh_2(St_n) = F^*(St_n^*) \tag{18}$$
The decision-making algorithm takes the form:
If $St_{n,e} > \overline{St}_n$, then H2; else if $St_{n,e} < \underline{St}_n$, then H1; else if $Sh_1(St_n) \ll Sh_2(St_n)$, then H2; otherwise H1. (19)
For $M^*(St_n) > M^*(St_n^*)$:
$$Sh_1(St_n) = 1 - F^*(St_n^*), \qquad Sh_2(St_n) = F^*(St_n) \tag{20}$$
The decision-making algorithm takes the form:
If $St_{n,e} > \overline{St}_n$, then H1; else if $St_{n,e} < \underline{St}_n$, then H2; else if $Sh_1(St_n) \gg Sh_2(St_n)$, then H2; otherwise H1. (21)
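The two decision rules (19) and (21) can be written as one small Python function. The function and parameter names are illustrative, and the relations «much less» and «much greater» are interpreted here through a hypothetical factor `ratio`, which is an assumption of this sketch and not a value given in the text; the numbers in the usage line are illustrative.

```python
def decide(st_e, lower, upper, sh1, sh2, h2_mean_is_larger=True, ratio=2.0):
    """Return "H1" or "H2" following rules (19) and (21).

    h2_mean_is_larger=True corresponds to M*(St_n) < M*(St_n*), rule (19);
    False corresponds to M*(St_n) > M*(St_n*), rule (21).
    `ratio` is a hypothetical threshold standing in for "<<" and ">>".
    """
    if h2_mean_is_larger:
        if st_e > upper:
            return "H2"
        if st_e < lower:
            return "H1"
        return "H2" if sh1 * ratio < sh2 else "H1"   # Sh1 << Sh2
    if st_e > upper:
        return "H1"
    if st_e < lower:
        return "H2"
    return "H2" if sh1 > sh2 * ratio else "H1"       # Sh1 >> Sh2


# Illustrative values: St_e inside the interval, Sh1 much smaller than Sh2
print(decide(st_e=0.257, lower=-0.567, upper=0.319, sh1=0.09, sh2=0.42))
```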
For illustration, Fig. 6 shows the distribution functions $Sh_1(St_n)$ and $Sh_2(St_n)$ calculated for the data of Table 1.
Fig. 6. Laws of change of the s.f.d. $F^*(St_n)$ and $F^*(St_n^*)$ for $n=4$: a - s.f.d. $F^*(St_n)$; b - s.f.d. $F^*(St_n^*)$
Since $M^*(St_n)$ turned out to be less than $M^*(St_n^*)$, the distribution functions $Sh_1(St_n)$ and $Sh_2(St_n)$ were calculated by formula (18).
Table 3 systematizes the numerical values of the parameters that define the result of the decision. As follows from Table 3, since $Sh_1(St_{n,e}) \ll Sh_2(St_{n,e})$, preference, according to (19), is given to assumption H2. In other words, bringing the type I and type II errors into the statistical analysis makes it possible to distinguish unrepresentative samples.
Table 3
The basic parameters of calculation
| Parameter | Designation | Estimate |
|---|---|---|
| 1. Size of the random sample | $n$ | 4 |
| 2. Average value of the greatest divergence between the distributions $F_s(\xi)$ and $F_V^*(\xi)$ | $M^*(St_n)$ | -0.207 |
| 3. Average value of the greatest divergence between the distributions $F_s(\xi)$ and $F_V^*(y)$ | $M^*(St_n^*)$ | 0.292 |
| 4. Empirical value of the greatest divergence between the distributions $F_s(\xi)$ and $F_V^*(\xi)$ | $St_{n,e}$ | 0.257 |
| 5. Boundary values of the interval of change of $St_n$ with $\alpha=0.1$: upper; lower | $\overline{St}_n$; $\underline{St}_n$ | 0.319; -0.567 |
| 6. Boundary values of the interval of change of $St_n^*$ with $\alpha=0.01$: upper; lower | $\overline{St}_n^*$; $\underline{St}_n^*$ | 0.544; 0.292 |
| 7. Probability of $St_{n,e}$ on the s.f.d. $[1 - F^*(St_n)]$; on the s.f.d. $F^*(St_n^*)$ | $Sh_1(St_{n,e})$; $Sh_2(St_{n,e})$ | 0.09; 0.42 |
| 8. The assumption accepted | H | H2 |
It should be noted that using the distribution $F^*(St_n^*)$ to estimate the character of the divergence between the distributions $F_s(\xi)$ and $F_V^*(y)$ for all realizations of samples is unjustified, since for some of them, for example at $Sh_1(St_n)>0.5$, the sample $\{y\}_n$ is most probably representative, and at $Sh_1(St_n)<0.1$ it is unrepresentative.
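Therefore it is suggested that the s.f.d. $F^*(St_n^*)$ be calculated under the following conditions:
1. $M^*(St_n) < M^*(St_n^*)$:
$$St_{n,0.05} < St_{n,e} < St_{n,0.95}, \qquad St_{n,e} < St_{n,0.25} \ \text{or} \ St_{n,e} > St_{n,0.75} \tag{22}$$
2. $M^*(St_n) > M^*(St_n^*)$:
$$St_{n,0.05} < St_{n,e} < St_{n,0.95}, \qquad St_{n,e} < St_{n,0.25} \ \text{or} \ St_{n,e} > St_{n,0.75} \tag{23}$$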
Critical values of the statistic $St_n$ for $F^*(St_n)=0.25$ and average values $M^*(St_n)$ for $N=25000$ realizations of $St_n$ and some $n$ are given in Table 4.
Table 4
Lower boundary ($\underline{St}_n$) and average $M^*(St_n)$ values of the statistic $St_n$
| No. | $n$ | $\underline{St}_n$ ($F^*(St_n)=0.25$) | $M^*(St_n)$ |
|---|---|---|---|
| 1 | 2 | -0.498 | -0.33 |
| 2 | 3 | -0.435 | -0.254 |
| 3 | 4 | -0.385 | -0.207 |
| 4 | 5 | -0.343 | -0.173 |
| 5 | 6 | -0.312 | -0.146 |
| 6 | 7 | -0.294 | -0.133 |
| 7 | 11 | -0.235 | -0.087 |
| 8 | 16 | -0.198 | -0.063 |
| 9 | 22 | -0.17 | -0.047 |
| 10 | 29 | -0.149 | -0.037 |
| 11 | 40 | -0.127 | -0.027 |
| 12 | 60 | -0.105 | -0.019 |
| 13 | 90 | -0.086 | -0.012 |
| 14 | 120 | -0.074 | -0.00- |
| 15 | 150 | -0.067 | -0.008 |
Computer technology for estimating the parameters of individual reliability assumes automation of the process of classifying multivariate data. For this purpose, the boundary values of the statistic $St_n$ must be entered as initial data. In this connection, by analogy with formulas (8) and (9), it was of interest to estimate the dependence of the boundary values of $St_n$ on $n$.
The regression equations, obtained with the standard program of power transformation, are characterized by a coefficient of determination $R^2 > 0.999$ and, for some $Sh_1(St_n)=\alpha/2$, take the form:
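- for $Sh_1(\overline{St}_n) = 0.025$: $\overline{St}_n = (1.23 n^{0.52} - 1)/n = (B_1 n^{0.52} - 1)/n$ (24)
and $Sh_1(\underline{St}_n) = 0.975$: $\underline{St}_n = -1.23 n^{-0.48} = -B_1/n^{0.48}$ (25)
- for $Sh_1(\overline{St}_n) = 0.05$: $\overline{St}_n = (1.12 n^{0.52} - 1)/n = (B_2 n^{0.52} - 1)/n$ (26)
and $Sh_1(\underline{St}_n) = 0.95$: $\underline{St}_n = -1.12 n^{-0.48} = -B_2/n^{0.48}$ (27)
- for $Sh_1(\overline{St}_n) = 0.1$: $\overline{St}_n = (0.98 n^{0.52} - 1)/n = (B_3 n^{0.52} - 1)/n$ (28)
and $Sh_1(\underline{St}_n) = 0.9$: $\underline{St}_n = -0.98 n^{-0.48} = -B_3/n^{0.48}$ (29)
- for $Sh_1(\overline{St}_n) = 0.25$: $\overline{St}_n = (0.75 n^{0.52} - 1)/n = (B_4 n^{0.52} - 1)/n$ (30)
and $Sh_1(\underline{St}_n) = 0.75$: $\underline{St}_n = -0.75 n^{-0.48} = -B_4/n^{0.48}$ (31)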
The equation of the dependence of the constant coefficients $B$ on $\alpha$, with coefficient of determination $R^2 > 0.993$, has the form:
$$B = 0.652\,[Sh_1(\overline{St}_n)]^{-0.175} \tag{32}$$
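Thus, in view of equation (12), the lower and upper boundary values of the statistic $St_n$ are calculated by the following formulas:
$$\underline{St}_n = -0.652\,[Sh_1(\overline{St}_n)]^{-0.175}\, n^{-0.48}, \qquad \overline{St}_n = -\underline{St}_n - \frac{1}{n} \tag{33}$$
For practical calculations of $\underline{St}_n$ and $\overline{St}_n$, formulas (27) and (12) are used most often.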
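The regression formulas (32) and (33) are simple to evaluate numerically. The Python sketch below uses illustrative function names; the coefficients are those quoted in the text, and the result for $n=4$ can be compared with the Table 2 values of about -0.567 and +0.319.

```python
def coefficient_b(sh1):
    """Formula (32): B = 0.652 * Sh1 ** (-0.175)."""
    return 0.652 * sh1 ** -0.175


def boundaries(n, sh1):
    """Formula (33): lower boundary from the regression, upper from formula (12)."""
    lower = -coefficient_b(sh1) * n ** -0.48
    upper = -(1.0 / n + lower)
    return lower, upper


lower, upper = boundaries(n=4, sh1=0.05)
print(round(coefficient_b(0.05), 3), round(lower, 3), round(upper, 3))
```

Since the coefficients are regression estimates, small differences from the tabulated quantiles are to be expected.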
CONCLUSIONS
1. An interval nonparametric criterion is offered for checking the conformity of samples of $n$ pseudo-random numbers to the uniform law in the interval [0,1];
2. The criterion is based on the difference between the distributions of positive and negative values of the greatest divergence between the distributions $F_s(\xi)$ and $F_V^*(\xi)$;
3. The transition from the statistic $D_n$ to the statistic $St_n$ makes it possible not only to simplify the algorithm for calculating the greatest divergence between $F_s(\xi)$ and $F_V^*(\xi)$, but also to assess the possibility of using the statistic $St_n$ when estimating the greatest divergence between the s.f.d. $F^*(X)$ and $F_V^*(X)$, and to estimate the risk of an erroneous decision $Sh_1(St_n)$;
4. An increase in the accuracy of checking the conformity of the distribution $F_V^*(\xi)$ to the uniform law is achieved by practical implementation of the recommended decision-making algorithm, which takes into account not only the type I error but also the type II error.
REFERENCES
1. Farhadzadeh E.M., Muradaliyev A.Z., Farzaliyev Y.Z. Quantitative estimation of individual reliability of the equipment and devices of the power supply system. Journal «Reliability: Theory & Applications», R&RATA, Vol.7, No.4 (27), December 2012, USA, p.53-62.
2. Gnedenko B.V., Beljaev J.K., Solovyov A.D. Mathematical methods in the theory of reliability. "Science", 1965, 524 p.
3. Kelton W., Law A. Simulation modeling. Classics of CS. 3rd ed. St. Petersburg: Piter; Kiev: BHV Publishing Group, 2004, 847 p.
4. Ryabinin I.A. Fundamentals of the theory and calculation of reliability of ship electric power systems. Shipbuilding, 1971, 454 p.
5. Farhadzadeh E.M. Technique of a statistical estimation of critical values of the divergence of an empirical distribution from a theoretical one. «Methodical questions of research of reliability of large power systems», SEI SO SA USSR, 16, Grozny, 1978, p.39-49.