DOI: https://doi.org/10.21323/2414-438X-2022-7-1-42-57
Available online at https://www.meatjournal.ru/jour
Review article Open Access
METHODS FOR NONPARAMETRIC STATISTICS IN SCIENTIFIC RESEARCH. OVERVIEW. PART 2
Received 15.12.2021 Accepted in revised 22.02.2022 Accepted for publication 25.03.2022
Marina A. Nikitina*, Irina M. Chernukha
V. M. Gorbatov Federal Research Center for Food Systems of Russian Academy of Sciences, Moscow, Russia
Keywords: nonparametric statistics, null and alternative hypotheses, type I error, type II error, goodness-of-fit tests, tests for homogeneity
Abstract
The use of nonparametric methods in scientific research provides a number of advantages. The most important of these advantages are versatility and a wide range of such methods. There are no strong assumptions associated with nonparametric tests, which means that there is little chance of assumptions being violated, i. e. the result is reliable and valid. Nonparametric tests are widely used because they may be applied to experiments for which it is not possible to obtain quantitative indicators (descriptive studies) and to small samples. The second part of the article describes nonparametric goodness-of-fit tests, i. e. Pearson's test, Kolmogorov test, as well as tests for homogeneity, i. e. chi-squared test and Kolmogorov-Smirnov test. Chi-squared test is based on a comparison between the empirical (experimental) frequencies of the indicator under study and the theoretical frequencies of the normal distribution. Kolmogorov-Smirnov test is based on the same principle as Pearson's chi-squared test, but involves comparing the accumulated frequencies of the experimental and theoretical distributions. Pearson's chi-squared test and Kolmogorov test may also be used to compare two empirical distributions for the significance of differences between them. Kolmogorov test based on the accumulation of empirical frequencies is more sensitive to differences and captures those subtle nuances that are not available in Pearson's chi-squared test. Typical errors in the application of these tests are analyzed. Examples are given, and step-by-step application of each test is described. With nonparametric methods, researcher receives a working tool for statistical analysis of the results.
For citation: Nikitina, M.A., Chernukha, I.M. (2022). Methods for nonparametric statistics in scientific research. Overview. Part 2. Theory and practice of meat processing, 7(1), 42-57. https://doi.org/10.21323/2414-438X-2022-7-1-42-57
Funding:
The research was supported by state assignment of V. M. Gorbatov Federal Research Centre for Food Systems of RAS, scientific research No. FNEN-2019-0008.
Introduction
German philosopher, psychologist and teacher Johann Friedrich Herbart at the beginning of the 19th century wrote: "Any theory trying to be consistent with experience, first of all, must be continued until it accepts quantitative determinations that arise in experience or lie in its foundation. If not, it hangs in the air, exposed to every wind of doubt and being unable to contact with other, already strengthened opinions".
Thus, the researcher, having received data during the experiment, must process them correctly using mathematical methods in order to draw a correct and reasonable conclusion.
As a rule, researchers use methods of parametric statistics, which is not always correct. Many parametric methods have direct analogues in nonparametric statistics. For example, Student's t-test and analysis of variance determine the significance of differences in mean values for two or more groups, while Mann-Whitney U-test determines the significance of differences in the average rank for two groups; Pearson's correlation coefficient allows determining the linear relationship between two numerical indicators, while Spearman rank correlation coefficient allows determining the linear relationship between the ranks of two indicators. In some cases, a parametric method has no direct nonparametric analogue.
Nonparametric methods of mathematical statistics do not require knowledge of the functional form for the theoretical distribution. The name "nonparametric methods" itself emphasizes their difference from classical (parametric) methods, in which it is assumed that the unknown theoretical distribution belongs to some family that depends on a finite number of parameters (for example, the family of normal distributions), and which allow estimating unknown values of these parameters based on the results of observations and testing certain hypotheses regarding their values [1].
Common characteristics for most nonparametric methods [2,3] are: 1) fewer assumptions about the type of distribution; 2) the sample size is less strict; 3) the measurement may be nominal or ordinal; 4) independence of randomly selected observations, except for paired ones; 5) the focus is on the ranking order or data frequency; 6) hypotheses are expressed regarding the ranks, medians or data frequency.
Based on the practice of statistical data analysis, there are three main spheres of nonparametric statistics [4]:
Copyright © 2022, Nikitina et al. This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons. org/licenses/by/4.0/), allowing third parties to copy and redistribute the material in any medium or format and to remix, transform, and build upon the material for any purpose, even commercially, provided the original work is properly cited and states its license.
• sphere at the junction of parametric and nonparametric methods;
• rank statistical methods;
• nonparametric estimates for functions, primarily distribution density, regression dependence, as well as statistics used in classification theory.
In the first part of the article [5], a review of simple nonparametric methods is given. Two groups of nonparametric tests are considered: 1) to identify differences in the indicator distribution (Rosenbaum Q-test, Mann-Whitney U-test); 2) estimates of the significance for shift in the values of the studied indicator (sign G-test, Wilcoxon T-test).
In the second part of the article, nonparametric tests for testing hypotheses of distribution type (Pearson's chi-squared test, Kolmogorov test) and nonparametric tests for testing hypotheses of homogeneity (Pearson's chi-squared test for homogeneity, Kolmogorov-Smirnov test) are considered.
The purpose of the article is to give a working tool for solving specific research and applied problems using methods of nonparametric statistics.
Materials and methods
The materials of the study are recent publications in the statistical analysis of which methods of nonparametric statistics are used, i. e. goodness-of-fit tests (Kolmogorov test, Kolmogorov-Smirnov test, Pearson's chi-squared test).
Goodness-of-fit tests
It is known that one of the most important tasks for mathematical statistics is the establishment of a theoretical law of distribution for a random variable characterizing the studied indicator, based on empirical distribution. The solution of this problem allows: 1) choosing the right method of statistical data processing; 2) determining the type of model that describes the relationship between the analyzed indicators.
Goodness-of-fit tests are used to check the agreement between the experimental data and the theoretical model. So, a goodness-of-fit test is a test of a hypothesis about an assumed distribution law [6].
The researcher states two hypotheses: null hypothesis (H0) and alternative hypothesis (H1). Next, the hypotheses are tested using various tests.
H0: The resulting empirical indicator distribution does not differ from the theoretical distribution (normal, uniform, exponential, etc.).
H1: The resulting empirical distribution of the indicator differs from the theoretical distribution.
To test the null hypothesis H0, some random variable U is chosen, which characterizes the disagreement between the theoretical and empirical distributions. Its distribution law is known for sufficiently large n and almost does not depend on the distribution law of the random variable X.
Knowing the distribution law of the random variable U, a critical value Uα can be found such that, if the null hypothesis H0 is true, the probability that U assumes a value greater than Uα is small, i. e. P(U > Uα) = α, where α is the test significance level.
If the value observed in the experiment Uexp > Uα, i. e. it falls into the critical region, this means that such large U values are practically impossible under H0 and contradict it. In this case, the hypothesis H0 is rejected.
If Uexp < Uα, then the difference between the empirical and theoretical distributions is insignificant, and the hypothesis H0 may be considered as not contradicting the experimental data.
In this case, the researcher can make two types of errors when testing hypotheses: type I error and type II error [6].
Type I error. If we reject the null hypothesis H0 (i. e., we consider the null hypothesis H0 is false), while in fact the null hypothesis H0 is true, then the researcher makes an error consisting in the incorrect rejection of the null hypothesis.
Type II error. If we accept the null hypothesis H0 (i. e., we do not agree with the alternative hypothesis H1), while in fact the null hypothesis H0 is false, then the researcher makes an error consisting in incorrect acceptance of the null hypothesis.
It is worth noting that the probability of making a type I error is established quite easily, because it is equal to the significance level α, while for type II errors, it must be specially calculated.
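The statement that the type I error probability equals the significance level can be checked by simulation. The sketch below is an illustration only, not part of the original article: it repeatedly draws samples for which H0 is true by construction (a fair six-sided die, a hypothetical setting), applies the chi-squared decision rule, and counts how often H0 is wrongly rejected. The critical value 11.070 is the tabulated chi-squared value for 5 degrees of freedom at a 0.05 level; the function names, sample size and number of trials are assumptions.

```python
import random

def chi2_stat(observed, expected):
    """Sum of squared deviations of observed from expected counts, scaled by expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def type_i_error_rate(n_rolls=120, n_trials=2000, chi2_cr=11.070, seed=1):
    """Share of trials in which a true H0 (fair die) is rejected."""
    rng = random.Random(seed)
    expected = [n_rolls / 6] * 6          # theoretical frequencies under H0
    rejections = 0
    for _ in range(n_trials):
        counts = [0] * 6
        for _ in range(n_rolls):
            counts[rng.randrange(6)] += 1  # H0 holds: every face is equally likely
        if chi2_stat(counts, expected) > chi2_cr:
            rejections += 1                # type I error: true H0 rejected
    return rejections / n_trials

rate = type_i_error_rate()
print(f"Estimated type I error rate: {rate:.3f}")
```

Under these assumptions the estimated rate should land near 0.05, while estimating the type II error rate would require fixing a specific alternative distribution.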
Pearson's goodness-of-fit test or Pearson's chi-squared test
Pearson's goodness-of-fit test (or Pearson's chi-squared test) is most commonly used to test the hypothesis that a certain sample belongs to a theoretical distribution law [7,8].
Given data for the problem: let there be a sample of values of a random variable X of size n: $x_1, x_2, \ldots, x_k$ and a set of corresponding frequencies $m_1, m_2, \ldots, m_k$ (k is the number of partition intervals). As a measure of difference between the empirical and theoretical distributions, the value $\chi^2$ is taken, which is equal to the sum of the squared deviations of the relative frequencies $\frac{m_i}{n}$ from the probabilities $p_i$ calculated from the assumed distribution and taken with a certain coefficient $c_i$:
$$\chi^2 = \sum_{i=1}^{k} c_i \left(\frac{m_i}{n} - p_i\right)^2 \qquad (1)$$
The coefficient $c_i$ is chosen in such a way that for the same deviation $\left(\frac{m_i}{n} - p_i\right)$, the deviations at which $p_i$ is small have more weight, and the deviations at which $p_i$ is large have less weight. Therefore, the ratio $\frac{n}{p_i}$ is taken as $c_i$. We obtain the measure of difference of the following form:
$$\chi^2 = \sum_{i=1}^{k} \frac{n}{p_i}\left(\frac{m_i}{n} - p_i\right)^2 = \sum_{i=1}^{k} \frac{m_i^2}{n p_i} - n = \sum_{i=1}^{k} \frac{\left(m_i - m_i^{theor}\right)^2}{m_i^{theor}} = \sum_{i=1}^{k} \frac{m_i^2}{m_i^{theor}} - n, \qquad (2)$$
where $m_i^{theor} = n p_i$, so that, when $n \to \infty$, the sample distribution of $\chi^2$ tends to the limit $\chi^2$ distribution with the number of degrees of freedom $\nu = k - r - 1$, where r is the number of parameters of the hypothetical probability distribution estimated from the sample data. Numbers $m_i$ and $m_i^{theor}$ are called empirical and theoretical frequency respectively.
Application of Pearson's chi-squared test
1. The measure of difference between empirical and theoretical frequencies is determined by the formula (2), and the experimental value of the test $\chi^2_{exp}$ is calculated.
2. For the chosen significance level α, using the table of the $\chi^2$ distribution, the critical value $\chi^2_{cr}$ is found with the number of degrees of freedom $\nu = k - r - 1$.
3. If the experimental value $\chi^2_{exp}$ is greater than the critical value, i.e. if $\chi^2_{exp} > \chi^2_{cr}$, then the null hypothesis H0 is rejected, and if $\chi^2_{exp} < \chi^2_{cr}$, the null hypothesis H0 does not contradict the experimental data.
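The statistic can be computed either from the squared deviations of empirical from theoretical frequencies or from the shortcut form that subtracts n from the sum of squared empirical frequencies divided by theoretical ones. The following sketch checks on hypothetical data that both forms agree and then applies the decision rule. The frequencies, probabilities and function names are assumptions made for illustration; the critical value 9.488 is the tabulated chi-squared value for 4 degrees of freedom at the 0.05 level.

```python
def chi2_deviation_form(m, m_theor):
    """Sum over cells of (m_i - m_i_theor)^2 / m_i_theor."""
    return sum((mi - ti) ** 2 / ti for mi, ti in zip(m, m_theor))

def chi2_shortcut_form(m, m_theor):
    """Sum over cells of m_i^2 / m_i_theor, minus n (valid when sum(m_theor) == n)."""
    n = sum(m)
    return sum(mi ** 2 / ti for mi, ti in zip(m, m_theor)) - n

# Hypothetical empirical frequencies and fully specified cell probabilities
m = [12, 18, 38, 24, 8]
p = [0.1, 0.2, 0.4, 0.2, 0.1]
n = sum(m)
m_theor = [n * pi for pi in p]        # theoretical frequencies n * p_i

chi2_a = chi2_deviation_form(m, m_theor)
chi2_b = chi2_shortcut_form(m, m_theor)

# No parameters are estimated from the sample here (r = 0), so nu = k - r - 1 = 4
chi2_cr = 9.488
reject_h0 = chi2_a > chi2_cr
print(round(chi2_a, 4), round(chi2_b, 4), reject_h0)
```

The two forms coincide only because the theoretical frequencies sum to n; if the cells do not cover the whole range of the variable, the shortcut form should not be used.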
Limitations of Pearson's chi-squared test
1. Sample size must be large enough: n ≥ 30.
2. The theoretical frequency for each cell should not be less than 5.
3. The selected ranks should cover the entire range of the indicator's variability. Classification into ranks should be the same in all compared distributions.
4. Ranks should be non-overlapping.
Testing the hypothesis about the normal distribution of the general population
1. Based on a sample of size n, arrange the interval statistical array by classification of the given data into k ranges $[a_i; a_{i+1})$ with the corresponding frequencies $m_i$. Rearrange the interval statistical array into a statistical array by replacing each range $[a_i; a_{i+1})$ with its mean value $x_i = \frac{a_i + a_{i+1}}{2}$. Now we have Table 1.
Table 1. Interval statistical array
Ranges for observed values of a random variable X | $[a_1; a_2)$ | $[a_2; a_3)$ | ... | $[a_i; a_{i+1})$ | ... | $[a_k; a_{k+1})$
Frequencies | $m_1$ | $m_2$ | ... | $m_i$ | ... | $m_k$
Mean value | $x_1$ | $x_2$ | ... | $x_i$ | ... | $x_k$
2. Using Table 1, calculate the mathematical expectation estimate $\bar{x}$ and the sample standard deviation $\sigma_v$.
3. Calculate $z_i = \frac{a_i - \bar{x}}{\sigma_v}$, $i = 2, 3, \ldots, k$, where $a_i$ is the left end of the i-th range. Set value $z_1$ equal to minus infinity and value $z_{k+1}$ equal to plus infinity.
4. Assuming a normal distribution of the general population, determine the theoretical frequencies $m_1^{theor}, \ldots, m_k^{theor}$ by the formula:
$$m_i^{theor} = n \cdot p_i,$$
where $p_i = \Phi(z_{i+1}) - \Phi(z_i)$ is the probability of a random variable X to fall within the range $[a_i; a_{i+1})$; $\Phi(x)$ is the cumulative Laplace distribution function.
5. Calculate $\chi^2_{exp}$ by the formula:
$$\chi^2_{exp} = \sum_{i=1}^{k} \frac{\left(m_i - m_i^{theor}\right)^2}{m_i^{theor}} \qquad (3)$$
or
$$\chi^2_{exp} = \sum_{i=1}^{k} \frac{m_i^2}{m_i^{theor}} - n \qquad (4)$$
6. Using the table [9, 10, 11, 12], calculate $\chi^2_{cr}$ considering the given level of significance α and the number of degrees of freedom $\nu = k - 3$.
7. Compare $\chi^2_{exp}$ and $\chi^2_{cr}$.
If $\chi^2_{exp} < \chi^2_{cr}$, there is no reason to reject the hypothesis about the normal distribution of the general population.
If $\chi^2_{exp} > \chi^2_{cr}$, the hypothesis about the normal distribution of the general population should be rejected.
Testing the hypothesis about the distribution of a random variable according to a uniform law
1. Group the sample data by arranging them as a sequence of k ranges $[a_i; a_{i+1})$ and their corresponding frequencies $m_i$, $i = 1, \ldots, k$, $a_1 = a$, $a_{k+1} = b$.
2. From a given variational array, calculate the probabilities $p_i$ of X to fall within the range by the formula:
$$p_i = P(a_i < X < a_{i+1}) = \frac{a_{i+1} - a_i}{b - a} \qquad (5)$$
3. Calculate theoretical frequencies by the formula: $m_i^{theor} = n \cdot p_i$, where n is sample size.
4. Calculate $\chi^2_{exp}$ by the formula (4).
5. For given significance level α and the number of degrees of freedom $\nu = k - 1$, calculate $\chi^2_{cr}$ using the table [9,10,11,12].
6. Compare $\chi^2_{exp}$ and $\chi^2_{cr}$.
If $\chi^2_{exp} < \chi^2_{cr}$, there is no reason to reject the hypothesis of uniform distribution of X within the range [a; b].
If $\chi^2_{exp} > \chi^2_{cr}$, then the hypothesis of uniform distribution should be rejected.
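The steps for the uniform law can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the range boundaries and frequencies are hypothetical, the ends a and b are treated as known in advance (so the number of degrees of freedom is k - 1), and 7.815 is the tabulated chi-squared critical value for 3 degrees of freedom at the 0.05 level. Ranges of unequal width are used to show that the cell probability is proportional to the range width.

```python
def chi2_uniform(edges, freqs):
    """Chi-squared statistic for H0: X is uniform on [edges[0]; edges[-1]]."""
    a, b = edges[0], edges[-1]
    n = sum(freqs)
    # Probability of each range is its width divided by (b - a)
    p = [(hi - lo) / (b - a) for lo, hi in zip(edges, edges[1:])]
    m_theor = [n * pi for pi in p]
    # Shortcut form: sum of m_i^2 / m_i_theor minus n
    chi2 = sum(m * m / t for m, t in zip(freqs, m_theor)) - n
    return chi2, m_theor

# Hypothetical grouped data: 4 ranges on [0; 10], n = 100
edges = [0, 1, 3, 6, 10]
freqs = [8, 25, 27, 40]

chi2_exp, m_theor = chi2_uniform(edges, freqs)
chi2_cr = 7.815                      # table value, nu = k - 1 = 3, alpha = 0.05
reject_uniform = chi2_exp > chi2_cr
print(round(chi2_exp, 3), reject_uniform)
```

With these hypothetical counts the statistic stays below the critical value, so the uniform hypothesis would not be rejected. Note that the first theoretical frequency is 10, which respects the limitation that no theoretical frequency should be below 5.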
Example 1. 48 cows were examined for deviations of the annual milk yield from the average. Grouped data are given in Table 2.
Table 2. Given data for the problem
Annual milk yield, kg | 0-1000 | 1000-2000 | 2000-3000 | 3000-4000 | 4000-5000
Number of cows, animals | 2 | 8 | 23 | 13 | 2
Evaluate the hypothesis about the normal distribution of the general population at a significance level α ≤ 0.05 with Pearson's chi-squared test.
Solution. Let's rearrange the interval statistical array into a statistical array by replacing each range $[a_i; a_{i+1})$ with its mean value $x_i = \frac{a_i + a_{i+1}}{2}$. Now we have Table 3.
Table 3. Statistical array
$x_i$ | 500 | 1500 | 2500 | 3500 | 4500
$m_i$ | 2 | 8 | 23 | 13 | 2
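Using Table 3, let's calculate the mathematical expectation estimate $\bar{x}$ and the sample standard deviation $\sigma_v$.
Mathematical expectation:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{5} x_i m_i = \frac{1}{48}\,(500 \cdot 2 + 1500 \cdot 8 + 2500 \cdot 23 + 3500 \cdot 13 + 4500 \cdot 2) \approx 2604;$$
Sample variance (computed with the unrounded mean $\bar{x} = 125000/48$):
$$D_v = \frac{1}{n}\sum_{i=1}^{5} x_i^2 m_i - \bar{x}^2 = \frac{1}{48}\,(500^2 \cdot 2 + 1500^2 \cdot 8 + 2500^2 \cdot 23 + 3500^2 \cdot 13 + 4500^2 \cdot 2) - \bar{x}^2 = 759982.64;$$
Sample standard deviation:
$$\sigma_v = \sqrt{D_v} = \sqrt{759982.64} \approx 871.77.$$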
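Let's calculate $p_i = \Phi(z_{i+1}) - \Phi(z_i)$, the probability of a random variable X to fall within the range $[a_i; a_{i+1})$; $\Phi(x)$ is the cumulative Laplace distribution function (tabulated so that $\Phi(-\infty) = -0.5$ and $\Phi(+\infty) = 0.5$), $z_i = \frac{a_i - \bar{x}}{\sigma_v}$.
$$p_1 = \Phi\left(\frac{1000 - 2604}{871.77}\right) - \Phi(-\infty) = -0.4671 + 0.5 = 0.0329;$$
$$p_2 = \Phi\left(\frac{2000 - 2604}{871.77}\right) - \Phi\left(\frac{1000 - 2604}{871.77}\right) = -0.2549 + 0.4671 = 0.2122;$$
$$p_3 = \Phi\left(\frac{3000 - 2604}{871.77}\right) - \Phi\left(\frac{2000 - 2604}{871.77}\right) = 0.1736 + 0.2549 = 0.4285;$$
$$p_4 = \Phi\left(\frac{4000 - 2604}{871.77}\right) - \Phi\left(\frac{3000 - 2604}{871.77}\right) = 0.4452 - 0.1736 = 0.2716;$$
$$p_5 = \Phi(+\infty) - \Phi\left(\frac{4000 - 2604}{871.77}\right) = 0.5 - 0.4452 = 0.0548.$$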
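Let's calculate $m_i^{theor}$ by the formula $m_i^{theor} = n \cdot p_i$ and complete Table 4.
Table 4. Calculation results
No. | $x_i$ | $m_i$ | $m_i^2$ | $p_i$ | $m_i^{theor} = n \cdot p_i$ | $m_i^2 / m_i^{theor}$
1 | 500 | 2 | 4 | 0.0329 | 1.5792 | 2.532928
2 | 1500 | 8 | 64 | 0.2122 | 10.1856 | 6.28338
3 | 2500 | 23 | 529 | 0.4285 | 20.5680 | 25.71957
4 | 3500 | 13 | 169 | 0.2716 | 13.0368 | 12.9633
5 | 4500 | 2 | 4 | 0.0548 | 2.6304 | 1.520681
Σ | | 48 | | 1 | 48 | 49.01986
$$\chi^2_{exp} = 49.01986 - 48 = 1.01986 \approx 1.02.$$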
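Using the table [9, 10, 11, 12], for α ≤ 0.05 and $\nu = k - 3 = 5 - 3 = 2$ let's determine $\chi^2_{cr} = 6$. Let's plot the axis of significance:
[Figure: axis of significance. The insignificance area lies to the left of $\chi^2_{cr} = 6$ and the significance area to the right; $\chi^2_{exp} = 1.02$ falls in the insignificance area.]
Since 1.02 < 6 ($\chi^2_{exp} < \chi^2_{cr}$), the hypothesis about the normal distribution of the general population should be accepted.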
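The whole procedure for Example 1 can be reproduced programmatically. The sketch below is an illustration rather than the authors' implementation: it takes the range boundaries 0, 1000, ..., 5000 kg implied by the midpoints 500 to 4500 together with the frequencies 2, 8, 23, 13, 2, and replaces the printed Laplace table with the standard normal distribution function built from `math.erf`. Because the table-based arithmetic rounds the z values, the statistic obtained this way may differ slightly from the value in the text. The critical value 5.991 is the tabulated chi-squared value for 2 degrees of freedom at the 0.05 level (rounded to 6 in the text); the function names are assumptions.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function, from 0 to 1 (Laplace table value + 0.5)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def chi2_normality(edges, freqs):
    """Mean, standard deviation and chi-squared statistic for grouped data."""
    n = sum(freqs)
    mids = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
    mean = sum(x * m for x, m in zip(mids, freqs)) / n
    var = sum(x * x * m for x, m in zip(mids, freqs)) / n - mean ** 2
    sd = math.sqrt(var)
    # z_1 = -infinity and z_{k+1} = +infinity, so the outer CDF values are 0 and 1
    cdf = [0.0] + [normal_cdf((a - mean) / sd) for a in edges[1:-1]] + [1.0]
    p = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
    m_theor = [n * pi for pi in p]
    chi2 = sum(m * m / t for m, t in zip(freqs, m_theor)) - n
    return mean, sd, chi2

edges = [0, 1000, 2000, 3000, 4000, 5000]   # kg, boundaries implied by the midpoints
freqs = [2, 8, 23, 13, 2]                   # number of cows per range

mean, sd, chi2_exp = chi2_normality(edges, freqs)
chi2_cr = 5.991                             # table value, nu = 5 - 3 = 2, alpha = 0.05
print(round(mean, 2), round(sd, 2), round(chi2_exp, 2), chi2_exp > chi2_cr)
```

Two theoretical frequencies in this example are below 5, which conflicts with the limitation stated earlier for the test; merging the outer ranges with their neighbours would be one way to address that, at the cost of degrees of freedom.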
Example 2. In some areas, the distribution of cows by live weight was recorded. Grouped data are given in Table 5.
Table 5. Given data for the problem
Live weight, kg | 400-420 | 420-440 | 440-460 | 460-480 | 480-500
Livestock, animals | 12 | 39 | 88 | 82 | 86
Evaluate the hypothesis about the normal distribution of the general population at a significance level α ≤ 0.05 with Pearson's chi-squared test.
Solution. Let's rearrange the interval statistical array into a statistical array by replacing each range $[a_i; a_{i+1})$ with its mean value $x_i = \frac{a_i + a_{i+1}}{2}$. Now we have Table 6.
Table 6. Statistical array
$x_i$ | 410 | 430 | 450 | 470 | 490
$m_i$ | 12 | 39 | 88 | 82 | 86
Using Table 6, let's calculate the mathematical expectation estimate $\bar{x}$ and the sample standard deviation $\sigma_v$.
Mathematical expectation:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{5} x_i m_i = \frac{1}{307}\,(410 \cdot 12 + 430 \cdot 39 + 450 \cdot 88 + 470 \cdot 82 + 490 \cdot 86) \approx 462.443;$$
Sample variance:
$$D_v = \frac{1}{n}\sum_{i=1}^{5} x_i^2 m_i - \bar{x}^2 = \frac{1}{307}\,(410^2 \cdot 12 + 430^2 \cdot 39 + 450^2 \cdot 88 + 470^2 \cdot 82 + 490^2 \cdot 86) - 462.443^2 \approx 513.5757;$$
Sample standard deviation:
$$\sigma_v = \sqrt{D_v} = \sqrt{513.5757} \approx 22.66.$$
Let's calculate $p_i = \Phi(z_{i+1}) - \Phi(z_i)$, the probability of a random variable X to fall within the range $[a_i; a_{i+1})$; $\Phi(x)$ is the cumulative Laplace distribution function, $z_i = \frac{a_i - \bar{x}}{\sigma_v}$.
$$p_1 = \Phi\left(\frac{420 - 462.443}{22.66}\right) - \Phi(-\infty) = -0.4693 + 0.5 = 0.0307;$$
$$p_2 = \Phi\left(\frac{440 - 462.443}{22.66}\right) - \Phi\left(\frac{420 - 462.443}{22.66}\right) = -0.3389 + 0.4693 = 0.1304;$$
$$p_3 = \Phi\left(\frac{460 - 462.443}{22.66}\right) - \Phi\left(\frac{440 - 462.443}{22.66}\right) = -0.0398 + 0.3389 = 0.2991;$$
$$p_4 = \Phi\left(\frac{480 - 462.443}{22.66}\right) - \Phi\left(\frac{460 - 462.443}{22.66}\right) = 0.2794 + 0.0398 = 0.3192;$$
$$p_5 = \Phi(+\infty) - \Phi\left(\frac{480 - 462.443}{22.66}\right) = 0.5 - 0.2794 = 0.2206.$$
Let's calculate $m_i^{theor}$ by the formula $m_i^{theor} = n \cdot p_i$ and complete Table 7.
Table 7. Calculation results
No. | $x_i$ | $m_i$ | $m_i^2$ | $p_i$ | $m_i^{theor} = n \cdot p_i$ | $m_i^2 / m_i^{theor}$
1 | 410 | 12 | 144 | 0.0307 | 9.4249 | 15.278676
2 | 430 | 39 | 1521 | 0.1304 | 40.0328 | 37.993845
3 | 450 | 88 | 7744 | 0.2991 | 91.8237 | 84.335526
4 | 470 | 82 | 6724 | 0.3192 | 97.9944 | 68.616166
5 | 490 | 86 | 7396 | 0.2206 | 67.7242 | 109.207639
Σ | | 307 | | 1 | 307 | 315.432
$$\chi^2_{exp} = 315.432 - 307 = 8.43185 \approx 8.43.$$
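Using the table [9, 10, 11, 12], for α ≤ 0.05 and $\nu = k - 3 = 5 - 3 = 2$ let's determine $\chi^2_{cr} = 6$.
Let's plot the axis of significance:
[Figure: axis of significance. The insignificance area lies to the left of $\chi^2_{cr} = 6$ and the significance area to the right; $\chi^2_{exp} = 8.43$ falls in the significance area.]
Since 8.43 > 6 (for the null hypothesis to be accepted, it should be $\chi^2_{exp} < \chi^2_{cr}$), the hypothesis about the normal distribution of the general population should be rejected.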
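Both examples use two degrees of freedom, a case in which the chi-squared distribution has a simple closed form: the probability of exceeding a value x equals exp(-x/2). This makes it possible to check the tabulated critical value and to attach p-values to the two experimental statistics without any table. The sketch below is an illustration added for this special case only; the function names are assumptions, and the formula does not apply to other numbers of degrees of freedom.

```python
import math

def chi2_sf_df2(x):
    """P(chi-squared with 2 degrees of freedom > x) = exp(-x / 2)."""
    return math.exp(-x / 2.0)

def chi2_critical_df2(alpha):
    """Critical value for 2 degrees of freedom: solve exp(-x / 2) = alpha."""
    return -2.0 * math.log(alpha)

alpha = 0.05
chi2_cr = chi2_critical_df2(alpha)   # close to the rounded value 6 used in the text
p_example_1 = chi2_sf_df2(1.02)      # statistic reported for Example 1
p_example_2 = chi2_sf_df2(8.43)      # statistic reported for Example 2

print(round(chi2_cr, 3), round(p_example_1, 3), round(p_example_2, 3))
```

Comparing a p-value with the significance level leads to the same decisions as comparing the statistic with the critical value: the first p-value is well above 0.05 and the second is below it.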
Using the example of research in various fields, we will show the application of Pearson's goodness-of-fit test. The article [15] shows the applicability of target hazard quotient (THQ) estimates for communicating the danger of seafood due to metal contamination. The food recall data set was collected by the Laboratory of the Government Chemist (LGC, UK) between January and November 2007. For example, seafood products originating in only 3 countries were recalled more than 10 times due to metal contamination (Spain, 50 times; France, 11 times; Indonesia, 11 times). Products containing swordfish and sharks have been recalled more than 10 times, mostly due to mercury contamination. Based on the food alert/recall system, the application of THQ risk assessment in cases of seafood contamination with metals is questionable, as THQ implies frequent (or even daily) lifetime exposure. Infrequent recalls due to metal contamination and lack of trend make it highly unlikely that a person would be exposed to repeated significant levels of metal ions in seafood. Pearson's goodness-of-fit chi-squared test, nonparametric correlation (Kendall's tau) and Kruskal-Wallis test were used to confirm the hypothesis and perform statistical processing. The work [16] shows a study of perception, belief and behavior in relation to nutritional and complementary practices in inflammatory bowel disease (IBD). 80 patients with IBD completed a closed-ended 16-item questionnaire that was divided into three subsections: 1) baseline/demographic characteristics; 2) disease characteristics; 3) dietary and complementary beliefs and behaviors. One-sample chi-squared goodness-of-fit tests were used for each question, and two-sided Pearson's chi-squared tests of independence were used for testing differences in response to each question between baseline/demographic variables.
The processing time of 1.0 cm, 1.5 cm and 2.0 cm potato cubes with 0.4%, 0.8% and 1.2% aqueous solutions of sodium carboxymethyl cellulose at flow rates of 453 ml/s, 534 ml/s and 599 ml/s was measured for the performance of vertical scraped surface heat exchanger (VSHE) rotating at 60, 110 and 160 rpm, and the particle flow distribution characteristics for each set of conditions were studied in [17]. Statistical data processing using Pearson's chi-squared test showed that most distributions for the residence time of individual particles in the vertical flow in VSHE may be described by the gamma model, while for the horizontal VSHE, many of the individual distributions correspond to the normal model in addition to the gamma model. VSHE orientation turned out to be an important factor influencing the forces acting on particles during the flow in the VSHE. Interactions of particles with each other, as well as a combination of process parameters, caused a "tail" of some particles, which led to a shift in the distribution to the right. The purpose of the article [18] was to assess the purchasing behavior of consumers and the decision-making process when buying bread and to suggest ways to improve brand positioning in the market. 1601 correctly completed questionnaires were used for the analysis. Results were presented as response rates and statistical tests. The analysis included the evaluation of statistical hypotheses about independence (significance level α = 0.01) using goodness-of-fit chi-squared test and Pearson's randomness coefficient. Then the significance level was compared with the p value. For the p value > α, the null hypothesis was not rejected. The most important factors in choosing bread are freshness, appearance and price. Importance of price increases with the age of the respondents and decreases with the income of the surveyed consumers. The importance of a brand, as well as referrals from family and friends, increases slightly as consumer income increases. When making a purchase decision, most respondents do not make a difference between yeast and rye-yeast bread baking technologies. However, it cannot be stated that the preference for rye-yeast bread increases with the age of the respondents to the detriment of yeast bread, or vice versa.
In [19], gender differences were determined in the self-assessment of social functioning in patients with comorbidity of affective disorders and chronic coronary artery disease. The study included 248 cardiac patients (194 men (78.2%) and 54 women (21.8%)) with chronic coronary artery disease and affective disorders. The mean age of patients with chronic disease in men was (57.2 +/- 6.5) years, and in women it was (59.3 +/- 7.1), p = 0.04. Qualitative and quantitative indicators were examined using the Mann-Whitney test, Wilcoxon test and T-test; chi-squared test (Pearson's goodness-of-fit test) was used to estimate frequencies. The purpose of the study in [20] was to reveal the parents' ideas about the main trends and structural features of children's Internet addiction. The study was based on the results of a mass survey. The survey was conducted in 2019 on a multi-stage sample (by gender, age, type of location), consisting of the adult population of the Tyumen region. The authors carried out a detailed socio-statistical analysis of Internet risks for children based on self-assessments of all respondents (with identification of socio-demographic groups) and on risk assessments for children according to parents. The structure of "Parents" subsample by gender and type of location was proportional to the structure of the main sample. According to the authors, "Children" subsample included respondents' children of minority age. The risk of Internet addiction was included in the structure of 12 Internet risks and examined on the basis of 4 components (behavioral, cognitive, social and affective components). The analysis used Cronbach's alpha consistency ratings, index method, Spearman rank correlation coefficient, Pearson's goodness-of-fit test, F-test for equality of several means, case classification and triangulation method. The study [21] examined the relationship between mean micturition volume and urinary incontinence episodes per 24 hours after adjusting for fixed frequencies in children with overactive bladder. Patients were aged 5 to 12 years with >= 4 episodes of daytime urinary incontinence during the 7-day period prior to study entry. Mean number of episodes of urinary incontinence per 24 hours at the end of the study was the dependent variable. Explanatory variables included treatment, mean number of episodes of urinary incontinence per 24 hours at baseline, and change in mean micturition volume from baseline to the end of the study. Statistical significance and degree of conformity were analyzed using Pearson's chi-squared test. The aim of the study [22] was to evaluate the effectiveness of a pediatric mortality index of 3 in predicting mortality at the intensive care unit. This was an observational study conducted in the intensive care unit from January 2016 to October 2018. All patients aged 1 month to 15 years who were hospitalized to the intensive care unit were included. The authors analyzed the relationship between the pediatric mortality index of 3 and mortality. Indicators of the pediatric mortality index of 3 were assessed by calibration and discrimination. Calibration assessed the pediatric mortality index of 3 at various mortality risks using the standardized mortality rate (SMR) and Pearson's goodness-of-fit test (chi-squared test).
The study [23] evaluated the impact of health-related quality of life on the use of health services using four different scoring data models. Health-related quality of life was measured using a brief six-dimensional instrument and a functional assessment of colon cancer therapy, while health service use was measured by the number of monthly clinical consultations and the number of monthly hospitalizations. Goodness-of-fit statistics (Pearson's chi-squared test, Akaike information criterion and Bayesian tests) were used to determine the best model. In [24], a cross-sectional diagnostic study was described. 83 medical records of patients with suspected heart failure admitted to the emergency and internal medicine department of the Ramiro Priale Priale National Hospital were examined. Pearson's chi-squared test was used to analyze categorical variables and ANOVA was used for continuous variables. P-values <0.05 were considered significant.
Kolmogorov test
Kolmogorov goodness-of-fit test is designed to test the hypothesis that the sample belongs to some distribution law, i.e. to check that the empirical distribution corresponds to the expected model.
In this test, the maximum value of the absolute difference between the empirical distribution function $F_n(x)$ and the corresponding theoretical distribution function, $D = \max|F_n(x) - F(x)|$, is a measure of difference between theoretical and empirical distributions. The random variable is denoted as $\lambda = D\sqrt{n}$ and is called Kolmogorov goodness-of-fit λ-test.
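For ungrouped observations, the maximum difference between the empirical and theoretical distribution functions can be found by checking the empirical step function just before and just after each sorted observation. The sketch below is a minimal illustration under stated assumptions: the three-point sample and the choice of a uniform law on [0; 1] as the theoretical distribution are hypothetical and serve only to keep the arithmetic easy to follow; the function names are not taken from the article.

```python
import math

def kolmogorov_statistic(sample, cdf):
    """Return (D, lambda): the largest gap between the empirical and the
    theoretical distribution function, and lambda = D * sqrt(n)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical function jumps from (i - 1)/n to i/n at x, so both sides are checked
        d = max(d, i / n - f, f - (i - 1) / n)
    return d, d * math.sqrt(n)

def uniform01_cdf(x):
    """Theoretical distribution function of the uniform law on [0; 1]."""
    return min(1.0, max(0.0, x))

# Hypothetical sample, deliberately tiny so the result can be verified by hand
d, lam = kolmogorov_statistic([0.7, 0.1, 0.4], uniform01_cdf)
print(round(d, 3), round(lam, 3))
```

In this toy case the largest gap occurs at the last observation, where the empirical function reaches 1 while the theoretical one equals 0.7.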
Application of Kolmogorov test
1. Arrange the results of observations in ascending order: $x_1 \le x_2 \le \ldots \le x_n$, or represent them as an interval variational array.
2. Calculate the empirical relative frequencies for each rank by the formula:
$$f_j = \frac{m_j}{n}, \qquad (6)$$
where $m_j$ is the frequency of the j-th rank; n is the number of observations.
3. Determine the values of the empirical distribution function $F_n(x)$ by calculating the accumulated empirical relative frequencies by the formula:
$$\sum f_j = \sum f_{j-1} + f_j, \qquad (7)$$
where $\sum f_{j-1}$ is the relative frequency accumulated in the previous ranks; j is the order number of the rank.
The obtained values $\sum f_j$ are the values of the empirical distribution function.
4. Determine the corresponding values of the assumed theoretical distribution function by counting the accumulated theoretical relative frequencies for each rank by the formula:
$$\sum f_j^{theor} = \sum f_{j-1}^{theor} + f_j^{theor}, \qquad (8)$$
where $\sum f_{j-1}^{theor}$ is the theoretical relative frequency accumulated in the previous ranks.
5. Calculate the absolute differences between the empirical and theoretical accumulated relative frequencies for each rank. Designate them as d.
6. Determine the largest absolute difference $d_{max}$.
7. Using the table of Kolmogorov test critical values [9, 10, 11, 12], for a given significance level α and a number of observations n, determine the critical value $d_{cr}$. If n > 100, then $d_{cr}$ is calculated by the formula:
$$d_{cr} = \begin{cases} \dfrac{1.36}{\sqrt{n}} & \text{for } \alpha \le 0.05 \\[2mm] \dfrac{1.63}{\sqrt{n}} & \text{for } \alpha \le 0.01 \end{cases} \qquad (9)$$
If $d_{max} < d_{cr}$, then it is considered that there is no reason for rejecting the null hypothesis, i.e. the difference between the empirical and theoretical distribution function is not significant.
Limitations of test
Ranks should be arranged in ascending order.
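The grouped-data procedure can be sketched as follows. The illustration uses the rank frequencies 4, 10, 16, 30, 26, 13, 3, 1 (n = 103) from the example that follows and, as in that example, equal theoretical relative frequencies of 1/k for every rank. The large-sample critical value 1.36 divided by the square root of n for the 0.05 level is a standard asymptotic approximation and is assumed here; the function name is hypothetical. Because the code works with unrounded frequencies, the differences may deviate from hand calculations in the third decimal place.

```python
import math
from itertools import accumulate

def kolmogorov_grouped(freqs, theor_rel):
    """Largest absolute difference between accumulated empirical and
    accumulated theoretical relative frequencies, plus all differences."""
    n = sum(freqs)
    emp_rel = [m / n for m in freqs]            # empirical relative frequencies
    emp_cum = list(accumulate(emp_rel))         # accumulated empirical frequencies
    theor_cum = list(accumulate(theor_rel))     # accumulated theoretical frequencies
    diffs = [abs(e - t) for e, t in zip(emp_cum, theor_cum)]
    return max(diffs), diffs

freqs = [4, 10, 16, 30, 26, 13, 3, 1]           # live-weight ranks, n = 103
k = len(freqs)
theor_rel = [1 / k] * k                         # equal share per rank, as in the example

d_max, diffs = kolmogorov_grouped(freqs, theor_rel)
n = sum(freqs)
d_cr = 1.36 / math.sqrt(n)                      # assumed asymptotic value for n > 100
print(round(d_max, 3), round(d_cr, 3), d_max >= d_cr)
```

Assigning the same theoretical relative frequency to every rank corresponds to comparing the data with a uniform law over the ranks; testing against a normal law would instead require theoretical frequencies derived from the normal distribution function.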
Example. When weighing the fattened young cattle (103 animals) delivered to the meat processing plant, the following primary (raw) array was obtained according to live weight (kg):
413 45^4 419 412 427 435 404 430 421 399 41e 386 42t 441 397 417 418 423 420 416 407 427 428 417 39C 424 419 40f 424 411 426 380 419 406 410 409 g16e)0 )03 426 407 400 423 425 d9n 402 409 408 4419) 38f -423 434 402 431 4405 436 405 424 405 412 413 444 392 4t1 4f8 39i 4f3 eX5 433 4e: 430 398 437 422 3e4 416 42-4 4e4 407 443 406 422 410 429 417 406 419 429 4X6 388 421 415 41e 394 431 4U 422 4e10 432 40. 439 421
Determine whether the data obtained are normally distributed or not at a significance level α ≤ 0.05.
Solution. Let's rearrange the primary array into the variational array (Table 8).
Table 8. Variational array by the live weight of young cattle when delivered to a meat processing plant
W, kg | 380-389 | 390-399 | 400-409 | 410-419 | 420-429 | 430-439 | 440-449 | 450-459 | Sum
$m_j$ | 4 | 10 | 16 | 30 | 26 | 13 | 3 | 1 | n = 103
Let's determine empirical relative frequencies for each rank by the formula:
$$f_j = \frac{m_j}{n},$$
where $m_j$ is the frequency of the j-th rank, n is the total number of observations.
If $d_{max} \ge d_{cr}$, then the null hypothesis is rejected: differences between distributions are significant.
$f_1 = \frac{m_1}{n} = \frac{4}{103} = 0.039$
$f_2 = \frac{m_2}{n} = \frac{10}{103} = 0.097$
$f_3 = \frac{m_3}{n} = \frac{16}{103} = 0.155$
$f_4 = \frac{m_4}{n} = \frac{30}{103} = 0.291$
$f_5 = \frac{m_5}{n} = \frac{26}{103} = 0.252$
$f_6 = \frac{m_6}{n} = \frac{13}{103} = 0.126$
$f_7 = \frac{m_7}{n} = \frac{3}{103} = 0.029$
$f_8 = \frac{m_8}{n} = \frac{1}{103} = 0.0097$
Let's determine accumulated empirical relative frequencies by the formula:
$$\sum f_j = \sum f_{j-1} + f_j,$$
where $\sum f_{j-1}$ is the relative frequency accumulated in the previous ranks; j is the order number of the rank.
$\sum f_1 = f_1 = 0.039$
$\sum f_{1+2} = \sum f_1 + f_2 = 0.039 + 0.097 = 0.136$
$\sum f_{1+2+3} = \sum f_{1+2} + f_3 = 0.136 + 0.155 = 0.291$
$\sum f_{1+2+3+4} = \sum f_{1+2+3} + f_4 = 0.291 + 0.291 = 0.582$
$\sum f_{1+2+3+4+5} = \sum f_{1+2+3+4} + f_5 = 0.582 + 0.252 = 0.834$
$\sum f_{1+2+3+4+5+6} = \sum f_{1+2+3+4+5} + f_6 = 0.834 + 0.126 = 0.960$
$\sum f_{1+2+3+4+5+6+7} = \sum f_{1+2+3+4+5+6} + f_7 = 0.960 + 0.029 = 0.989$
$\sum f_{1+2+3+4+5+6+7+8} = \sum f_{1+2+3+4+5+6+7} + f_8 = 0.989 + 0.0097 = 0.9987 \approx 1$
Let's determine theoretical relative frequencies for each rank. For the 1st rank, the theoretical relative frequency is calculated by the formula:
$f^{theor} = \frac{1}{k}$,
where $k$ is the number of ranks ($k = 8$).
$f_1^{theor} = \frac{1}{8} = 0.125$.
This theoretical relative frequency applies to all ranks. Let's determine accumulated theoretical relative frequencies.
$\sum f_1^{theor} = f_1^{theor} = 0.125$
$\sum f_{1+2}^{theor} = \sum f_1^{theor} + f_2^{theor} = 0.125 + 0.125 = 0.250$
$\sum f_{1+2+3}^{theor} = \sum f_{1+2}^{theor} + f_3^{theor} = 0.250 + 0.125 = 0.375$
$\sum f_{1+2+3+4}^{theor} = \sum f_{1+2+3}^{theor} + f_4^{theor} = 0.375 + 0.125 = 0.500$
$\sum f_{1+2+3+4+5}^{theor} = \sum f_{1+2+3+4}^{theor} + f_5^{theor} = 0.500 + 0.125 = 0.625$
$\sum f_{1+2+3+4+5+6}^{theor} = \sum f_{1+2+3+4+5}^{theor} + f_6^{theor} = 0.625 + 0.125 = 0.750$
$\sum f_{1+2+3+4+5+6+7}^{theor} = \sum f_{1+2+3+4+5+6}^{theor} + f_7^{theor} = 0.750 + 0.125 = 0.875$
$\sum f_{1+2+3+4+5+6+7+8}^{theor} = \sum f_{1+2+3+4+5+6+7}^{theor} + f_8^{theor} = 0.875 + 0.125 = 1$
Calculate the absolute differences between the accumulated empirical and theoretical frequencies:
$d_j = \left| \sum f_j - \sum f_j^{theor} \right|$
$d_1 = |0.039 - 0.125| = 0.086$;
$d_2 = |0.136 - 0.250| = 0.114$;
$d_3 = |0.291 - 0.375| = 0.084$;
$d_4 = |0.582 - 0.500| = 0.082$;
$d_5 = |0.834 - 0.625| = 0.209$;
$d_6 = |0.960 - 0.750| = 0.210$;
$d_7 = |0.989 - 0.875| = 0.114$;
$d_8 = |1 - 1| = 0$.
The results are shown in Table 9.
Table 9. Calculation results
Number of points   Empirical frequency   Empirical relative frequency   Accumulated empirical relative frequency   Accumulated theoretical relative frequency   Difference
1                  4                     0.039                          0.039                                      0.125                                        0.086
2                  10                    0.097                          0.136                                      0.250                                        0.114
3                  16                    0.155                          0.291                                      0.375                                        0.084
4                  30                    0.291                          0.582                                      0.500                                        0.082
5                  26                    0.252                          0.834                                      0.625                                        0.209
6                  13                    0.126                          0.960                                      0.750                                        0.210
7                  3                     0.029                          0.989                                      0.875                                        0.114
8                  1                     0.0097                         1                                          1                                            0
Sums               103                   1
Let's determine the largest absolute difference (the cell highlighted in yellow in Table 9): $d_{max} = 0.210$.
Since in this problem $n > 100$, the critical value $d_{cr}$ is calculated by the formula (9) for a significance level $\alpha < 0.05$:
$d_{cr} = \frac{1.36}{\sqrt{n}} = \frac{1.36}{\sqrt{103}} = 0.134$
Let's plot the axis of significance:
[Axis of significance: $d_{cr} = 0.134$; $d_{max} = 0.210$]
Since $d_{max} > d_{cr}$, the null hypothesis is rejected, i.e. the empirical distribution for the live weight of cattle delivered to a meat processing plant differs from the normal (uniform) distribution.
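The calculation in this example can be reproduced with a short script. The following is only a sketch under stated assumptions: the function name `kolmogorov_uniform` and its `coef` parameter are introduced here for illustration, the theoretical distribution is taken as an equal share 1/k per rank (as in the example), and the multiplier 1.36 is the one the text uses for $\alpha < 0.05$ when $n > 100$. Because the script works with unrounded values, it gives $d_{max} \approx 0.211$ rather than the hand-calculated 0.210; the conclusion is the same.

```python
from itertools import accumulate
from math import sqrt

# Grouped frequencies from Table 8 (eight weight classes, n = 103).
freqs = [4, 10, 16, 30, 26, 13, 3, 1]


def kolmogorov_uniform(freqs, coef=1.36):
    """Return (d_max, d_cr) for grouped data tested against equal class shares.

    coef = 1.36 is the multiplier used in the text for alpha < 0.05, n > 100.
    """
    n = sum(freqs)
    k = len(freqs)
    # Accumulate integer counts first, then divide, to avoid rounding drift.
    cum_emp = [c / n for c in accumulate(freqs)]
    # Accumulated theoretical relative frequencies: 1/k, 2/k, ..., 1.
    cum_theor = [(j + 1) / k for j in range(k)]
    diffs = [abs(e - t) for e, t in zip(cum_emp, cum_theor)]
    return max(diffs), coef / sqrt(n)


d_max, d_cr = kolmogorov_uniform(freqs)
print(f"d_max = {d_max:.3f}, d_cr = {d_cr:.3f}, reject H0: {d_max >= d_cr}")
```

Accumulating the counts before dividing by n is a deliberate choice: it avoids the small rounding differences that appear when rounded relative frequencies are summed by hand.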
Let's give examples for the use of Kolmogorov test in scientific research. The article [25] analyzed the growth of rice, wheat and common food grains in India for the period from 19550 to 2019. The distribution was assessed using Kolmogorov test. It was found that the availability of rice (70.05 kg/year), wheat (70.73 kg/year) and total grains (182.96 kg/year) will decrease in 2021 compared to this year. The article [26] analyzed a questionnaire survey of 227 respondents regarding purchasing preferences for organic food in Slovakia. To achieve the goal and provide a deeper analysis of the results, 3 assumptions and p hypotheses were made. According to the survey results, y5% of respondents buy organic products, of which 39% buy organic products at least once a week. Up to 98% of respondents have already heard about the concept of organic food and know what it means. 37% of respondents buy mostly organic fruits and vegetables; 18% of respondents buy mostly organic meat and meat products, and 13% of respondents prefer organic dairy products. The most preferred place to buy organic products is specialized stores (iny%); buying organic products directly from the manufacturer is the most popular way for g9% of respondents; hypermarkets and supermarkets are a favorite place to buy organic products for 19% of respondents; and h2% of respondents buy organic products mainly in farmers' markets. Only 4% of respondents prefer another way to buy organic products. The quality of organic products and the absence of pesticides are the most important criteria for purchasing organic products (36%). The results of the study were evaluated using the goodness-of-fit chi-squared test and Kolmogorov test, and the following conclusion was made: there is a difference in the preferences of the respondents. In Slovakia, there is a relationship between consumer preferences for organic food and traditional food, and there is a strong preference to buy organic food.
The aim of the study [27] was to present a correct model for probability distribution based on data obtained from surveys on the temperature of food storage in household refrigerators at home. The temperature in household refrigerators was determined as a risk factor for foodborne disease outbreaks for microbial risk assessment. Temperature was measured by visiting .39 homes directly with a data logger from May to September 2009. The overall average temperature for all refrigerators participating in the survey was 3.533 ± 2.96 °C, with 23.6% of refrigerators having temperatures above 5 °C. Probability distributions were generated from the measured temperature data. Statistical ranking was determined by Kolmogorov goodness-of-fit test or Anderson-Darling test to determine the appropriate probability distribution model. This result showed that the LogLogistic distribution (-10.407, 13.616, r.6f 07) was the most appropriate for the microbial risk assessment model.
The aim of the work [28] was to study the strong Markov property for stochastic differential equations controlled by G-Brownian motion (G-SDE). First, the authors extended the conditional G-expectation from deterministic time to optional points of time. The strong Markov property for the G-SDE was obtained using Kolmogorov tightness criterion. The article [29] considers the process of the defect appearance in the body of a workpiece obtained by casting. The medium with many randomly distributed discontinuities was schematized as a regular structure formed by a set of elements in the form of a regular tetrahedron with spherical depressions at the vertices. The proposed technique makes it possible to create a model of a continuous homogeneous medium that is equivalent in its deformation properties to the original discontinuous material. Using this approach, a power approximation of the extension curve for a model medium was obtained. The rupture of the material was fixed using Kolmogorov plastic deformation test. This test was used in the evaluation of the limit state of the valve chamber under operating conditions.
Nonparametric tests for homogeneity
Hypotheses of homogeneity are hypotheses assuming that the samples under study are taken from the same general population.
Let there be two independent samples with sizes $n_1$ and $n_2$ obtained from populations with unknown theoretical distribution functions $F_1(x)$ and $F_2(x)$. Hypotheses are stated:
$H_0$: Empirical distribution 1 does not differ from empirical distribution 2, i.e. $F_1(x) = F_2(x)$.
$H_1$: Empirical distribution 1 differs from empirical distribution 2, i.e. $F_1(x) \neq F_2(x)$.
Pearson's chi-squared test for homogeneity
Pearson's chi-squared test may be used to evaluate the homogeneity of two or more independent samples, i.e. to test the hypothesis that there are no differences between two or more empirical distributions of the same indicator. Source data should be presented in the form of Table 10:
Table 10. Source data template (cross-tab table or contingency table)
Empirical frequencies          Indicator ranking          Sum
                               1   ...   j   ...   k
Ranks of the indicator   1
                         ...
                         i
                         ...
                         c
Sum
Such tables are called cross-tab tables or contingency tables.
The algorithm for calculating Pearson's chi-squared test is the same as for Pearson's goodness-of-fit test (see above), but for each cell of the ith row and jth column, its own theoretical frequency is determined by the formula:
$m_{ij}^{theor} = \frac{\sum m_i \cdot \sum m_j}{N}$, (10)
where $N$ is the sum of frequencies of the entire contingency table; $\sum m_i$ is the sum of frequencies in all cells of the ith row; $\sum m_j$ is the sum of frequencies in all cells of the jth column.
Pearson's chi-squared test is calculated by the formula:
$\chi^2_{exp} = \sum_{i=1}^{c} \sum_{j=1}^{k} \frac{(m_{ij} - m_{ij}^{theor})^2}{m_{ij}^{theor}}$ (11)
The number of degrees of freedom is calculated by the formula:
$\nu = (k - 1) \cdot (c - 1)$, (12)
where $k$ is the number of indicator rankings; $c$ is the number of ranks for the indicator (number of compared distributions).
If the number of degrees of freedom is equal to 1, i.e., if the indicator only takes two values, a continuity correction is needed. The continuity correction is applied under the following conditions:
1) when the empirical distribution is compared to the uniform distribution, the number of indicator rankings $k = 2$, and the number of degrees of freedom $\nu = k - 1 = 1$;
2) when two empirical distributions are compared and $k = 2$, i.e. the number of rows and the number of columns are both equal to 2 and $\nu = (k - 1) \cdot (c - 1) = 1$.
In these cases, it is necessary to reduce the absolute difference $|m_{ij} - m_{ij}^{theor}|$ by 0.5 prior to squaring. $\chi^2_{exp}$ is calculated by the formula:
$\chi^2_{exp} = \sum_{i=1}^{c} \sum_{j=1}^{k} \frac{(|m_{ij} - m_{ij}^{theor}| - 0.5)^2}{m_{ij}^{theor}}$ (13)
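Formulas (10)-(13) can be collected into one small routine. The sketch below is illustrative only: the function name `chi2_statistic`, its `continuity` flag and the 2 x 2 frequencies in `demo` are assumptions made for this example and do not come from the article.

```python
def chi2_statistic(table, continuity=False):
    """Pearson chi-squared statistic and degrees of freedom for a cross-tab.

    table: list of rows holding empirical frequencies.
    continuity: if True, reduce each |m - m_theor| by 0.5 before squaring
    (formula 13); intended for tables with one degree of freedom.
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, m in enumerate(row):
            m_theor = row_sums[i] * col_sums[j] / total  # formula (10)
            diff = abs(m - m_theor)
            if continuity:
                diff -= 0.5
            chi2 += diff ** 2 / m_theor  # formula (11) or (13)
    dof = (len(col_sums) - 1) * (len(row_sums) - 1)  # formula (12)
    return chi2, dof


# Hypothetical 2 x 2 frequencies, invented purely for illustration.
demo = [[12, 8], [5, 15]]
plain, dof = chi2_statistic(demo)
corrected, _ = chi2_statistic(demo, continuity=True)
print(dof, round(plain, 3), round(corrected, 3))
```

With one degree of freedom the corrected statistic is always smaller than the uncorrected one when every absolute difference exceeds 0.5, which makes the test more conservative.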
Example. During the survey, high school students were asked which of the three possible areas of education (mathematics, natural sciences or human sciences) they would prefer in the future. Among the respondents were both young males and young females [30]. The data are summarized in Table 11.
Table 11. Given data for the problem
Empirical frequencies                        Indicator ranking
                                             Mathematics   Natural sciences   Human sciences
Ranks of the indicator   Young males (1)     18            10                 3
                         Young females (2)   10            9                  15
Such a table is called a cross-tab table of size 2 x 3. Is it possible to state that at a significance level $\alpha < 0.05$ the preference for one or another area of education is somehow related to the gender factor?
Solution. Let's state the hypotheses:
$H_0$: the distribution of preferences for the area of education in young males and young females is not significantly different from the random distribution.
$H_1$: the distribution of preferences for the area of education in young males and young females is significantly different from the random distribution.
In Table 12, sums of frequencies are calculated by rows and columns.
Table 12. Intermediate cross-tab 2 x 3 calculations
Empirical frequencies                        Indicator ranking                                  Sum
                                             Mathematics   Natural sciences   Human sciences
Ranks of the indicator   Young males (1)     18            10                 3                 31
                         Young females (2)   10            9                  15                34
Sum                                          28            19                 18                65
For each of the cells, a special theoretical frequency related only to this cell should be calculated by the formula:
$m_{ij}^{theor} = \frac{\sum m_i \cdot \sum m_j}{N}$
There are 65 frequencies in total, of which 28 frequencies correspond to mathematics, 19 frequencies correspond to natural sciences, and 18 frequencies correspond to human sciences. The proportion of each education area is 28/65, 19/65, 18/65, respectively. In all rows, mathematics should be 28/65 of all the answers, natural sciences should be 19/65, and human sciences should be 18/65. Knowing the sums of frequencies for each row, you can calculate the theoretical frequencies for each cell.
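$m_{11}^{theor} = \frac{31 \cdot 28}{65} = 13.35$;
$m_{12}^{theor} = \frac{31 \cdot 19}{65} = 9.06$;
$m_{13}^{theor} = \frac{31 \cdot 18}{65} = 8.58$;
$m_{21}^{theor} = \frac{34 \cdot 28}{65} = 14.65$;
$m_{22}^{theor} = \frac{34 \cdot 19}{65} = 9.94$;
$m_{23}^{theor} = \frac{34 \cdot 18}{65} = 9.42$.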
Let's complete Table 13.
Table 13. Calculation results
Rank, indicator ranking           $m_{ij}$   $m_{ij}^{theor}$   $m_{ij} - m_{ij}^{theor}$   $(m_{ij} - m_{ij}^{theor})^2$   $(m_{ij} - m_{ij}^{theor})^2 / m_{ij}^{theor}$
Young males, mathematics          18         13.35              4.65                        21.59                           1.62
Young males, natural sciences     10         9.06               0.94                        0.88                            0.10
Young males, human sciences       3          8.58               -5.58                       31.19                           3.63
Young females, mathematics        10         14.65              -4.65                       21.59                           1.47
Young females, natural sciences   9          9.94               -0.94                       0.88                            0.09
Young females, human sciences     15         9.42               5.58                        31.19                           3.31
$\chi^2_{exp} = 1.62 + 0.10 + 3.63 + 1.47 + 0.09 + 3.31 = 10.22$.
The number of degrees of freedom is calculated by the formula:
$\nu = (k - 1) \cdot (c - 1) = (3 - 1) \cdot (2 - 1) = 2$
Using the table of critical values [9, 10, 11, 12] of the $\chi^2$ distribution, for $\nu = 2$ and $\alpha < 0.05$, $\chi^2_{cr} = 5.992$. Let's plot the axis of significance:
Since $\chi^2_{exp} > \chi^2_{cr}$, the null hypothesis should be rejected and the alternative hypothesis should be accepted, i.e. the dependence of preference in choosing a further education on the gender of the respondent was proved.
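The arithmetic of this example can be checked with a few lines of code. This is a sketch rather than the authors' procedure: the variable names are chosen here for illustration, and the p-value line relies on the fact that, for exactly 2 degrees of freedom, the chi-squared tail probability equals exp(-x/2). That shortcut is not valid for other numbers of degrees of freedom.

```python
from math import exp

# Empirical frequencies from Table 12: rows = young males, young females;
# columns = mathematics, natural sciences, human sciences.
observed = [[18, 10, 3], [10, 9, 15]]

row_sums = [sum(r) for r in observed]
col_sums = [sum(c) for c in zip(*observed)]
total = sum(row_sums)

# Theoretical frequency of each cell: row sum * column sum / N.
expected = [[r * c / total for c in col_sums] for r in row_sums]

chi2_exp = sum(
    (m - t) ** 2 / t
    for obs_row, exp_row in zip(observed, expected)
    for m, t in zip(obs_row, exp_row)
)
dof = (len(col_sums) - 1) * (len(row_sums) - 1)

# Closed-form tail probability, valid only because dof == 2 here.
p_value = exp(-chi2_exp / 2)

print(f"chi2 = {chi2_exp:.2f}, dof = {dof}, p = {p_value:.4f}")
```

The statistic agrees with the hand calculation (10.22), and the resulting p-value is well below 0.05, which matches the decision to reject the null hypothesis.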
In the studies [31-35], chi-squared test was used. The study [31] examined the association of interleukin-6 (IL-6) (IL-6-174G/C), transforming growth factor-beta 1 (TGF-beta1-29C/T) and calmodulin 1 gene 16C/T polymorphism (CALM1-16C/T) with clinically determined osteoarthritis in Pakistani patients and a corresponding control group. The study included 295 subjects, including 100 patients with osteoarthritis, 105 patients with predisposition to osteoarthritis and 90 patients from the control group. The study design was based on biochemical analysis of osteoarthritis using hyaluronic acid serum enzyme-linked immunosorbent assay and genetic analysis based on PCR with an amplification-resistant mutation system. Allele probabilities were statistically estimated using Pearson's chi-squared test. The authors [32] studied the role and interaction of proteins involved in the control and stimulation of neurotransmission in predisposition to migraine. The study was performed on 183 migraineurs (148 women and 35 men) and 265 non-migraine controls (202 women and 63 men). Labeling of single nucleotide polymorphisms of neurexin was carried out to assess the association between neurexin and predisposition to migraine. Chi-squared test was used to compare allele frequencies in test cases and controls, and odds ratios were estimated with 95% confidence intervals. The authors [33] present a retrospective crossover observational study of the epidemiological profile of all dengue cases confirmed and reported to the Minister of Health in Pernambuco between 2015 and 2017. The data include all municipalities of Pernambuco with the exception of Fernando de Noronha. People infected with dengue were classified according to the type of dengue fever (without and with the symptoms or severe dengue), age, sex, ethnicity, and intermediate geographic region of residence (Recife, Caruaru, Serra Talhada, or Petrolina). The distribution of cases by years was estimated using chi-squared test.
The aim of the study [34] was to evaluate eating behavior, health-related and nutrition-related problems among students with symptoms of orthorexia nervosa. The participants were 1120 college students from seven universities in Poland studying health-related (n=547) and other specialties (n=573). Students were examined with ORTO-15 test, the health problems scale and the food intake frequencies questionnaire. Then, based on principal component analysis, eight nutrition patterns were derived ("sweets and snacks", "legumes and nuts", "fruits and vegetables", "refined breads and animal fats", "dairy products and eggs", "fish", "meat", "fruit and vegetable juices"). Pearson's correlation, Pearson's chi-squared test, Student t-test and one-way ANOVA were used for further analysis. In the work [35], the authors studied the potential roles and mechanisms of si-STOML2 (stomatin-like protein 2) in the migration and invasion of human hepatoma LM3 cells. Stomatin-like protein 2 expression levels in tissues and cells were separately analyzed by quantitative real-time PCR (qRT-PCR) and Western blotting. Cell viability, migration and invasion were assessed using the Cell Counting Kit-8, wound healing assay and transwell assay, respectively. mRNA and various protein factors levels were separately measured by qRT-PCR and Western blotting. The correlation between the expression of stomatin-like protein 2 and the clinical/pathological features of liver cancer patients was assessed using the chi-squared test.
Kolmogorov-Smirnov test
The Kolmogorov-Smirnov test statistic is the following:
$\lambda = \sqrt{\frac{n_1 \cdot n_2}{n_1 + n_2}} \cdot \max |F_1(x) - F_2(x)|$, (14)
where F1(x) and F2(x) are empirical distribution functions from two samples with sizes n1 and n2. Let's assume that the functions F1 (x) and F2 (x) are continuous.
Application of Kolmogorov-Smirnov test
1. Arrange the results of observations in ascending order: $x_1 < x_2 < \dots < x_n$ or represent them as an interval variational array.
2. Calculate the empirical relative frequencies for each rank for distribution 1 by the formula:
$f_{1j} = \frac{m_{1j}}{n_1}$,
where $m_{1j}$ is the empirical frequency in the given rank; $n_1$ is the number of observations in the sample.
3. Calculate the empirical relative frequencies for each rank for distribution 2 by the formula:
$f_{2j} = \frac{m_{2j}}{n_2}$,
where $m_{2j}$ is the empirical frequency in the given rank; $n_2$ is the number of observations in the sample.
4. Calculate the accumulated empirical relative frequencies for distribution 1 by the formula:
$\sum f_{1j} = \sum f_{1,j-1} + f_{1j}$,
where $\sum f_{1,j-1}$ is the relative frequency accumulated in the previous ranks; $j$ is the order number of the rank; $f_{1j}$ is the relative frequency of the given rank.
5. Calculate the accumulated empirical relative frequencies for distribution 2 by the same formula:
$\sum f_{2j} = \sum f_{2,j-1} + f_{2j}$,
where $\sum f_{2,j-1}$ is the relative frequency accumulated in the previous ranks; $f_{2j}$ is the relative frequency of the given rank.
6. Calculate the absolute differences between the accumulated relative frequencies for each rank. Designate them as d. Determine the largest absolute difference dmax.
7. Calculate $\lambda_{exp}$ by the formula:
$\lambda_{exp} = d_{max} \cdot \sqrt{\frac{n_1 \cdot n_2}{n_1 + n_2}}$, (15)
where $n_1$ is the number of observations in the first sample; $n_2$ is the number of observations in the second sample.
8. Using the table of critical values [9, 10, 11, 12], for a given significance level $\alpha$, determine $\lambda_{cr}$. If $\lambda_{exp} > \lambda_{cr}$, then the differences between the distributions are significant. If $\lambda_{exp} < \lambda_{cr}$, then the differences between the distributions are not significant.
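Steps 2-7 above can be sketched as a single function for data that are already grouped into common ranks. The name `ks_lambda` and the sample frequencies are hypothetical and serve only to exercise the code; the comparison with $\lambda_{cr}$ from the tables (step 8) is left to the reader.

```python
from itertools import accumulate
from math import sqrt


def ks_lambda(m1, m2):
    """Return (d_max, lambda_exp) for two samples grouped into the same ranks.

    m1, m2: empirical frequencies per rank, listed in the same ascending order.
    """
    if len(m1) != len(m2):
        raise ValueError("both samples must use the same ranks")
    n1, n2 = sum(m1), sum(m2)
    cum1 = [c / n1 for c in accumulate(m1)]  # steps 2 and 4
    cum2 = [c / n2 for c in accumulate(m2)]  # steps 3 and 5
    d_max = max(abs(a - b) for a, b in zip(cum1, cum2))  # step 6
    lam = d_max * sqrt(n1 * n2 / (n1 + n2))  # step 7, formula (15)
    return d_max, lam


# Hypothetical grouped frequencies, invented only to exercise the function.
d_max, lam = ks_lambda([10, 20, 30], [30, 20, 10])
print(round(d_max, 4), round(lam, 4))
```

The function accumulates raw counts and divides once per rank, so the accumulated relative frequency of the last rank is exactly 1 for both samples.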
Limitations of Kolmogorov-Smirnov test
1. When comparing two empirical distributions, it is necessary that $n_1, n_2 > 50$.
2. Ranks must be arranged in ascending or descending order by some indicator. We cannot accumulate frequencies by the ranks that differ only qualitatively and do not represent a scale of order.
Example. To evaluate the effectiveness of a drug, one group of subjects was given a test drug tested on animals, and the other group of subjects was given a placebo (a physiologically inert substance, the positive therapeutic effect of which is associated with the patient's subconscious psychological expectation). Table 14 represents data on the number of occurrences of influenza symptoms over a two-year period in the group taking prophylactic drug at the beginning of the period and in the group taking placebo [12].
Table 14. Given data for the problem
Number of diseases   Number of patients taking the drug, $m_{1j}$   Number of patients taking placebo, $m_{2j}$
0                    32                                             26
1                    26                                             30
2                    15                                             11
3                    6                                              14
4 and more           6                                              19
Sum                  85                                             100
Can we state that at a significance level $\alpha < 0.05$ the effect of the drug is significantly greater than that of placebo?
Solution. Let's state the hypotheses:
$H_0$: Empirical distribution 1 does not differ from empirical distribution 2, i.e. the effect of the drug does not significantly exceed the effect of the placebo.
$H_1$: Empirical distribution 1 differs from empirical distribution 2, i.e. the effect of the drug significantly exceeds the effect of the placebo.
Let's determine empirical relative frequencies for each rank for sample 1 (first test) by the formula:
$f_{1j} = \frac{m_{1j}}{n_1}$
$f_{11} = \frac{m_{11}}{n_1} = \frac{32}{85} = 0.3765$
$f_{12} = \frac{m_{12}}{n_1} = \frac{26}{85} = 0.3059$
etc.
The results of the calculations are represented in Table 15.
Table 15. Calculation results
Number of diseases   Empirical frequencies    Empirical relative frequencies   Accumulated empirical relative frequencies   Difference
                     $m_{1j}$    $m_{2j}$     $f_{1j}$    $f_{2j}$             $\sum f_{1j}$    $\sum f_{2j}$               $|\sum f_{1j} - \sum f_{2j}|$
0                    32          26           0.3765      0.2600               0.3765           0.2600                      0.1165
1                    26          30           0.3059      0.3000               0.6824           0.5600                      0.1224
2                    15          11           0.1765      0.1100               0.8588           0.6700                      0.1888
3                    6           14           0.0706      0.1400               0.9294           0.8100                      0.1194
4 and more           6           19           0.0706      0.1900               1.0000           1.0000                      0
Sum                  85          100          1           1
Let's determine empirical relative frequencies for each rank for sample 2 (second test) by the formula:
$f_{2j} = \frac{m_{2j}}{n_2}$
$f_{21} = \frac{m_{21}}{n_2} = \frac{26}{100} = 0.2600$;
$f_{22} = \frac{m_{22}}{n_2} = \frac{30}{100} = 0.3000$
etc.
Let's calculate the accumulated empirical relative frequencies for sample 1 by the formula:
$\sum f_{1j} = \sum f_{1,j-1} + f_{1j}$.
$\sum f_{11} = f_{11} = 0.3765$
$\sum f_{12} = \sum f_{11} + f_{12} = 0.3765 + 0.3059 = 0.6824$
etc.
Let's calculate the accumulated empirical relative frequencies for sample 2 by the same formula:
$\sum f_{2j} = \sum f_{2,j-1} + f_{2j}$.
$\sum f_{21} = f_{21} = 0.2600$
$\sum f_{22} = \sum f_{21} + f_{22} = 0.2600 + 0.3000 = 0.5600$
etc.
Let's determine the absolute differences between the accumulated empirical relative frequencies by the formula:
$d_j = \left| \sum f_{1j} - \sum f_{2j} \right|$.
$d_1 = |\sum f_{11} - \sum f_{21}| = |0.3765 - 0.2600| = 0.1165$;
$d_2 = |\sum f_{12} - \sum f_{22}| = |0.6824 - 0.5600| = 0.1224$;
etc.
From Table 15, let's determine the largest absolute difference dmax. This is dmax = 0.1888 (highlighted in yellow).
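Let's calculate $\lambda_{exp}$:
$\lambda_{exp} = d_{max} \cdot \sqrt{\frac{n_1 \cdot n_2}{n_1 + n_2}} = 0.1888 \cdot \sqrt{\frac{85 \cdot 100}{85 + 100}} = 1.28$
Using the table of critical values [9, 10, 11, 12], for a given significance level $\alpha < 0.05$, let's determine $\lambda_{cr} = 1.36$.
Let's plot the axis of significance:
[Axis of significance: $\lambda_{exp} = 1.28$; $\lambda_{cr} = 1.36$]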
Since $\lambda_{exp} < \lambda_{cr}$, the differences between the two empirical distributions are not significant, i.e. the effect of the drug does not significantly exceed the effect of the placebo.
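The numbers of this example can be verified with the sketch below. It also evaluates the asymptotic Kolmogorov tail probability, $2 \sum_{k \geq 1} (-1)^{k-1} e^{-2 k^2 \lambda^2}$, which is the distribution behind the tabulated value 1.36 for a 0.05 level. The helper name `kolmogorov_tail` and the truncation at 100 terms are assumptions made here, and for grouped data such as Table 14 the resulting p-value should be read as an approximation only.

```python
from itertools import accumulate
from math import exp, sqrt

# Frequencies from Table 14 for 0, 1, 2, 3 and "4 and more" diseases.
drug = [32, 26, 15, 6, 6]
placebo = [26, 30, 11, 14, 19]

n1, n2 = sum(drug), sum(placebo)
cum1 = [c / n1 for c in accumulate(drug)]
cum2 = [c / n2 for c in accumulate(placebo)]
d_max = max(abs(a - b) for a, b in zip(cum1, cum2))
lambda_exp = d_max * sqrt(n1 * n2 / (n1 + n2))  # formula (15)


def kolmogorov_tail(lam, terms=100):
    """Asymptotic tail probability of the Kolmogorov distribution at lam."""
    return 2 * sum(
        (-1) ** (k - 1) * exp(-2 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )


p_value = kolmogorov_tail(lambda_exp)
print(f"d_max = {d_max:.4f}, lambda = {lambda_exp:.2f}, p = {p_value:.3f}")
```

The approximate p-value is above 0.05, which agrees with $\lambda_{exp} = 1.28$ falling below $\lambda_{cr} = 1.36$; evaluating `kolmogorov_tail(1.36)` gives a value close to 0.05, which is how the tabulated critical value relates to the significance level.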
In the studies [36-44], Kolmogorov-Smirnov test was used. The study [36] aimed to determine the relationship between the management of household solid waste (HSW) and non-household solid waste (NHSW) (X variable) in Huancavelica County and municipal government (Y variable) in 2016. The population and sample were 12,249 and 140 people, respectively. The collected data were analyzed using Kolmogorov-Smirnov test. The paper [37] presents the results of physicochemical and rheological studies of wet foams obtained from hen egg albumin with the addition of xanthan gum and/or arabic gum using the batch method. Physicochemical analysis included determination of foam density, gas phase volume fraction, overrun, stability and distribution of gas bubbles suspended in liquid. The study of the effect of hydrocolloids on the distribution of gas bubbles was based on standard descriptive parameter estimation and the use of the nonparametric Kolmogorov-Smirnov test. The study [38] evaluated the expression of basic fibroblast growth factor and the number of osteoblasts during orthodontic tooth movement after administration of Bifidobacterium bifidum probiotic in male Wistar rats. Orthodontic tooth movement was carried out using a nickel titanium coil spring with a force of 10 g applied between the first incisor and the maxillary first molar of a Wistar rat. Samples were then removed on days 3, 7 and 14. Maxillary tissue was isolated for immunohistochemical examination and hematoxylin-eosin staining. All data were analyzed using an independent t-test (p < 0.05), which was implemented based on Kolmogorov-Smirnov test and Levene test. In the study [39], it was proposed to use a queuing network to simulate the diffusion of molecules in accordance with Fick's law. The proposed model was tested using Kolmogorov-Smirnov test to compare the results obtained from the simulation with the theoretical standard deviations obtained based on Einstein-Smoluchowski test. The article [40] develops two different approaches to simulative diagnostic procedures for models of Markov chain based on bands. The first approach uses a formal test based on Kolmogorov-Smirnov or Cramér-von Mises statistics.
The article [41] shows a study to determine the effect of consumption of roasted soyabeans and textured soy protein on the clinical and metabolic status of older women with borderline metabolic syndrome parameters. A randomized single-blinded controlled clinical trial included 75 women aged over 60 years with a diagnosis of metabolic syndrome based on ATP III. Participants were randomly assigned to three groups of 25 people who consumed for 12 weeks: 1) soyabeans; 2) textured soy protein; and 3) control diet. Fasting blood samples were taken at the beginning and end of the study to compare metabolic responses. Kolmogorov-Smirnov test, ANOVA, ANCOVA, paired t-test, and repeated measurements analysis of the generalized linear model were used to evaluate the study results. As a result of the study, it was found that nutrition and physical activity of the participants in the two groups did not differ significantly. After 12 weeks of intervention, the soyabean-treated participants showed significant reductions in total cholesterol (p < 0.001), low-density lipoprotein, and very-low-density lipoproteins. Thus, short-term consumption of roasted soyabeans and textured soy protein improves lipid profile, markers of glucose intolerance and oxidative stress, although roasted soyabeans were more effective than textured soy protein. Moderate daily intake of roasted soyabeans as a snack or textured soy protein as a food supplement for individuals with borderline metabolic syndrome parameters may be a safe and useful way to avoid disease progression. The work [42] was aimed at analyzing the consumption of sugar (sucrose) by the low-income population of Brazil. A cross-sectional descriptive study was conducted to evaluate typical customers of a popular restaurant (PR) in Brazil (Brazilian food aid program for low-income people). In the final sample, 1232 adult PR clients were interviewed.
Exclusion criteria were pregnant women, diabetics, or people on any special sucrose-restricted diet. People were enrolled at
lunchtime while they waited in line to pick up food. The invitation to participate were made to the first person in the queue, then to the 15th person, and so on until the sampling was complete. A three-day, 24-hour review was used to estimate sugar intake. Sociodemographic and anthropometric data were collected so that client profiles could be compiled. To characterize the sample, a statistical analysis of descriptive data (frequency, mean value, median, percentage and standard deviation) was carried out. Statistical normality tests (Kolmogorov-Smirnov test) were performed for all analyzes to test the assumptions of the statistical tests. The average total energy value (TEV) for the estimated three-day period was 1980.23 ± 726.75 kcal. A statistically significant difference was found between income groups (p < 0.01). The northern and northeastern regions have the lowest median income in Brazil, statistically different from the south (p < 0.01) and southeast (p < 0.01) regions. The northern region showed the lowest sugar consumption from industrial products, in contrast to the northeast (p = 0.007), southeast (p = 0.010) and south (p = 0.043) regions. The north region also has the lowest consumption of home-cooked foods among other regions (p < 0.001). Total sugar (sucrose) intake did not differ with body mass index (p = 0.321). There was no significant difference in sugar (sucrose) intake over the three days (p = 0.078). The addition of sugar (sucrose) contributed to 36.7% of all sugar (sucrose) intake, and sweetened beverages contributed to 22.53% of all sugar (sucrose) intake. Home-cooked products accounted for 20.06% of sugar (sucrose) consumption and industrial products accounted for 22.53% of sugar (sucrose) consumption. Thus, consumption of free sugar (sucrose) is still the largest contributor to total sugar (sucrose) intake, followed by sweetened beverages, especially on weekends. 
The average percentage of sugar (sucrose) intake exceeds the World Health Organization's recommendation of consuming less than 5% of total energy from sugars. Because this population group has a high percentage of overweight and obesity, sugar (sucrose) consumption may worsen health outcomes and thereby increase public health costs.
The article [43] presents a study assessing the consumption of meat and products obtained from hunting by the consumer population. To do this, a survey was conducted on the frequency of eating meat from the four most representative species in Spain, two large species: wild boar (Sus scrofa) and red deer (Cervus elaphus), as well as two small species: rabbit (Oryctolagus cuniculus) and red partridge (Alectoris rufa), as well as processed meat products (salami sausages) made from the meat of these animals. The survey was conducted on 337 habitual consumers of these products. The overall average per capita meat consumption in this population group is 6.87 kg of meat per year or 8.57 kg of meat per year if processed meat products are also considered. The consumption of rabbit, red partridge, red deer,
and wild boar was 1.85, 0.82, 2.28, and 1.92 kg per person per year, respectively. Using probabilistic methods, distributions of meat consumption frequencies were estimated for each of the studied hunted species. The distribution of consumption frequencies was statistically proven by the chi-squared test and Kolmogorov-Smirnov test.
The aim of the study [44] was to describe the nutritional value of food and non-alcoholic beverages advertised in a lineup for children compared to a general lineup on two national private free-to-air television channels in Colombia. The methods chosen were: a cross-sectional descriptive study. The recording was made in July 2012 for four days randomly selected from 6:00 am to 12:30 pm. Nutrient content has been classified according to the Food Standards Agency nutrition profile criteria for nutrients indicating risk, the Pan-American Health Organization for trans fats, and Colombian Resolution 333 dated 2011, which classifies foods as a source of protective nutrients. Descriptive statistics was used, i. e. Kolmogorov-Smirnov test to establish normality and Pearson's chi-squared test to compare variables. The p value of < 0.05 was taken into account. As a result, the following data were obtained: 1560 advertising clips were shown in 52 hours of recording, of which 23.3% (364) clips advertised food and drinks, of which 56.3% were shown in a lineup for children. In terms of nutritional value, in the lineup for children, a high percentage of foods and non-alcoholic beverages classified as "rich" in sugar, sodium, saturated fats (69.0%, 56.0%, 57.1%) was noted, compared with the general lineup. In contrast, the percentage of foods and non-alcoholic beverages classified as "rich" in total fat content was higher in the general lineup (70.4% vs. 29.6%, respectively). Thus, in the lineup for children, a large impact of food and non-alcoholic beverage advertising was observed characterized by a high content of high-risk nutrients and a low content of foods.
Thus, the possibilities of nonparametric statistics are shown in the analysis of seemingly incomparable results.
Conclusion
The second part discusses nonparametric tests for testing hypotheses of distribution type and nonparametric tests for testing hypotheses of sampling homogeneity. Pearson's chi-squared test, Kolmogorov-Smirnov test and Kolmogorov test were reviewed. Using examples, the use of tests was discussed, and their capabilities and limitations were evaluated. Based on the literature review, brief descriptions of studies in which methods of nonparametric statistics have been successfully applied are given. These tests may be used when comparing descriptive characteristics, which allows statistical processing of the results, for example, tasting evaluation of the product or morphological analysis of the section. Nonparametric methods also allow comparison of groups with unequal numbers of parameters.
REFERENCES
1. Prokhorov, Yu.V. (2013). Nonparametric methods: Encyclopedia. Moscow: The Great Russian Encyclopedia. 2013. (In Russian)
2. Tomkins-Lane, C.C. (2006). An introduction to non-parametric statistics for health scientists. University of Alberta Health Sciences Journal, 3(1), 20-26.
3. Pett, M.A. (1997). Nonparametric statistics for health care research. London, Thousand Oaks, New Delhi: SAGE Publications, 1997.
4. Orlov, A.I. (2015). Current status of nonparametric statistics. Polythematic Online Scientific Journal of Kuban State Agrarian University, 106, 239-269. (In Russian)
5. Nikitina, M.A., Chernukha, I.M. (2021). Methods for nonparametric statistics in scientific research. Overview. Part 1. Theory and Practice of Meat Processing, 6(2), 151-162. https://doi.org/10.21323/2414-438X-2021-6-2-151-162
6. Kobzar, A.I. (2006). Applied mathematical statistics. Moscow: Fizmatlit, 2006. (In Russian)
7. Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302), 157-175. https://doi.org/10.1080/14786440009463897
8. Chernoff, H., Lehmann, E. L. (1954). The Use of maximum likelihood estimates in tests for goodness of fit. The Annals of Mathematical Statistics, 25(3), 579-586. https://doi.org/10.1214/aoms/1177728726
9. Gubler, E.V., Genkin, A.A. (1973). Application of nonparametric statistical criteria in biomedical research. Leningrad: Medicine, 1973. (In Russian)
10. Rosenbaum, S. (1954). Tables for a nonparametric test of location. Annals of Mathematical Statistics, 25(1), 146-150. https://doi.org/10.1214/aoms/1177728854
11. Stepanov, V.G. (2019). Application of nonparametric statistical methods in agricultural biology and veterinary medicine research. St-Petersburg: Lan. 2019. (In Russian)
12. Edelbaeva, N.A., Lebedinskaya, O.G., Kovanova, E.S., Tenetova, E.P., Timofeev, A.G. (2019). Fundamentals of nonparametric statistics. Moscow: YUNITY-DANA. 2019. (In Russian)
13. Krupin, V.G., Pavlov, A.L., Popov, L.G. (2013). Higher mathematics. Probability theory, mathematical statistics, random processes. Moscow: Publishing House of the Moscow Power Engineering Institute, 2013. (In Russian)
14. Gmurman, V.E. (2004). A guide to solving problems in probability theory and mathematical statistics. Moscow: High School, 2004. (In Russian)
15. Petroczi, A., Naughton, D.P. (2009). Mercury, cadmium and lead contamination in seafood: A comparative study to evaluate the usefulness of Target Hazard Quotients. Food and Chemical Toxicology, 47(2), 298-302. https://doi.org/10.1016/j.fct.2008.11.007
16. Sinclair, J., Dillon, S., Bottoms, L. (2022). Perceptions, beliefs and behaviors of nutritional and supplementary practices in inflammatory bowel disease. Sport Sciences for Health. https://doi.org/10.1007/s11332-022-00901-8 (unpublished data)
17. Lee, J.H., Singh, R.K. (1993). Residence time distribution characteristics of particle flow in a vertical scraped surface heat exchanger. Journal of Food Engineering, 18(4), 413-424. https://doi.org/10.1016/0260-8774(93)90055-O
18. Skorepa, L., Picha, K. (2016). Factors of purchase of bread — prospect to regain the market share? Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis, 64(3), 1067-1072. https://doi.org/10.11118/actaun201664031067
19. Lebedeva, E.V., Schastnyy, E.D., Nonka, T.G., Repin, A.N. (2021). Gender differences in self-reported social functioning of patients with chronic coronary artery disease and affective disorders. Byulleten Sibirskoy Meditsiny, 20(1), 75-82. https://doi.org/10.20538/1682-0363-2021-1-75-82
20. Romashkina, G.F., Khuziakhmetov, R.R. (2020). The risks of internet addiction: structure and characteristics of perception. Education and Science Journal, 22(8), 108-134. https://doi.org/10.17853/1994-5639-2020-8-108-134 (In Russian)
21. Snijder, R., Bosman, B., Stroosma, O., Agema, M. (2020). Relationship between mean volume voided and incontinence in children with overactive bladder treated with solifenacin: post hoc analysis of a phase 3 randomised clinical trial. European Journal of Pediatrics, 179(10), 1523-1528. https://doi.org/10.1007/s00431-020-03635-2
22. Malhotra, D., Nour, N., El Halik, M., Zidan, M. (2020). Performance and analysis of pediatric index of mortality 3 score in a Pediatric ICU in Latifa Hospital, Dubai, UAE. Dubai Medical Journal, 3(1), 19-25. https://doi.org/10.1159/000505205
23. Wong, M.Y., Yang, Y.S., Cao, Z.Q., Guo, V.Y.W., Lam, C.L.K., Wong, C.K.H. (2018). Effects of health-related quality of life on health service utilisation in patients with colorectal neoplasms. European Journal of Cancer Care, 27(6), Article e12926. https://doi.org/10.1111/ecc.12926
24. Calderon-Gerstein, W., Bruno-Huaman, A., Damian-Mucha, M., Huayllani-Flores, L. (2021). Predictive capacity of the brain natriuretic peptide in the screening of heart failure in a high altitude population. Respiratory Physiology and Neurobiology, 289, Article 103654. https://doi.org/10.1016/j.resp.2021.103654
25. Ray, S., Bhattacharyya, B. (2020). Statistical modeling and forecasting of arima and arimax models for food grains production and net availability of India. Journal of Experimental Biology and Agricultural Sciences, 8(3), 296-309. https://doi.org/10.18006/2020.8(3).296.309
26. Kadekova, Z., Recky, R., Nagyova, L., Kosiciarova, I., Holiencinova, M. (2017). Consumers' purchasing preferences towards organic food in Slovakia. Potravinarstvo Slovak Journal of Food Sciences, 11(1), 731-738. https://doi.org/10.5219/846
27. Bahk, G.-J. (2010). Statistical probability analysis of storage temperatures of domestic refrigerator as a risk factor of food-borne illness outbreak. Korean Journal of Food Science and Technology, 42(3), 373-376.
28. Hu, M., Ji, X., Liu, G.M. (2021). On the strong Markov property for stochastic differential equations driven by G-Brownian motion. Stochastic Processes and their Applications, 131, 417-453. https://doi.org/10.1016/j.spa.2020.09.015
29. Poroshin, V., Shlishevsky, A., Tsybulya, K. (11-15 September 2017). Development of a model of a homogeneous continuous medium based on the material with defects in the form of hollows. International conference on modern trends in manufacturing technologies and equipment (ICMTMTE2017), 129, Article 02017. Sevastopol, Russia. https://doi.org/10.1051/matecconf/201712902017
30. Lupandin, V.I. (2009). Mathematical methods in psychology. Ekaterinburg: Ural University Publishing House, 2009. (In Russian)
31. Badshah, Y., Shabbir, M., Hayat, H., Fatima, Z., Burki, A., Khan, S. et al. (2021). Genetic markers of osteoarthritis: early diagnosis in susceptible Pakistani population. Journal of Orthopaedic Surgery and Research, 16(1), Article 124. https://doi.org/10.1186/s13018-021-02230-x
32. Alves-Ferreira, M., Quintas, M., Sequeiros, J., Sousa, A., Pereira-Monteiro, J., Alonso, I. et al. (2021). A genetic interaction of NRXN2 with GABRE, SYT1 and CASK in migraine patients: a case-control study. Journal of Headache and Pain, 22(1), Article 57. https://doi.org/10.1186/s10194-021-01266-y
33. Do Nascimento, I.D.S., Pastor, A.F., Lopes, T.R.R., Farias, P.C.S., Goncales, J.P., Do Carmo, R.F. et al. (2020). Retrospective cross-sectional observational study on the epidemiological profile of dengue cases in Pernambuco state, Brazil, between 2015 and 2017. BMC Public Health, 20(1), Article 923. https://doi.org/10.1186/s12889-020-09047-z
34. Plichta, M., Jezewska-Zychowicz, M. (2019). Eating behaviors, attitudes toward health and eating, and symptoms of orthorexia nervosa among students. Appetite, 137, 114-123. https://doi.org/10.1016/j.appet.2019.02.022
35. Zhu, W.Y., Li, W., Geng, Q., Wang, X.Y., Sun, W., Jiang, H. et al. (2018). Silence of stomatin-like protein 2 represses migration and invasion ability of human liver cancer cells via inhibiting the nuclear factor kappa B (NF-kB) pathway. Medical Science Monitor, 24, 7625-7632. https://doi.org/10.12659/MSM.909156
36. Espinoza-Quispe, C.E., Marrero-Saucedo, F.M., Benavides, R.A.H. (2021). Solid Waste Management in the County of Huancavelica, Peru. Letras Verdes, 28, 163-177. https://doi.org/10.17141/letrasverdes.28.2020.4269
37. Kruk, J., Ptaszek, P., Kaczmarczyk, K. (2021). Technological aspects of xanthan gum and gum Arabic presence in chicken egg albumin wet foams: Application of nonlinear rheology and non-parametric statistics. Food Hydrocolloids, 117, Article 106683. https://doi.org/10.1016/j.foodhyd.2021.106683
38. Triwardhani, A., Anggitia, C., Ardani, I.G.A.W., Nugraha, A.P., Riawan, W. (2021). The increased basic fibroblast growth factor expression and osteoblasts number post Bifidobacterium bifidum probiotic supplementation during orthodontic tooth movement in Wistar rats. Journal of Pharmacy and Pharmacognosy Research, 9(4), 446-453.
39. Honary, V., Nitz, M., Wysocki, B.J., Wysocki, T.A. (2019). Modeling 3-D diffusion using queueing networks. Biosystems, 179, 17-23. https://doi.org/10.1016/j.biosystems.2018.12.006
40. Huang, X.-W., Emura, T. (2019). Model diagnostic procedures for copula-based Markov chain models for statistical process control. Communications in Statistics: Simulation and Computation, 50(8), 2345-2367. https://doi.org/10.1080/03610918.2019.1602647
41. Bakhtiari, A., Hajian-Tilaki, K., Omidvar, S., Nasiri-Amiri, F. (2019). Clinical and metabolic response to soy administration in older women with metabolic syndrome: A randomized controlled trial. Diabetology and Metabolic Syndrome, 11, Article 47. https://doi.org/10.1186/s13098-019-0441-y
42. Botelho, R.B.A., De Cassia Akutsu, R., Zandonadi, R.P. (2019). Low-income population sugar (Sucrose) intake: A cross-sectional study among adults assisted by a Brazilian food assistance program. Nutrients, 11(4), Article 798. https://doi.org/10.3390/nu11040798
43. Sevillano Morales, J., Moreno-Ortega, A., Amaro Lopez, M.A., Arenas Casas, A., Cámara-Martos, F., Moreno-Rojas, R. (2018). Game meat consumption by hunters and their relatives: a probabilistic approach. Food Additives and Contaminants — Part A Chemistry, Analysis, Control, Exposure and Risk Assessment, 35(9), 1739-1748. https://doi.org/10.1080/19440049.2018.1488183
44. Mejía-Díaz, D.M., Carmona-Garcés, I.C., Giraldo-López, P.A., González-Zapata, L. (2014). Nutritional content of food, and nonalcoholic beverages advertisements broadcasted in children's slot of Colombian national television. Nutricion Hospitalaria, 29(4), 858-864. https://doi.org/10.3305/nh.2014.29.4.7214
AUTHOR INFORMATION
Marina A. Nikitina, Candidate of technical sciences, docent, leading scientific worker, the Head of the Direction of Information Technologies of the Center of Economic and Analytical Research and Information Technologies, V. M. Gorbatov Federal Research Center for Food Systems. 26, Talalikhina str., 109316, Moscow, Russia. Tel: +7-495-676-95-11 extension 297, E-mail: [email protected] ORCID: https://orcid.org/0000-0002-8313-4105 * corresponding author
Irina M. Chernukha, Doctor of technical sciences, professor, Academician of the Russian Academy of Sciences, Head of the Department for Coordination of Initiative and International Projects, V. M. Gorbatov Federal Research Center for Food Systems. 26, Talalikhina str., 109316, Moscow, Russia. Tel: +7-495-676-95-11 extension 109, E-mail: [email protected] ORCID: https://orcid.org/0000-0003-4298-0927
All authors bear responsibility for the work and presented data.
All authors made an equal contribution to the work.
The authors were equally involved in writing the manuscript and bear equal responsibility for plagiarism. The authors declare no conflict of interest.