
2018, vol. 160, no. 2, pp. 309-316

УЧЕНЫЕ ЗАПИСКИ КАЗАНСКОГО УНИВЕРСИТЕТА. СЕРИЯ ФИЗИКО-МАТЕМАТИЧЕСКИЕ НАУКИ

ISSN 2541-7746 (Print) ISSN 2500-2198 (Online)

UDC 519.226.3

LOWER BOUNDS FOR THE EXPECTED SAMPLE SIZE IN THE CLASSICAL AND d-POSTERIOR STATISTICAL PROBLEMS

I.A. Kareev, I.N. Volodin

Kazan Federal University, Kazan, 420008 Russia

Abstract

In this report, the problem of constructing lower boundaries for the expected sample size of statistical inference procedures is considered. The general methodology for the construction of the lower bounds and a review of the main results for the classical statistical problems are presented, along with an analysis of new and earlier results on the adaptation of the technique to the d-posterior approach. In particular, the hypothesis testing problem is considered.

Keywords: expected sample size, lower bounds, efficiency, d-posterior approach, Bayesian paradigm, hypothesis testing

Introduction

In mathematical statistics, there are several inequalities that determine lower boundaries for various components of the risk functions of estimation and hypothesis testing procedures. The Cramér-Rao inequality is the most famous one. It gives a lower boundary for the variance of an estimator based on a sample of fixed size when the distribution of the observed random variable satisfies certain regularity conditions. Various generalizations and modifications of this inequality were developed by A. Bhattacharyya, L. Bolshev, E. Barankin, J. Chipman, H. Robbins, et al. J. Wolfowitz generalized the Cramér-Rao inequality to sequential sampling.

Another well-known inequality was introduced by A. Wald. He established a lower boundary for the expected sample size of any sequential procedure for distinguishing between two simple hypotheses with given limits on the probabilities of type-I and type-II errors. W. Hoeffding and G. Simons generalized this inequality to the case of distinguishing between more than two hypotheses (see [1]). Later, in the 1960s-1980s, I. Volodin [2-12], as well as some other authors, established several analogous inequalities for the expected total sample size in problems of hypothesis testing, classification, selection, etc. The essential similarity of all these inequalities is that they are simple implications of a single important property of the Kullback-Leibler divergence: the information contained in a statistic does not exceed the information contained in the sample.

Several uses can be distinguished for such lower boundaries:

1) they can be used as a robust criterion of sample size insufficiency — if the expected sample size is less than the lower boundary, then there is no appropriate procedure for solving the statistical problem with the given limits on the risks;

2) they can be used to measure the efficiency of existing procedures by comparing their needed sample size to some theoretical optimal one;

3) they can serve as another measure of the difficulty of a problem.

This paper provides an overview of the lower boundaries obtained for the expected sample size in many classical problems of mathematical statistics (Section 1) and presents new and earlier results on the adaptation of the lower bound construction methods to Bayesian problems, namely to hypothesis testing in the d-posterior approach (Section 2).

1. Volodin's lower bounds in the general form and their applications for the classical statistical problems

In his earliest investigations, I.N. Volodin introduced a general method for the construction of lower boundaries for the expected sample size of statistical inference procedures [5]. The method makes it possible to obtain closed-form lower boundaries for a wide range of statistical problems. Here, we present Malyutov's modification [10] of that method, which gives more precise lower bounds in problems where several independent populations are involved.

1.1. The lower bound in the general form (see [5, 10]). Let us denote the Kullback-Leibler divergence by

$$ \mathrm{KL}(F_1, F_2) = \int \ln\frac{dF_1}{dF_2}\, dF_1 $$

for some distributions $F_1$ and $F_2$. When $F_1 = F(\theta)$, $F_2 = F(\vartheta)$ (i.e., they coincide up to the value of the parameter), we write $\mathrm{KL}(\theta, \vartheta; F) = \mathrm{KL}(F(\theta), F(\vartheta))$.
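As a small numeric illustration (ours, not the paper's), the divergence has a closed form for normal families; the sketch below assumes observations from $N(\mu, \sigma^2)$.

```python
import math

def kl_normal(mu1, sigma1, mu2, sigma2):
    """KL(F1, F2) = E_F1[ln(dF1/dF2)] for F1 = N(mu1, sigma1^2), F2 = N(mu2, sigma2^2)."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

def kl_location(theta, vartheta, sigma):
    """KL(theta, vartheta; F) for the location family F(theta) = N(theta, sigma^2);
    reduces to (theta - vartheta)^2 / (2 sigma^2)."""
    return kl_normal(theta, sigma, vartheta, sigma)
```

For a pure location family the divergence is symmetric and quadratic in the parameter difference, which is why normal-mean problems give the cleanest forms of the bounds below.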

Let us consider the general problem of statistical inference where we observe $m$ populations $X_1, \dots, X_m$ independently. The lower boundary for the expected total sample size $\nu = \nu_1 + \dots + \nu_m$ is given by the inequality

$$ \mathrm{E}(\nu \mid \theta) \;\ge\; \inf_{\pi}\, \sup_{\vartheta}\; \frac{\mathrm{KL}(\theta, \vartheta; \delta_\varphi)}{\sum_{i=1}^{m} \pi_i\, \mathrm{KL}(\theta, \vartheta; X_i)} \qquad \forall\, \theta \in \Omega, \tag{1} $$

where $\delta_\varphi$ is a random variable denoting the decision made by the procedure $\varphi$ after the experiment is over, and $\pi_i(\theta) = \mathrm{E}\bigl(\nu_i \big/ \sum_{j=1}^{m} \nu_j\bigr)$ is the expected ratio of observations which $\varphi$ takes from the $i$-th population.

When $\operatorname{dom} \delta_\varphi \in \{d_0, d_1\}$ (a bivalued random variable), then

$$ \mathrm{KL}(\theta, \vartheta; \delta_\varphi) = w\bigl(\psi(d_0 \mid \theta),\, 1 - \psi(d_0 \mid \vartheta)\bigr), $$

where

$$ w(x, y) = x \ln\frac{x}{1 - y} + (1 - x) \ln\frac{1 - x}{y}, \qquad \psi(d \mid \theta) = \mathrm{P}(\delta_\varphi = d \mid \theta). $$
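To make these quantities concrete, here is a short numeric sketch (our illustration, not from the paper): $w(x, y)$ and the resulting Wald-type bound $\mathrm{E}(\nu \mid H_0) \ge w(\alpha_0, \alpha_1)/\mathrm{KL}(F_0, F_1)$ for two simple hypotheses.

```python
import math

def w(x, y):
    """w(x, y) = x ln(x/(1-y)) + (1-x) ln((1-x)/y), for 0 < x, y < 1."""
    return x * math.log(x / (1 - y)) + (1 - x) * math.log((1 - x) / y)

def wald_lower_bound(alpha0, alpha1, kl01):
    """Wald-type lower bound on E(nu | H0) for two simple hypotheses with
    error probabilities alpha0, alpha1 and divergence kl01 = KL(F0, F1)."""
    return w(alpha0, alpha1) / kl01
```

For example, with $\alpha_0 = \alpha_1 = 0.05$ and normal means half a standard deviation apart ($\mathrm{KL} = 0.125$), the bound exceeds 20 observations.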

1.2. Multiple simple hypotheses testing (see [2]). Consider the problem of distinguishing between $m \ge 3$ simple hypotheses

$$ H_i\colon \theta = \theta_i, \qquad i = 1, \dots, m, $$

about the distribution of a population $X \sim F(\theta)$, $\theta \in \Omega \subset \mathbb{R}$. For this problem, inequality (1) gives us

$$ \mathrm{E}(\nu \mid \theta_i) \;\ge\; \max_{j \ne i}\; \sum_{k=1}^{m} \alpha_{ik} \ln(\alpha_{ik}/\alpha_{jk}) \Big/ \mathrm{KL}(\theta_i, \theta_j; X), \qquad i = 1, \dots, m, $$

where $\|\alpha_{ij}\| = \|\psi(d_j \mid \theta_i)\|$ is the matrix of values of the operating characteristic (the strength of the procedure $\varphi$).
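A hedged numeric sketch of this bound (our illustration with hypothetical numbers): given an operating-characteristic matrix and the pairwise divergences, compute the right-hand side for every $i$.

```python
import math

def kl_rows(row_i, row_j):
    """sum_k alpha_ik ln(alpha_ik / alpha_jk) between two rows of the OC matrix."""
    return sum(p * math.log(p / q) for p, q in zip(row_i, row_j) if p > 0)

def multi_hypothesis_bounds(alpha, kl):
    """Lower bounds on E(nu | theta_i): max over j != i of
    kl_rows(alpha[i], alpha[j]) / kl(i, j)."""
    m = len(alpha)
    return [max(kl_rows(alpha[i], alpha[j]) / kl(i, j) for j in range(m) if j != i)
            for i in range(m)]

# Example: three hypotheses, each accepted correctly with probability 0.9,
# and a constant pairwise divergence of 0.5 (all numbers hypothetical).
oc = [[0.90, 0.05, 0.05],
      [0.05, 0.90, 0.05],
      [0.05, 0.05, 0.90]]
bounds = multi_hypothesis_bounds(oc, lambda i, j: 0.5)
```

In this symmetric example all three bounds coincide, as one would expect.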

1.3. Goodness-of-fit test (see [6]). Let $\mathfrak{F}$ be a family of mutually absolutely continuous distributions $F$ on the same measurable space. We consider the problem of testing the null hypothesis (for some $\Delta > 0$)

$$ H_0\colon F = F_0 \quad \text{against} \quad H_1\colon \sup_{A \in \mathcal{A}} |F(A) - F_0(A)| \ge \Delta $$

about the distribution $F$ of a population $X$ under given limits $\alpha_0, \alpha_1$ on the probabilities of type-I and type-II errors. Here and elsewhere, $\mathcal{A}$ is the $\sigma$-algebra of the problem's measurable space $(X, \mathcal{A})$.

For this problem, we obtained the following lower boundaries on the expected sample size. When $H_1$ is true:

$$ \mathrm{E}(\nu \mid F \in H_1) \;\ge\; \frac{w(\alpha_1, \alpha_0)}{\mathrm{KL}(F, F_0)}. $$

When $H_0$ is true, the lower bound is

$$ \mathrm{E}(\nu \mid F = F_0) \;\ge\; \frac{w(\alpha_0, \alpha_1)}{\mathrm{KL}(F_0, F)} \;\ge\; \frac{w(\alpha_0, \alpha_1)}{-h(1/2 - 2\Delta/3) - C\Delta^8}, $$

where

$$ h(p) = p \ln\frac{p + \Delta}{p} + (1 - p) \ln\frac{1 - p - \Delta}{1 - p} $$

and

$$ 0 \le C \le \frac{1024}{3645}\left(1 + \frac{8\Delta^6}{91}\right). $$

1.4. Homogeneity test (see [6]). Let $\mathfrak{F}$ be a family of mutually absolutely continuous distributions on the same measurable space. Let $X_1 \sim F_1$ and $X_2 \sim F_2$ be the populations, which can be observed in an arbitrary way, so $\nu = \nu_1 + \nu_2$. We consider the problem of testing the null hypothesis (for some $\Delta > 0$)

$$ H_0\colon F_1 = F_2 \quad \text{against} \quad H_1\colon \sup_{A \in \mathcal{A}} |F_1(A) - F_2(A)| \ge \Delta $$

subject to the limits $\alpha_0, \alpha_1$ on the probabilities of type-I and type-II errors.

For this problem, there is the following adaptation of the general lower boundary (1). When $H_0$ is true:

$$ \mathrm{E}(\nu \mid H_0) \;\ge\; \frac{w(\alpha_0, \alpha_1)}{\ln(1 - \Delta^2) + \Delta \ln\dfrac{1 + \Delta}{1 - \Delta}}. \tag{2} $$

When $H_1$ is true:

$$ \mathrm{E}(\nu \mid H_1) \;\ge\; \frac{w(\alpha_1, \alpha_0)}{\ln(1 - \Delta^2) + \Delta \ln\dfrac{1 + \Delta}{1 - \Delta}}. \tag{3} $$
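As a quick numeric illustration (entirely ours; it takes the denominator $\ln(1 - \Delta^2) + \Delta \ln\frac{1+\Delta}{1-\Delta}$ appearing in (3) as given), the required sample size grows rapidly as the separation $\Delta$ shrinks:

```python
import math

def w(x, y):
    return x * math.log(x / (1 - y)) + (1 - x) * math.log((1 - x) / y)

def homogeneity_bound(alpha1, alpha0, delta):
    """Evaluates w(alpha1, alpha0) over the denominator of (3);
    the denominator is positive for 0 < delta < 1."""
    denom = math.log(1 - delta ** 2) + delta * math.log((1 + delta) / (1 - delta))
    return w(alpha1, alpha0) / denom
```

Note that $\ln(1 - \Delta^2) \approx -\Delta^2$ while $\Delta \ln\frac{1+\Delta}{1-\Delta} \approx 2\Delta^2$ for small $\Delta$, so the bound behaves like $w(\alpha_1, \alpha_0)/\Delta^2$.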

1.5. Test for invariance to a group of transformations (see [7]). Let $G$ be a group of transformations and consider two sets of distributions:

$$ \mathfrak{F}_0 = \{ F \colon F(A) = F(gA) \ \forall A \in \mathcal{A},\ \forall g \in G \}, $$
$$ \mathfrak{F}_1 = \Bigl\{ F \colon \exists g \in G,\ \exists \mathcal{A}_0 \subset \mathcal{A}\colon \sup_{A \in \mathcal{A}_0} |F(A) - F(gA)| \ge \Delta \Bigr\}, \qquad \Delta > 0. $$

The problem of invariance to a group of transformations consists in testing the null hypothesis

$$ H_0\colon F \in \mathfrak{F}_0 \quad \text{against} \quad H_1\colon F \in \mathfrak{F}_1 $$

about the distribution $F$ of the population, subject to the limits $\alpha_0, \alpha_1$ on the risk.

Surprisingly, the adaptation of the general lower boundary (1) yields the same form as the lower boundaries (2), (3) for the homogeneity testing problem.

1.6. Selection problem (see [13-15]). Let us have $m \ge 3$ populations $X_i \sim N(\theta_i, \sigma^2)$, $i = 1, \dots, m$, with the same known $\sigma^2$. The problem is to select the population with the highest value of $\theta$, i.e., to select one of the hypotheses

$$ H_i\colon \theta_i \ge \theta_j + \Delta \quad \forall j \ne i, \qquad \Delta > 0, $$

subject to the limit $\alpha$ on the probability of a wrong decision. The populations may be observed in an arbitrary way, so the total sample size is $\nu = \sum_{i=1}^{m} \nu_i$.

The lower boundary for the least favorable case:

$$ \sup_{\theta_1, \dots, \theta_m} \mathrm{E}(\nu \mid \theta_1, \dots, \theta_m) \;\ge\; \frac{(\sqrt{m - 1} + 1)^2}{2\Delta^2}\, \sigma^2 w(\alpha, \alpha). $$
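A numeric sketch of this least-favorable-case bound (our illustration, reusing $w$ from Section 1.1; note $w(\alpha, \alpha) = (1 - 2\alpha)\ln\frac{1-\alpha}{\alpha}$):

```python
import math

def w(x, y):
    return x * math.log(x / (1 - y)) + (1 - x) * math.log((1 - x) / y)

def selection_bound(m, delta, sigma, alpha):
    """sup E(nu) >= (sqrt(m-1) + 1)^2 / (2 delta^2) * sigma^2 * w(alpha, alpha)."""
    return (math.sqrt(m - 1) + 1) ** 2 / (2 * delta ** 2) * sigma ** 2 * w(alpha, alpha)
```

The bound grows with the number of populations $m$ and, as usual, like $\Delta^{-2}$ in the separation.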

1.7. Ranking problem (see [14, 16]). Let us have $m \ge 3$ populations $X_i \sim N(\theta_i, \sigma^2)$, $i = 1, \dots, m$, with the same known $\sigma^2$. It is known that $|\theta_i - \theta_j| \ge \Delta$, $\Delta > 0$. The problem is to place the populations in ascending order of the values $\theta_i$, subject to the limit $\alpha$ on the probability of a wrong ordering. The populations can be observed in an arbitrary way, so the total sample size is $\nu = \sum_{i=1}^{m} \nu_i$.

The lower bound for the least favorable case has the form

$$ \sup_{\theta_1, \dots, \theta_m} \mathrm{E}(\nu \mid \theta_1, \dots, \theta_m) \;\ge\; \frac{c_m}{\Delta^2}\, \sigma^2 w(\alpha, \alpha), $$

where $c_m$ is a constant depending only on $m$ (see [14, 16] for its exact value).

2. The lower bounds for hypotheses testing in the d-posterior approach

We consider the following Bayesian problem. Let $X \sim F(\theta)$, where $\theta \in \Theta \subset \mathbb{R}$ is the unknown random parameter of interest, and let $\theta \sim G$. The problem is to distinguish between the hypotheses

$$ H_0\colon \theta \in \Theta_0, \qquad H_1\colon \theta \in \Theta_1 $$

based on the observations from $X$, where $\Theta_0 \cup \Theta_1 = \Theta$. Let us put $\Theta_0 = (-\infty, 0]$, $\Theta_1 = (0, \infty)$.

Let $\operatorname{dom} \delta_\varphi \in \{d_0, d_1\}$, where $d_0$ denotes the selection of $H_0$ by a procedure $\varphi$ after an experiment, and $d_1$ the selection of $H_1$. In the d-posterior approach, the type-I (on the left) and type-II (on the right) d-risks are considered:

$$ \mathrm{P}(\theta \le 0 \mid \delta = d_1), \qquad \mathrm{P}(\theta > 0 \mid \delta = d_0). $$

The type-I d-risk is the probability that $H_0$ is correct among all experiments in which the procedure $\varphi$ selected $H_1$; conversely, the type-II d-risk is the probability that $H_1$ is correct among all experiments in which $H_0$ was selected.

For the considered hypothesis testing problem, we impose the constraints that the type-I and type-II d-risks must not exceed the prescribed limits $\beta_0$ and $\beta_1$:


$$ \mathrm{P}(\theta \le 0 \mid \delta = d_1) \le \beta_0, \qquad \mathrm{P}(\theta > 0 \mid \delta = d_0) \le \beta_1. \tag{4} $$
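The conditional nature of the d-risks is easy to see in a quick simulation (entirely our illustration: a hypothetical $N(0,1)$ prior, a single $N(\theta, 1)$ observation, and the naive rule "select $H_1$ iff $x > c$"):

```python
import random

def simulate_d_risks(c=0.0, n=200_000, seed=1):
    """Monte Carlo estimates of the d-risks
    P(theta <= 0 | delta = d1) and P(theta > 0 | delta = d0)
    for the rule: select H1 iff x > c."""
    rng = random.Random(seed)
    n1 = n0 = err1 = err0 = 0
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)   # theta ~ G = N(0, 1)
        x = rng.gauss(theta, 1.0)     # one observation X ~ N(theta, 1)
        if x > c:                     # H1 selected ...
            n1 += 1
            err1 += theta <= 0        # ... while H0 was actually true
        else:                         # H0 selected ...
            n0 += 1
            err0 += theta > 0         # ... while H1 was actually true
    return err1 / n1, err0 / n0
```

For $c = 0$ this toy setting is symmetric, and both d-risks equal $\mathrm{P}(\theta \le 0 \mid X > 0) = 1/4$, which the simulation reproduces.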

We suggest the following lower boundary for the expected sample size when $H_0$ is true:

$$ \frac{1}{G_0} \int_{\theta \in \Theta_0} \mathrm{E}_\theta \nu \, dG(\theta) \;\ge\; L, $$

where

$$ L = \inf_{\varphi} \frac{1}{G_0 G_1} \int_{\theta \in \Theta_0} \int_{\vartheta \in \Theta_1} \frac{\mathrm{KL}(\theta, \vartheta; \delta_\varphi)}{\mathrm{KL}(\theta, \vartheta; X)} \, dG(\theta)\, dG(\vartheta), \tag{5} $$

$G_k = G(\Theta_k)$, and the inf is taken over procedures $\varphi$ subject to the restrictions (4) on the d-risks. Note that

$$ \mathrm{KL}(\theta, \vartheta; \delta) = w\bigl(\psi(\theta), 1 - \psi(\vartheta)\bigr) = \psi(\theta) \ln\frac{\psi(\theta)}{\psi(\vartheta)} + (1 - \psi(\theta)) \ln\frac{1 - \psi(\theta)}{1 - \psi(\vartheta)}, $$

where $\psi(\theta) = \mathrm{P}(\delta = d_0 \mid \theta)$. Thus, the lower bound depends on the procedure $\varphi$ only through its operating characteristic, and we can take the inf over $\psi$ rather than over $\varphi$.

Lemma 1. The constraints on the d-risks

$$ \mathrm{P}(\theta \le 0 \mid \delta = d_1) = \beta_0, \qquad \mathrm{P}(\theta > 0 \mid \delta = d_0) = \beta_1 $$

are equivalent to the following constraints on the "Bayesian" risks:

$$ \int_{\Theta_0} \psi(\theta)\, dG(\theta) = a_0, \qquad \int_{\Theta_1} \psi(\theta)\, dG(\theta) = a_1, \tag{6} $$

where

$$ a_0 = \frac{(1 - \beta_1)(G_0 - \beta_0)}{1 - \beta_0 - \beta_1}, \qquad a_1 = \frac{\beta_1 (G_0 - \beta_0)}{1 - \beta_0 - \beta_1}. \tag{7} $$

Proof. We put

$$ \psi_k = \int_{\Theta_k} \psi(\theta)\, dG(\theta), \qquad k = 0, 1. $$

The equations on the d-risks can be rewritten in terms of $\psi_k$ as

$$ \frac{G_0 - \psi_0}{1 - \psi_0 - \psi_1} = \beta_0, \qquad \frac{\psi_1}{\psi_0 + \psi_1} = \beta_1, $$

from which we easily obtain the statement of the lemma. □
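The algebra of Lemma 1 can be checked numerically (our sketch; $\psi_k$ denotes the mass of $d_0$-decisions accumulated on $\Theta_k$):

```python
def a_from_beta(beta0, beta1, g0):
    """Eq. (7): the 'Bayesian' risk levels (a0, a1) induced by d-risk
    levels (beta0, beta1), where g0 = G(Theta_0)."""
    s = (g0 - beta0) / (1 - beta0 - beta1)   # s = psi0 + psi1
    return (1 - beta1) * s, beta1 * s

def d_risks(psi0, psi1, g0):
    """d-risks expressed through psi0, psi1, as in the proof of Lemma 1."""
    type1 = (g0 - psi0) / (1 - psi0 - psi1)  # P(theta <= 0 | delta = d1)
    type2 = psi1 / (psi0 + psi1)             # P(theta >  0 | delta = d0)
    return type1, type2
```

A round trip through (7) and back recovers the original d-risk levels exactly.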

By swapping the order of inf and the integration, we obtain the basic estimate for the lower boundary.

Theorem 1. Let us suppose that the inf in (5) is attained at a procedure with a non-increasing operating characteristic $\psi(\theta)$. Then

$$ L \;\ge\; \frac{1}{G_0 G_1} \int_{\theta \in \Theta_0} dG(\theta) \int_{\vartheta \in \Theta_1} dG(\vartheta)\, \frac{w\bigl(h_0(\theta), 1 - h_1(\vartheta)\bigr)}{\mathrm{KL}(\theta, \vartheta; X)}, $$

where

$$ h_0(\theta) = \frac{a_0}{G_0}, \qquad h_1(\vartheta) = \frac{a_1}{G_1}, $$

and $a_0$, $a_1$ are as in (7):

$$ a_0 = \frac{(1 - \beta_1)(G_0 - \beta_0)}{1 - \beta_0 - \beta_1}, \qquad a_1 = \frac{\beta_1 (G_0 - \beta_0)}{1 - \beta_0 - \beta_1}. $$

Proof. Let us consider the set $H$ of all non-increasing functions $h(\theta)$ such that $h(-\infty) = 1$, $h(\infty) = 0$. Then

$$ L \;\ge\; \frac{1}{G_0 G_1} \int_{\theta \in \Theta_0} dG(\theta) \int_{\vartheta \in \Theta_1} dG(\vartheta)\, \frac{\inf_{h \in H} w\bigl(h(\theta), 1 - h(\vartheta)\bigr)}{\mathrm{KL}(\theta, \vartheta; X)}. $$

For fixed $\theta \in \Theta_0$, $\vartheta \in \Theta_1$, the minimum of $w(h(\theta), 1 - h(\vartheta))$ over $h$ subject to the constraints (6) is attained at a step function of the form

$$ h(x) = \begin{cases} 1, & x \le \theta; \\ y_0, & \theta < x \le 0; \\ y_1, & 0 < x \le \vartheta; \\ 0, & \vartheta < x. \end{cases} $$

Now, minimizing the expression over $y_0$, $y_1$ gives the statement of the theorem. □

Conclusions

In this paper, the results of studies on the construction of lower bounds for the expected sample size of statistical inference procedures have been presented. As the review shows, the construction of lower bounds for the classical statistical problems is well developed.

On the other hand, little has been done for the Bayesian paradigm. In this paper, we have presented some new basic results on the adaptation of the lower bound construction technique to the d-posterior approach. As the study shows, the construction of lower boundaries in this setting can involve solving integral minimization problems. Apparently, one of the best approaches is to apply methods of the calculus of variations; another is to introduce additional assumptions and simplifications.

Acknowledgements. This work was funded by the subsidy allocated to Kazan Federal University for the state assignment in the sphere of scientific activities (project no. 1.7629.2017/8.9).

The work is performed according to the Russian Government Program of Competitive Growth of Kazan Federal University.

References

1. Simons G. Lower bounds for average sample number of sequential multihypothesis tests. Ann. Math. Stat., 1967, vol. 38, no. 5, pp. 1343-1364.

2. Volodin I.N. Estimates of the necessary sample size in problems of statistical classifications. II. Theory Probab. Its Appl., 1977, vol. 22, no. 4, pp. 730-745. doi: 10.1137/1122086.

3. Volodin I.N. Optimum sample size in statistical inference procedures. Izv. Vyssh. Uchebn. Zaved., Mat., 1978, vol. 21, no. 12, pp. 33-45. (In Russian)

4. Volodin I.N. Bounds for the necessary sample size in statistical classification problems. I. Theory Probab. Its Appl., 1977, vol. 22, no. 2, pp. 339-348. doi: 10.1137/1122037.

5. Volodin I.N. Lower bounds for average sample size and efficiency of statistical inference procedures. Theory Probab. Its Appl., 1979, vol. 24, no. 1, pp. 120-129. doi: 10.1137/1124009.

6. Volodin I.N. Lower bounds for the mean sample size in goodness-of-fit and homogeneity tests. Theory Probab. Its Appl., 1980, vol. 24, no. 3, pp. 640-649. doi: 10.1137/1124079.

7. Volodin I.N. Lower bounds for the mean sample size in invariance tests. Theory Probab. Its Appl., 1980, vol. 25, no. 2, pp. 359-360. doi: 10.1137/1125043.

8. Volodin I.N. Lower bounds for sample size sufficient for procedures of guaranteed equivariant estimation. Sov. Math., 1982, vol. 26, no. 3, pp. 15-20.

9. Khamdeev I.I. Lower bounds for sufficient sample size in procedures for guaranteed equivariant estimation (the case of general transformation group). Sov. Math., 1983, vol. 27, no. 11, pp. 101-104.

10. Malutov M.B. Lower bounds for mean duration of a sequentially programmable experiment. Sov. Math., 1983, vol. 27, no. 11, pp. 21-47.

11. Volodin I.N. Guaranteed statistical inference procedures (determination of the optimal sample size). J. Sov. Math., 1989, vol. 44, no. 5, pp. 568-600.

12. Galtchouk L.I., Maljutov M.B. One bound for the mean duration of sequential testing homogeneity. In: Kitsos C.P., Muller W.G. (Eds.) MODA 4 - Advances in Model-Oriented Data Analysis. Contributions to Statistics. Heidelberg, Physica. 1995, pp. 49-56. doi: 10.1007/978-3-662-12516-8_5.

13. Kareev I.A. Lower bounds for average sample size and efficiency of sequential selection procedures. Theory Probab. Its Appl., 2013, vol. 57, no. 2, pp. 227-242. doi: 10.1137/S0040585X97985935.

14. Kareev I. Lower bounds for the expected sample size of sequential procedures for selecting and ranking of binomial and Poisson populations. Lobachevskii J. Math., 2016, vol. 37, no. 4, pp. 455-465. doi: 10.1134/S1995080216040119.

15. Kareev I. Lower bounds for expected sample size of sequential procedures for the multinomial selection problems. Commun. Stat.: Theory Methods, 2017, vol. 46, no. 19, pp. 1-8. doi: 10.1080/03610926.2016.1222429.

16. Kareev I.A. Lower bound for the average sample size and the efficiency of ranking sequential procedures. Theory Probab. Its Appl., 2014, vol. 58, no. 3, pp. 503-509. doi: 10.1137/S0040585X97986709.

Received December 14, 2017

Kareev Iskander Amirovich, Candidate of Physical and Mathematical Sciences, Associate Professor of the Department of Mathematical Statistics Kazan Federal University

ul. Kremlevskaya, 18, Kazan, 420008 Russia E-mail: kareevia@gmail.com

Volodin Igor Nikolaevich, Doctor of Physical and Mathematical Sciences, Professor of the Department of Mathematical Statistics Kazan Federal University

ul. Kremlevskaya, 18, Kazan, 420008 Russia


For citation: Kareev I.A., Volodin I.N. Lower bounds for the expected sample size in the classical and d-posterior statistical problems. Uchenye Zapiski Kazanskogo Universiteta. Seriya Fiziko-Matematicheskie Nauki, 2018, vol. 160, no. 2, pp. 309-316.
