Izvestiya of Saratov University. Mathematics. Mechanics. Informatics, 2023, vol. 23, iss. 4, pp. 422-434. mmi.sgu.ru
https://doi.org/10.18500/1816-9791-2023-23-4-422-434 EDN: ANLRAB
Article
Wasserstein and weighted metrics for multidimensional Gaussian distributions
M. Y. Kelbert¹, Y. Suhov²
1 Higher School of Economics — National Research University, 20 Myasnitskaya St., Moscow 101000, Russia
2 DPMMS, Penn State University, 201 Old Main, State College, PA 16802, USA
Mark Y. Kelbert, [email protected], https://orcid.org/0000-0002-3952-2012, AuthorID: 1137288
Yurii Suhov, [email protected], AuthorID: 1131362
Abstract. We present a number of lower and upper bounds for the Levy - Prokhorov, Wasserstein, Frechet, and Hellinger distances between probability distributions of the same or different dimensions. The weighted (or context-sensitive) total variation and Hellinger distances are introduced. Upper and lower bounds for these weighted metrics are proved. Lower bounds for the minimum of the sum of different types of errors in sensitive hypothesis testing are proved.
Keywords: Levy - Prokhorov distance, Wasserstein distance, weighted total variation distance, Dobrushin's inequality, weighted Pinsker's inequality, weighted Le Cam's inequality, weighted Fano's inequality
Acknowledgements: This research was supported by the Russian Science Foundation (project No. 23-21-00052) and the HSE University Basic Research Program.
For citation: Kelbert M. Y., Suhov Y. Wasserstein and weighted metrics for multidimensional Gaussian distributions. Izvestiya of Saratov University. Mathematics. Mechanics. Informatics, 2023, vol. 23, iss. 4, pp. 422-434. https://doi.org/10.18500/1816-9791-2023-23-4-422-434, EDN: ANLRAB
This is an open access article distributed under the terms of Creative Commons Attribution 4.0 International License (CC-BY 4.0)
Research Article
UDC 519.85
Wasserstein and weighted metrics for multidimensional Gaussian distributions
M. Ya. Kelbert¹, Yu. Suhov²
1 National Research University Higher School of Economics, 20 Myasnitskaya St., Moscow 101000, Russia
2 Pennsylvania State University, 201 Old Main, University Park campus, State College, PA 16802, USA
Mark Ya. Kelbert, Candidate of Physical and Mathematical Sciences, Research Professor at the Department of Statistics and Data Analysis, Faculty of Economic Sciences, [email protected], https://orcid.org/0000-0002-3952-2012, AuthorID: 1137288
Yurii Suhov, Candidate of Physical and Mathematical Sciences, Professor at the Department of Mathematics, [email protected], AuthorID: 1131362
Abstract. We present a number of lower and upper bounds for the Levy - Prokhorov, Wasserstein, Frechet, and Hellinger distances between probability distributions of the same or different dimensions. The weighted (or context-sensitive) total variation and Hellinger distances are introduced. Upper and lower bounds for these weighted metrics are proved, as well as lower bounds for the minimum of the sum of different types of errors in sensitive hypothesis testing.
Keywords: Levy - Prokhorov distance, Wasserstein distance, weighted total variation distance, Dobrushin's inequality, weighted Pinsker's inequality, weighted Le Cam's inequality, weighted Fano's inequality
Acknowledgements: This research was supported by the Russian Science Foundation (project No. 23-21-00052) and the HSE University Basic Research Program.
For citation: Kelbert M. Y., Suhov Y. Wasserstein and weighted metrics for multidimensional Gaussian distributions. Izvestiya of Saratov University. Mathematics. Mechanics. Informatics, 2023, vol. 23, iss. 4, pp. 422-434. https://doi.org/10.18500/1816-9791-2023-23-4-422-434, EDN: ANLRAB
This article is published under the terms of the Creative Commons Attribution 4.0 International License (CC-BY 4.0)
Introduction
In this note, we review basic facts about metrics for probability measures and provide specific formulae and simplified proofs that are not easily found in the literature. Alongside classical results, such as the estimate of the Levy - Prokhorov distance in terms of the Wasserstein distance presented in Section 1, we discuss some novel approaches. In Section 2, we review a recent development related to distances between distributions of different dimensions. Finally, in Section 3, we present the context-sensitive (or weighted) total variation distance and establish a number of new inequalities mimicking some classical results from information theory. Sections 1 and 2 of the paper are basically a review but contain several improvements. Section 3 is entirely original and has not been published before.
1. Levy - Prokhorov and Wasserstein distances
Let $P_i$, $i = 1, 2$, be probability distributions on a metric space $W$ with metric $r$. Define the Levy - Prokhorov distance $\rho_{L-P}(P_1, P_2)$ between $P_1, P_2$ as the infimum of numbers $\varepsilon > 0$ such that for any closed set $C \subset W$,
$$P_1(C) - P_2(C^\varepsilon) < \varepsilon, \qquad P_2(C) - P_1(C^\varepsilon) < \varepsilon, \qquad (1)$$
where $C^\varepsilon$ stands for the $\varepsilon$-neighborhood of $C$ in the metric $r$. It can be easily checked that $\rho_{L-P}(P_1, P_2) \le \tau(P_1, P_2)$, the total variation distance. Next, define the Wasserstein distance $W_p^r(P_1, P_2)$ between $P_1, P_2$ by
$$W_p^r(P_1, P_2) = \inf \big(\mathbb{E}_P[r(X_1, X_2)^p]\big)^{1/p},$$
where the infimum is taken over all joint distributions $P$ on $W \times W$ with marginals $P_i$. In the case of a Euclidean space with $r(x_1, x_2) = \|x_1 - x_2\|$, the index $r$ is omitted.
Theorem 1 (Dobrushin's bound).
$$\rho_{L-P}(P_1, P_2) \le \big[W_1^r(P_1, P_2)\big]^{1/2}. \qquad (2)$$
Proof. Suppose that there exists a closed set $C$ for which at least one of the inequalities (1) fails, say $P_1(C) \ge \varepsilon + P_2(C^\varepsilon)$. Then, for any joint distribution $P$ with marginals $P_1$ and $P_2$,
$$\mathbb{E}_P[r(X_1, X_2)] \ge \mathbb{E}_P\big[\mathbf{1}(r(X_1, X_2) \ge \varepsilon)\, r(X_1, X_2)\big] \ge \varepsilon\, P(r(X_1, X_2) \ge \varepsilon) \ge \varepsilon\, P(X_1 \in C,\, X_2 \in W \setminus C^\varepsilon) =$$
$$= \varepsilon\, \big[P(X_1 \in C) - P(X_1 \in C,\, X_2 \in C^\varepsilon)\big] \ge \varepsilon\, \big[P(X_1 \in C) - P(X_2 \in C^\varepsilon)\big] = \varepsilon\, \big[P_1(C) - P_2(C^\varepsilon)\big] \ge \varepsilon^2.$$
This leads to (2), as claimed. □
The Levy - Prokhorov distance is quite tricky to compute, whereas the Wasserstein distance can be found explicitly in a number of cases. Say, in the 1D case $W = \mathbb{R}^1$ we have (cf. [1]):
Theorem 2.
$$W_1(P_1, P_2) = \int_{\mathbb{R}} |F_1(x) - F_2(x)|\, dx. \qquad (3)$$
Proof. First, check the upper bound $W_1(P_1, P_2) \le \int_{\mathbb{R}} |F_1(x) - F_2(x)|\, dx$. Consider $\xi \sim U[0, 1]$, $X_i = F_i^{-1}(\xi)$, $i = 1, 2$. Then, in view of the Fubini theorem,
$$\mathbb{E}[|X_1 - X_2|] = \int_0^1 \big|F_1^{-1}(y) - F_2^{-1}(y)\big|\, dy = \int_{\mathbb{R}} |F_1(x) - F_2(x)|\, dx.$$
Let us now prove the reverse inequality. Set $Y = (X_2 - X_1) \vee 0$, $Z = (X_1 - X_2) \vee 0$; then $\mathbb{E}[|X_1 - X_2|] = \mathbb{E}[Y] + \mathbb{E}[Z]$. It can be easily checked that
$$\mathbb{E}[Z] = \int_{\mathbb{R}} P(X_1 > y,\, X_2 \le y)\, dy.$$
A similar argument applies to $Y$, with $X_1$ and $X_2$ swapped. This yields
$$\mathbb{E}[|X_1 - X_2|] = \int_{\mathbb{R}} \big[P(X_1 > y, X_2 \le y) + P(X_2 > y, X_1 \le y)\big]\, dy = \int_{\mathbb{R}} \big[P_1(X_1 \le y) + P_2(X_2 \le y) - 2P(X_1 \le y, X_2 \le y)\big]\, dy \ge$$
$$\ge \int_{\mathbb{R}} \big[F_1(x) + F_2(x) - 2\min[F_1(x), F_2(x)]\big]\, dx = \int_{\mathbb{R}} |F_1(x) - F_2(x)|\, dx. \qquad \square$$
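As a quick numerical illustration of (3), the following sketch (with arbitrary illustrative Gaussian parameters, not taken from the text) compares the CDF-difference integral with the quantile-coupling representation of $W_1$:

```python
# Numerical check of formula (3) for two 1-D Gaussians (illustrative sketch).
import numpy as np
from scipy.stats import norm

P1, P2 = norm(loc=0.0, scale=1.0), norm(loc=1.5, scale=2.0)  # arbitrary example

# Right-hand side of (3): integral of |F1(x) - F2(x)| over a wide grid.
x = np.linspace(-30, 30, 400001)
rhs = np.sum(np.abs(P1.cdf(x) - P2.cdf(x))) * (x[1] - x[0])

# Coupling X_i = F_i^{-1}(xi), xi ~ U[0,1]: E|X_1 - X_2| as an integral in y.
y = np.linspace(1e-6, 1 - 1e-6, 400001)
lhs = np.sum(np.abs(P1.ppf(y) - P2.ppf(y))) * (y[1] - y[0])

print(lhs, rhs)  # the two values agree up to discretization error
```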
Proposition 1. For $d = 1$ and $p > 1$,
$$W_p(P_1, P_2)^p = p(p-1)\int_{-\infty}^{\infty} dy \int_y^{\infty} \max[F_2(y) - F_1(x), 0]\,(x - y)^{p-2}\, dx\, +$$
$$+\, p(p-1)\int_{-\infty}^{\infty} dx \int_x^{\infty} \max[F_1(x) - F_2(y), 0]\,(y - x)^{p-2}\, dy.$$
Proof. Follows from the identity
$$\mathbb{E}\,|X_1 - X_2|^p = p(p-1)\int_{-\infty}^{\infty} dy \int_y^{\infty} \big[F_2(y) - F(x, y)\big](x - y)^{p-2}\, dx\, +$$
$$+\, p(p-1)\int_{-\infty}^{\infty} dx \int_x^{\infty} \big[F_1(x) - F(x, y)\big](y - x)^{p-2}\, dy,$$
where $F(x, y)$ is the joint CDF of $(X_1, X_2)$. The minimum is achieved for $F(x, y) = \min[F_1(x), F_2(y)]$. An alternative expression (see [2]) is
$$W_p(P_1, P_2)^p = \int_0^1 \big|F_1^{-1}(t) - F_2^{-1}(t)\big|^p\, dt.$$
□
Proposition 2. Let $(X, Y) \in \mathbb{R}^{2d}$ be jointly Gaussian random variables (RVs) with $\mathbb{E}[X] = \mu^X$, $\mathbb{E}[Y] = \mu^Y$. Then the Frechet-1 distance
$$\rho_{F_1}(X, Y) := \mathbb{E}\Big[\sum_{j=1}^d |X_j - Y_j|\Big] = \sum_{j=1}^d \Big[ (\mu_j^X - \mu_j^Y)\Big(1 - 2\Phi\Big(-\frac{\mu_j^X - \mu_j^Y}{\bar\sigma_j}\Big)\Big) + 2\bar\sigma_j\, \varphi\Big(-\frac{\mu_j^X - \mu_j^Y}{\bar\sigma_j}\Big) \Big], \qquad (4)$$
where $\bar\sigma_j = \big((\sigma_j^X)^2 + (\sigma_j^Y)^2 - 2\,\mathrm{Cov}(X_j, Y_j)\big)^{1/2}$, and $\varphi$ and $\Phi$ are the PDF and CDF of the standard Gaussian RV. Note that in the case $\mu^X = \mu^Y$ the first term in (4) vanishes, and the second term gives
$$\rho_{F_1}(X, Y) = \sqrt{\frac{2}{\pi}}\, \sum_{j=1}^d \bar\sigma_j.$$
We also present expressions for the Frechet-3 and Frechet-4 distances:
$$\rho_{F_3}(X, Y) = \Big(\sum_{j=1}^d \mathbb{E}|X_j - Y_j|^3\Big)^{1/3} = \Big( \sum_{j=1}^d \Big[ \big((\mu_j^X - \mu_j^Y)^3 + 3(\mu_j^X - \mu_j^Y)\bar\sigma_j^2\big)\Big(1 - 2\Phi\Big(-\frac{\mu_j^X - \mu_j^Y}{\bar\sigma_j}\Big)\Big) + 2\bar\sigma_j\big((\mu_j^X - \mu_j^Y)^2 + 2\bar\sigma_j^2\big)\,\varphi\Big(-\frac{\mu_j^X - \mu_j^Y}{\bar\sigma_j}\Big) \Big] \Big)^{1/3},$$
$$\rho_{F_4}(X, Y) = \Big(\sum_{j=1}^d \mathbb{E}|X_j - Y_j|^4\Big)^{1/4} = \Big( \sum_{j=1}^d \Big[ (\mu_j^X - \mu_j^Y)^4 + 6(\mu_j^X - \mu_j^Y)^2\bar\sigma_j^2 + 3\bar\sigma_j^4 \Big] \Big)^{1/4}.$$
Let $\mu^X = \mu^Y$. The expressions for $\rho_{F_1} - \rho_{F_4}$ are minimized when $\mathrm{Cov}(X_j, Y_j)$, $j = 1, \ldots, d$, are maximal. However, this fact does not immediately lead to explicit expressions for the Wasserstein metrics. The problem here is that the joint covariance matrix $\Sigma_{X,Y}$ should be positive-definite. So, the straightforward choice $\mathrm{Corr}(X_j, Y_j) = 1$ is not always possible, see Theorem 3 below.
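The coordinate-wise formula (4) is easy to test by simulation; the sketch below (with illustrative means, variances and covariance chosen arbitrarily) compares a Monte Carlo estimate of $\mathbb{E}|X_j - Y_j|$ with the closed form:

```python
# Monte Carlo check of the Frechet-1 formula (4) for one coordinate pair (X_j, Y_j).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu_x, mu_y = 1.0, -0.5               # illustrative means
sd_x, sd_y, cov_xy = 1.0, 2.0, 0.8   # illustrative (co)variances

# Jointly Gaussian sample of (X_j, Y_j).
cov = np.array([[sd_x**2, cov_xy], [cov_xy, sd_y**2]])
xy = rng.multivariate_normal([mu_x, mu_y], cov, size=2_000_000)
mc = np.mean(np.abs(xy[:, 0] - xy[:, 1]))

# Formula (4): E|X_j - Y_j| with m = mu_x - mu_y and sigma_bar as in the text.
m = mu_x - mu_y
sbar = np.sqrt(sd_x**2 + sd_y**2 - 2 * cov_xy)
formula = m * (1 - 2 * norm.cdf(-m / sbar)) + 2 * sbar * norm.pdf(-m / sbar)

print(mc, formula)  # close up to Monte Carlo error
```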
Maurice Rene Frechet (1878-1973), a French mathematician, worked in topology, functional analysis, probability theory, and statistics. He was the first to introduce the concept of a metric space (1906) and to prove the representation theorem in $L^2$ (1907). However, in both cases the credit was given to other people: Hausdorff and Riesz. Some sources claim that he discovered the Cramer - Rao inequality before anybody else, but such a claim was impossible to verify since the lecture notes of his class appeared to be lost. Frechet worked in several places in France before moving to Paris in 1928. In 1941 he succeeded Borel as the Chair of Calculus of Probabilities and Mathematical Physics at the Sorbonne. In 1956 he was elected to the French Academy of Sciences, at the age of 78, which was rather unusual. He influenced and mentored a number of young mathematicians, notably Fortet and Loeve. He was an enthusiast of Esperanto; some of his papers were published in this language.
In the Gaussian case, it is convenient to use the following extension of Dobrushin's bound to $p \ge 1$, which for $p = 2$ can be combined with the explicit formula for $W_2$ below:
$$\rho_{L-P}(P_1, P_2) \le \big[W_p(P_1, P_2)\big]^{p/(p+1)}, \quad p \ge 1.$$
Theorem 3. Let $X_i \sim N(\mu_i, \Sigma_i)$, $i = 1, 2$, be $d$-dimensional Gaussian RVs. For simplicity, assume that both matrices $\Sigma_1$ and $\Sigma_2$ are non-singular (in the general case the statement holds with the inverses understood as Moore - Penrose pseudo-inverses). The $L^2$-Wasserstein distance $W_2(X_1, X_2) = W_2(N(\mu_1, \Sigma_1), N(\mu_2, \Sigma_2))$ equals
$$W_2(X_1, X_2) = \Big[ \|\mu_1 - \mu_2\|^2 + \mathrm{tr}(\Sigma_1) + \mathrm{tr}(\Sigma_2) - 2\,\mathrm{tr}\big[(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2})^{1/2}\big] \Big]^{1/2}, \qquad (5)$$
where $(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2})^{1/2}$ stands for the positive-definite matrix square root. The value (5) is achieved when $X_2 = \mu_2 + A(X_1 - \mu_1)$, where $A = \Sigma_1^{-1/2}(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2})^{1/2}\Sigma_1^{-1/2}$.
Corollary. Let $\mu_1 = \mu_2 = 0$. Then for $d = 1$: $W_2(X_1, X_2) = |\sigma_1 - \sigma_2|$. For $d = 2$,
$$W_2(X_1, X_2) = \Big[ \mathrm{tr}(\Sigma_1) + \mathrm{tr}(\Sigma_2) - 2\big[\mathrm{tr}(\Sigma_1 \Sigma_2) + 2\sqrt{\det(\Sigma_1 \Sigma_2)}\big]^{1/2} \Big]^{1/2}. \qquad (6)$$
Note that the expression in (6) vanishes when $\Sigma_1 = \Sigma_2$.
Example 1. (a) Let $X \sim N(0, \Sigma_X)$, $Y \sim N(0, \Sigma_Y)$ where $\Sigma_X = \sigma_X^2 I_d$ and $\Sigma_Y = \sigma_Y^2 I_d$. Then $W_2(X, Y) = \sqrt{d}\,|\sigma_X - \sigma_Y|$.
(b) Let $d = 2$, $X \sim N(0, \Sigma_X)$, $Y \sim N(0, \Sigma_Y)$ where $\Sigma_X = \sigma_X^2 I_2$, $\Sigma_Y = \sigma_Y^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, $\rho \in (-1, 1)$. Then
$$W_2(X, Y) = 2^{1/2}\Big( \sigma_X^2 + \sigma_Y^2 - \sigma_X \sigma_Y \big[2 + 2(1 - \rho^2)^{1/2}\big]^{1/2} \Big)^{1/2}.$$
(c) Let $d = 2$, $X \sim N(0, \Sigma_X)$, $Y \sim N(0, \Sigma_Y)$ where $\Sigma_X = \sigma_X^2 \begin{pmatrix} 1 & \rho_1 \\ \rho_1 & 1 \end{pmatrix}$, $\Sigma_Y = \sigma_Y^2 \begin{pmatrix} 1 & \rho_2 \\ \rho_2 & 1 \end{pmatrix}$ and $\rho_1, \rho_2 \in (-1, 1)$. Then
$$W_2(X, Y) = 2^{1/2}\Big( \sigma_X^2 + \sigma_Y^2 - \sigma_X \sigma_Y \big[2 + 2\rho_1\rho_2 + 2(1 - \rho_1^2)^{1/2}(1 - \rho_2^2)^{1/2}\big]^{1/2} \Big)^{1/2}.$$
Note that in the case $\rho_1 = \rho_2$, $W_2(X, Y) = \sqrt{2}\,|\sigma_X - \sigma_Y|$, as in (a).
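The closed form (5) is straightforward to evaluate numerically. The following sketch (with illustrative parameters) cross-checks (5), the two-dimensional expression (6), and the closed form of Example 1 (b):

```python
# Evaluate W_2 between two zero-mean Gaussians via (5) and via (6) (d = 2 sketch).
import numpy as np
from scipy.linalg import sqrtm

def w2_gauss(mu1, S1, mu2, S2):
    """Formula (5): W_2 between N(mu1, S1) and N(mu2, S2)."""
    root = sqrtm(sqrtm(S1) @ S2 @ sqrtm(S1)).real
    return np.sqrt(np.sum((mu1 - mu2) ** 2) + np.trace(S1) + np.trace(S2)
                   - 2 * np.trace(root))

sx, sy, rho = 1.3, 0.7, 0.4          # illustrative parameters of Example 1 (b)
S1 = sx**2 * np.eye(2)
S2 = sy**2 * np.array([[1.0, rho], [rho, 1.0]])
mu = np.zeros(2)

w_formula5 = w2_gauss(mu, S1, mu, S2)
# Formula (6) specialised to d = 2.
w_formula6 = np.sqrt(np.trace(S1) + np.trace(S2)
                     - 2 * np.sqrt(np.trace(S1 @ S2) + 2 * np.sqrt(np.linalg.det(S1 @ S2))))
# Example 1 (b) in closed form.
w_example = np.sqrt(2) * np.sqrt(sx**2 + sy**2 - sx * sy * np.sqrt(2 + 2 * np.sqrt(1 - rho**2)))

print(w_formula5, w_formula6, w_example)  # all three coincide
```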
Proof. First, reduce to the case $\mu_1 = \mu_2 = 0$ by using the identity $W_2^2(X_1, X_2) = \|\mu_1 - \mu_2\|^2 + W_2^2(\xi_1, \xi_2)$ with $\xi_i = X_i - \mu_i$. Note that the infimum in (5) is always attained on Gaussian measures, as $W_2(X_1, X_2)$ is expressed in terms of the joint covariance matrix only (cf. (8) below). Let us write the joint covariance matrix in the block form
$$\Sigma = \begin{pmatrix} \Sigma_1 & K \\ K^T & \Sigma_2 \end{pmatrix} = \begin{pmatrix} I & 0 \\ K^T \Sigma_1^{-1} & I \end{pmatrix} \begin{pmatrix} \Sigma_1 & 0 \\ 0 & S \end{pmatrix} \begin{pmatrix} I & \Sigma_1^{-1} K \\ 0 & I \end{pmatrix}, \qquad (7)$$
where $S = \Sigma_2 - K^T \Sigma_1^{-1} K$ is the so-called Schur complement. The problem is reduced to finding the matrix $K$ in (7) that minimizes the expression
$$\int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^2 \, dP_{X,Y}(x, y) = \mathrm{tr}(\Sigma_1) + \mathrm{tr}(\Sigma_2) - 2\,\mathrm{tr}(K) \qquad (8)$$
subject to the constraint that the matrix $\Sigma$ in (7) is positive semi-definite. The goal is to check that the minimum in (5) is achieved when the Schur complement $S$ in (7) equals $0$. Consider the fiber $\sigma^{-1}(S)$, i.e. the set of all matrices $K$ such that $\sigma(K) := \Sigma_2 - K^T \Sigma_1^{-1} K = S$. It is enough to check that the maximum value of $\mathrm{tr}(K)$ on this fiber equals
$$\max_{K \in \sigma^{-1}(S)} \mathrm{tr}(K) = \mathrm{tr}\big[\big(\Sigma_1^{1/2}(\Sigma_2 - S)\Sigma_1^{1/2}\big)^{1/2}\big]. \qquad (9)$$
Since the matrix $S$ is positive semi-definite, it is easy to check that the fiber with $S = 0$ should be selected. In order to establish (9), represent the positive semi-definite matrix $\Sigma_2 - S$ in the form $\Sigma_2 - S = U D_r^2 U^T$, where the diagonal matrix $D_r^2 = \mathrm{diag}(\lambda_1^2, \ldots, \lambda_r^2, 0, \ldots, 0)$ with $\lambda_i > 0$, and $U = (U_r \,|\, U_{d-r})$ is the orthogonal matrix of the corresponding eigenvectors. We obtain the following $r \times r$ identity:
$$\big(\Sigma_1^{-1/2} K U_r D_r^{-1}\big)^T \big(\Sigma_1^{-1/2} K U_r D_r^{-1}\big) = I_r.$$
It means that $\Sigma_1^{-1/2} K U_r D_r^{-1} = O_r$, an 'orthogonal' $d \times r$ matrix with $O_r^T O_r = I_r$, and $K = \Sigma_1^{1/2} O_r D_r U_r^T$. The matrix $O_r$ parametrises the fiber $\sigma^{-1}(S)$. As a result, we have an optimization problem
$$\mathrm{tr}(O_r^T M) \to \max, \qquad M = \Sigma_1^{1/2} U_r D_r,$$
in the matrix-valued argument $O_r$, subject to the constraint $O_r^T O_r = I_r$. A straightforward computation gives the answer $\mathrm{tr}[(M^T M)^{1/2}]$, which is equivalent to (9). The technical details can be found in [3] and [4]. □
For general zero-mean RVs $X, Y \in \mathbb{R}^d$ with covariance matrices $\Sigma_i$, $i = 1, 2$, the following inequality holds [5]:
$$\mathrm{tr}(\Sigma_1) + \mathrm{tr}(\Sigma_2) - 2\,\mathrm{tr}\big[(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\big] \le \mathbb{E}\big[\|X - Y\|^2\big] \le \mathrm{tr}(\Sigma_1) + \mathrm{tr}(\Sigma_2) + 2\,\mathrm{tr}\big[(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}\big].$$
2. The distances between distributions of different dimensions
For $m \le d$ define the set of matrices with orthonormal rows
$$O(m, d) = \{V \in \mathbb{R}^{m \times d} : VV^T = I_m\}$$
and the set of affine maps $\varphi_{V,b} : \mathbb{R}^d \to \mathbb{R}^m$, $\varphi_{V,b}(x) = Vx + b$.
Definition 1. For any measures $\mu \in M(\mathbb{R}^m)$ and $\nu \in M(\mathbb{R}^d)$, the embeddings of $\mu$ into $\mathbb{R}^d$ are the set of $d$-dimensional measures $\Phi^+(\mu, d) := \{\alpha \in M(\mathbb{R}^d) : \varphi_{V,b}(\alpha) = \mu$ for some $V \in O(m, d)$, $b \in \mathbb{R}^m\}$, and the projections of $\nu$ onto $\mathbb{R}^m$ are the set of $m$-dimensional measures $\Phi^-(\nu, m) := \{\beta \in M(\mathbb{R}^m) : \varphi_{V,b}(\nu) = \beta$ for some $V \in O(m, d)$, $b \in \mathbb{R}^m\}$; here $\varphi_{V,b}(\alpha)$ denotes the pushforward of $\alpha$ under $\varphi_{V,b}$.
Given a metric $\gamma$ between measures of the same dimension, define the projection distance $\gamma^-(\mu, \nu) := \inf_{\beta \in \Phi^-(\nu, m)} \gamma(\mu, \beta)$ and the embedding distance $\gamma^+(\mu, \nu) := \inf_{\alpha \in \Phi^+(\mu, d)} \gamma(\alpha, \nu)$. It may be proved [6] that $\gamma^+(\mu, \nu) = \gamma^-(\mu, \nu)$; denote the common value by $\hat\gamma(\mu, \nu)$ (in particular, $\hat W_2$ for the Wasserstein-2 distance).
Example 2. Let us compute the Wasserstein distance between a one-dimensional $X \sim N(\mu, \sigma^2)$ and a $d$-dimensional $Y \sim N(\nu, \Sigma)$. Denote by $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_d$ the eigenvalues of $\Sigma$. Then
$$\hat W_2(X, Y) = \begin{cases} \sigma - \sqrt{\lambda_1}, & \text{if } \sigma > \sqrt{\lambda_1}, \\ 0, & \text{if } \sqrt{\lambda_d} \le \sigma \le \sqrt{\lambda_1}, \\ \sqrt{\lambda_d} - \sigma, & \text{if } \sigma < \sqrt{\lambda_d}. \end{cases} \qquad (10)$$
Indeed, in view of Theorem 3, write
$$\big(\hat W_2(X, Y)\big)^2 = \min_{\|x\|_2 = 1,\, b \in \mathbb{R}} \Big[ |\mu - x^T\nu - b|^2 + \sigma^2 + x^T\Sigma x - 2\sigma\sqrt{x^T\Sigma x} \Big] = \min_{\|x\|_2 = 1} \big(\sigma - \sqrt{x^T\Sigma x}\big)^2,$$
and (10) follows.
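The case analysis in (10) can be confirmed by brute force over unit vectors $x$; the sketch below uses an arbitrary SPD covariance matrix for illustration:

```python
# Check (10): projection distance between 1-D N(0, sigma^2) and d-dim N(0, Sigma).
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 4, 1.2                      # illustrative dimension and 1-D std
A = rng.standard_normal((d, d))
Sigma = A @ A.T                        # a random SPD covariance matrix
lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]   # lambda_1 >= ... >= lambda_d

# Closed form (10).
if sigma > np.sqrt(lam[0]):
    w_hat = sigma - np.sqrt(lam[0])
elif sigma < np.sqrt(lam[-1]):
    w_hat = np.sqrt(lam[-1]) - sigma
else:
    w_hat = 0.0

# Brute force: minimise (sigma - sqrt(x^T Sigma x))^2 over random unit vectors x.
xs = rng.standard_normal((200000, d))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
vals = (sigma - np.sqrt(np.einsum('ij,jk,ik->i', xs, Sigma, xs))) ** 2
print(w_hat, np.sqrt(vals.min()))      # the brute-force value approaches (10)
```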
Example 3 (Wasserstein-2 distance between a Dirac measure on $\mathbb{R}^m$ and a discrete measure on $\mathbb{R}^d$). Let $y \in \mathbb{R}^m$ and let $\mu_1 \in M(\mathbb{R}^m)$ be the Dirac measure with $\mu_1(\{y\}) = 1$, i.e., all mass centered at $y$. Let $x_1, \ldots, x_k \in \mathbb{R}^d$ be distinct points, $p_1, \ldots, p_k \ge 0$, $p_1 + \ldots + p_k = 1$, and let $\mu_2 \in M(\mathbb{R}^d)$ be the discrete measure of point masses with $\mu_2(\{x_i\}) = p_i$, $i = 1, \ldots, k$. We seek the Wasserstein distance $\hat W_2(\mu_1, \mu_2)$ in closed form. Suppose $m \le d$; then
$$\big(\hat W_2(\mu_1, \mu_2)\big)^2 = \inf_{V \in O(m,d),\, b \in \mathbb{R}^m} \sum_{i=1}^k p_i \|V x_i + b - y\|^2 = \inf_{V \in O(m,d)} \sum_{i=1}^k p_i \Big\|V x_i - \sum_{j=1}^k p_j V x_j\Big\|^2 = \inf_{V \in O(m,d)} \mathrm{tr}(V C V^T),$$
noting that the second infimum is attained by $b = y - \sum_{i=1}^k p_i V x_i$ and defining $C$ in the last infimum to be
$$C := \sum_{i=1}^k p_i \Big(x_i - \sum_{j=1}^k p_j x_j\Big)\Big(x_i - \sum_{j=1}^k p_j x_j\Big)^T \in \mathbb{R}^{d \times d}.$$
Let the eigenvalue decomposition of the symmetric positive semi-definite matrix $C$ be $C = Q \Lambda Q^T$ with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$, $\lambda_1 \ge \ldots \ge \lambda_d \ge 0$. Then
$$\inf_{V \in O(m,d)} \mathrm{tr}(V C V^T) = \sum_{i=0}^{m-1} \lambda_{d-i},$$
and the infimum is attained when $V \in O(m, d)$ has row vectors given by the last $m$ columns of $Q \in O(d)$. □
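The last infimum of Example 3 is a standard trace minimization; the sketch below (with hypothetical atoms and weights) builds the matrix $C$ and checks that the eigenvector choice attains the sum of the $m$ smallest eigenvalues:

```python
# Example 3: inf over V in O(m, d) of tr(V C V^T) = sum of the m smallest eigenvalues of C.
import numpy as np

rng = np.random.default_rng(2)
m, d, k = 2, 5, 7                        # illustrative dimensions and number of atoms
xs = rng.standard_normal((k, d))         # atoms x_1, ..., x_k of mu_2
p = rng.random(k); p /= p.sum()          # weights p_1, ..., p_k (sum to 1)

xbar = p @ xs                            # weighted mean of the atoms
C = (xs - xbar).T @ np.diag(p) @ (xs - xbar)   # the d x d scatter matrix C

eigval, eigvec = np.linalg.eigh(C)       # eigenvalues in ascending order
inf_value = eigval[:m].sum()             # sum of the m smallest eigenvalues

V = eigvec[:, :m].T                      # rows = eigenvectors of the m smallest eigenvalues
print(inf_value, np.trace(V @ C @ V.T))  # the optimal V attains the infimum
```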
A closely related question is to find a projection of zero-mean Gaussian models onto a space of low dimension $r$ such that the distance between the projections of $X$ and $Y$ is maximal. We start the discussion with the TV distance. Suppose $r \le d$, and we want to find a low-dimensional projection $A \in \mathbb{R}^{r \times d}$, $AA^T = I_r$, of the multidimensional data $X \sim N(\mu_1, \Sigma_1)$ and $Y \sim N(\mu_2, \Sigma_2)$ such that $TV(AX, AY) \to \max$. The problem may be reduced to the case $\mu_1 = \mu_2 = 0$, $\Sigma_1 = I_d$, $\Sigma_2 = \Sigma$, cf. [7]. Based on the results from [7, 8], it is natural to maximize
$$\min\Big[1, \Big(\sum_{i=1}^r g(\lambda_i)\Big)^{1/2}\Big],$$
where $g(x) = (x - 1)^2$ and $\lambda_i$ are the eigenvalues of $A \Sigma A^T$. Consider all permutations $\pi$ of the eigenvalues $\lambda_1, \ldots, \lambda_d$ of $\Sigma$. Let
$$\pi^* = \arg\max_\pi \sum_{i=1}^r g(\lambda_{\pi(i)}), \qquad \gamma_i = \lambda_{\pi^*(i)}, \quad i = 1, \ldots, r.$$
Then the rows of the matrix $A$ should be selected as the normalized eigenvectors of $\Sigma$ associated with the eigenvalues $\gamma_i$.
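The selection rule just described amounts to a few lines of code. The sketch below assumes the reduction $\Sigma_1 = I_d$, $\Sigma_2 = \Sigma$ has been performed and uses the TV surrogate $g(x) = (x - 1)^2$ with an arbitrary illustrative $\Sigma$:

```python
# Pick the r rows of A as eigenvectors of Sigma whose eigenvalues maximise sum g(lambda).
import numpy as np

def tv_projection(Sigma, r):
    """Rows of A = eigenvectors of Sigma with the r largest values of g(x) = (x - 1)^2."""
    lam, Q = np.linalg.eigh(Sigma)            # eigenvalues and orthonormal eigenvectors
    g = (lam - 1.0) ** 2                      # surrogate for the per-eigenvalue TV contribution
    keep = np.argsort(g)[::-1][:r]            # indices with the largest g-values
    return Q[:, keep].T                       # A in R^{r x d}, with A A^T = I_r

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
Sigma = B @ B.T + 0.1 * np.eye(5)             # illustrative SPD covariance
A = tv_projection(Sigma, r=2)
print(A @ A.T)                                # identity: the rows are orthonormal
print(np.linalg.eigvalsh(A @ Sigma @ A.T))    # the retained eigenvalues of Sigma
```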
Remark. For zero-mean Gaussian models, this procedure may be repeated mutatis mutandis for any of the so-called $f$-divergences $D_f(P\|Q) := \mathbb{E}_P\big[f\big(\frac{dQ}{dP}\big)\big]$, where $f$ is a convex function such that $f(1) = 0$, cf. [7]. The most interesting examples are:
1) KL-divergence: $f(t) = t\log t$ and $g(x) = \frac{1}{2}(x - \log x - 1)$;
2) symmetric KL-divergence: $f(t) = (t - 1)\log t$ and $g(x) = \frac{1}{2}\big(x + \frac{1}{x} - 2\big)$;
3) the total variation distance: $f(t) = \frac{1}{2}|t - 1|$ and $g(x) = (x - 1)^2$;
4) the square of the Hellinger distance: $f(t) = (\sqrt{t} - 1)^2$ and $g(x) = 1 - \Big(\frac{4x}{(1 + x)^2}\Big)^{1/4}$;
5) $\chi^2$-divergence: $f(t) = (t - 1)^2$ and $g(x) = \frac{1}{\sqrt{x(2 - x)}} - 1$.
For estimates, the following classical result is very useful.
Theorem 4 (Poincare Separation Theorem). Let $\Sigma$ be a real symmetric $d \times d$ matrix, and let $A$ be a semi-orthogonal $r \times d$ matrix. The eigenvalues $\lambda_1 \ge \ldots \ge \lambda_d$ of $\Sigma$ (sorted in the descending order) and the eigenvalues of $A \Sigma A^T$, denoted by $\{\gamma_i, i = 1, \ldots, r\}$ (sorted in the descending order), satisfy
$$\lambda_{d - r + i} \le \gamma_i \le \lambda_i, \quad i = 1, \ldots, r.$$
Let $X_1, X_2$ be random variables with probability density functions $p, q$, respectively. Define the Kullback - Leibler (KL) divergence
$$KL(P_{X_1} \| P_{X_2}) = \int p \log \frac{p}{q}.$$
The KL-divergence is not symmetric and does not satisfy the triangle inequality. However, it gives rise to the so-called Jensen - Shannon metric [9]
$$JS(P, Q) = \sqrt{\tfrac{1}{2} KL(P \| R) + \tfrac{1}{2} KL(Q \| R)}$$
with $R = \frac{1}{2}(P + Q)$. It is a lower bound for the total variation distance:
$$0 \le JS(P, Q) \le TV(P, Q).$$
The Jensen - Shannon metric is not easy to compute in terms of covariance matrices in the multi-dimensional Gaussian case.
A natural way to develop a computationally effective distance in the Gaussian case is to define first a metric between positive-definite matrices. Let $\lambda_1, \ldots, \lambda_d$ be the generalized eigenvalues, i.e. the solutions of $\det(\Sigma_1 - \lambda \Sigma_2) = 0$. Define the distance between positive-definite matrices by
$$d(\Sigma_1, \Sigma_2) = \sqrt{\sum_{j=1}^d (\ln \lambda_j)^2},$$
and a geodesic metric between Gaussian PDs $X_1 \sim N(\mu_1, \Sigma_1)$ and $X_2 \sim N(\mu_2, \Sigma_2)$:
$$d(X_1, X_2) = \Big( \tfrac{1}{2}\,\delta^T S^{-1} \delta + \sum_{j=1}^d (\ln \lambda_j)^2 \Big)^{1/2}, \qquad (11)$$
where $\delta = \mu_1 - \mu_2$ and $S = \frac{1}{2}\Sigma_1 + \frac{1}{2}\Sigma_2$. Equivalently,
$$d^2(\Sigma_1, \Sigma_2) = \mathrm{tr}\big[\big(\ln(\Sigma_1^{-1/2} \Sigma_2 \Sigma_1^{-1/2})\big)^2\big]. \qquad (12)$$
Remark. It may be proved that the set of symmetric positive-definite matrices $M^+(d, \mathbb{R})$ is a Riemannian manifold, and (12) is the geodesic distance corresponding to the bilinear form $B(X, Y) = \mathrm{tr}(XY)$ on the tangent space of symmetric matrices $M(d, \mathbb{R})$.
Note that the geodesic distances (11) and (12) between Gaussian PDs (or the corresponding covariance matrices) are equivalent to the formula for the Fisher information metric for the multivariate normal model [5]. Indeed, the multivariate normal model is a differentiable manifold, equipped with the Fisher information as the Riemannian metric, which may be used in statistical inference.
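For the covariance part, formulas (11) and (12) can be cross-checked numerically via generalized eigenvalues and the matrix logarithm (a sketch with arbitrary SPD matrices):

```python
# Geodesic distance between SPD matrices: generalized eigenvalues vs the matrix-log form (12).
import numpy as np
from scipy.linalg import eigh, sqrtm, logm

rng = np.random.default_rng(4)
B1, B2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
S1, S2 = B1 @ B1.T + np.eye(3), B2 @ B2.T + np.eye(3)   # illustrative SPD matrices

# Via generalized eigenvalues: det(S1 - lambda * S2) = 0.
lam = eigh(S1, S2, eigvals_only=True)
d_gen = np.sqrt(np.sum(np.log(lam) ** 2))

# Via (12): tr[(log(S1^{-1/2} S2 S1^{-1/2}))^2].
S1_isqrt = np.linalg.inv(sqrtm(S1).real)
M = logm(S1_isqrt @ S2 @ S1_isqrt).real
d_log = np.sqrt(np.trace(M @ M))

print(d_gen, d_log)   # the two values coincide
```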
Example 4. Consider i.i.d. random variables $Z_1, \ldots, Z_n$ that are bivariate normally distributed with diagonal covariance matrices, i.e. we focus on the manifold $M_{\mathrm{diag}} = \{N(\mu, \Lambda) : \mu \in \mathbb{R}^2,\ \Lambda \text{ diagonal}\}$. In this manifold, consider the submodel $M^*_{\mathrm{diag}} = \{N(\mu, \sigma^2 I) : \mu \in \mathbb{R}^2,\ \sigma^2 \in \mathbb{R}_+\}$ corresponding to the hypothesis $H_0 : \sigma_1^2 = \sigma_2^2$. First, consider the standard statistical estimates $\bar Z$ for the mean and $s_1^2, s_2^2$ for the variances. If $\hat\sigma^2$ denotes the geodesic estimate of the common variance, the squared distance between the initial estimate and the geodesic estimate under the hypothesis $H_0$ is given by
$$\frac{n}{2}\left[\Big(\ln\frac{s_1^2}{\sigma^2}\Big)^2 + \Big(\ln\frac{s_2^2}{\sigma^2}\Big)^2\right],$$
which is minimized by $\hat\sigma^2 = s_1 s_2$. Hence, instead of the arithmetic mean of the initial variance estimates, we use an estimate of the geometric mean of these quantities. □
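For completeness, the minimisation in Example 4 follows from the first-order condition in $\ln \sigma^2$:
$$\frac{\partial}{\partial \ln \sigma^2}\left[\Big(\ln\frac{s_1^2}{\sigma^2}\Big)^2 + \Big(\ln\frac{s_2^2}{\sigma^2}\Big)^2\right] = -2\Big(\ln\frac{s_1^2}{\sigma^2} + \ln\frac{s_2^2}{\sigma^2}\Big) = 0 \iff \ln \sigma^2 = \tfrac{1}{2}\big(\ln s_1^2 + \ln s_2^2\big) \iff \hat\sigma^2 = s_1 s_2.$$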
Finally, we present a distance between symmetric positive-definite matrices of different dimensions. Let $m \le d$, let $A$ be $m \times m$ and let $B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$ be $d \times d$; here $B_{11}$ is the $m \times m$ block. Then the distance is defined as follows:
$$d_2(A, B) := \Big( \sum_{j=1}^m \big(\max[0,\, \ln \lambda_j(A^{-1} B_{11})]\big)^2 \Big)^{1/2}. \qquad (13)$$
In order to estimate the distance (13), after the simultaneous diagonalization of matrices A and B, the following classical result is useful.
Theorem 5 (Cauchy interlacing inequalities). Let $B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$ be a $d \times d$ symmetric positive-definite matrix with eigenvalues $\lambda_1(B) \le \ldots \le \lambda_d(B)$ and an $m \times m$ block $B_{11}$. Then
$$\lambda_j(B) \le \lambda_j(B_{11}) \le \lambda_{j + d - m}(B), \quad j = 1, \ldots, m.$$
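A minimal sketch of (13) in code (illustrative matrices; the eigenvalues of $A^{-1}B_{11}$ are computed from the symmetric definite pair $(B_{11}, A)$):

```python
# Distance (13) between SPD matrices of different dimensions (m <= d).
import numpy as np
from scipy.linalg import eigh

def dist_different_dims(A, B):
    m = A.shape[0]
    B11 = B[:m, :m]                                   # leading m x m block of B
    lam = eigh(B11, A, eigvals_only=True)             # eigenvalues of A^{-1} B11
    return np.sqrt(np.sum(np.maximum(0.0, np.log(lam)) ** 2))

rng = np.random.default_rng(5)
MA = rng.standard_normal((2, 2)); A = MA @ MA.T + np.eye(2)      # 2 x 2 SPD
MB = rng.standard_normal((4, 4)); B = MB @ MB.T + np.eye(4)      # 4 x 4 SPD
print(dist_different_dims(A, B))
```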
3. Context sensitive probability metrics
Let a weight function, or graduation, $\varphi > 0$ on the phase space $\mathcal{X}$ be given (cf. [10, 11]). Define the total weighted variation (TWV) distance
$$\tau_\varphi(P_1, P_2) = \frac{1}{2}\left( \sup_A \Big[\int_A \varphi\, dP_1 - \int_A \varphi\, dP_2\Big] + \sup_A \Big[\int_A \varphi\, dP_2 - \int_A \varphi\, dP_1\Big] \right).$$
Similarly, define the weighted Hellinger distance. Let $p_1, p_2$ be the densities of $P_1, P_2$ with respect to a measure $\nu$. Then
$$\eta_\varphi(P_1, P_2) = \frac{1}{\sqrt{2}} \left( \int \varphi\, \big(\sqrt{p_1} - \sqrt{p_2}\big)^2 \, d\nu \right)^{1/2}.$$
Lemma 1. Let $p_1, p_2$ be the densities of $P_1, P_2$ with respect to a measure $\nu$. Then $\tau_\varphi(P_1, P_2)$ is a distance and
$$\tau_\varphi(P_1, P_2) = \frac{1}{2} \int \varphi\, |p_1 - p_2| \, d\nu. \qquad (14)$$
Proof. The triangle inequality and other properties of a distance follow immediately. Next,
$$\sup_A \Big[\int_A \varphi\, dP_1 - \int_A \varphi\, dP_2\Big] = \int \varphi\,(p_1 - p_2)_+\, d\nu = \frac{1}{2}\Big(\int \varphi\, p_1\, d\nu - \int \varphi\, p_2\, d\nu\Big) + \frac{1}{2}\int \varphi\, |p_1 - p_2|\, d\nu,$$
$$\sup_A \Big[\int_A \varphi\, dP_2 - \int_A \varphi\, dP_1\Big] = \frac{1}{2}\Big(\int \varphi\, p_2\, d\nu - \int \varphi\, p_1\, d\nu\Big) + \frac{1}{2}\int \varphi\, |p_1 - p_2|\, d\nu.$$
Summing these equalities, one gets (14). □
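Formula (14) can be checked on a grid against the definition of $\tau_\varphi$ via suprema (an illustrative weight $\varphi(x) = 1 + x^2$ and two Gaussian densities are used below):

```python
# Weighted total variation (14) on a grid: definition via suprema vs the density formula.
import numpy as np
from scipy.stats import norm

x = np.linspace(-12, 12, 200001); dx = x[1] - x[0]
phi = 1.0 + x**2                       # an illustrative weight (graduation) phi > 0
p1, p2 = norm.pdf(x, 0.0, 1.0), norm.pdf(x, 1.0, 1.5)   # densities of P_1, P_2

# sup_A int_A phi (p1 - p2) is attained at A = {p1 > p2}, and symmetrically for the other term.
sup1 = np.sum(phi * np.maximum(p1 - p2, 0.0)) * dx
sup2 = np.sum(phi * np.maximum(p2 - p1, 0.0)) * dx
tau_sup = 0.5 * (sup1 + sup2)

tau_14 = 0.5 * np.sum(phi * np.abs(p1 - p2)) * dx       # formula (14)
print(tau_sup, tau_14)                                   # identical up to round-off
```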
Let $\int \varphi\, p_1 \, d\nu \ge \int \varphi\, p_2 \, d\nu$. Then, by the weighted Gibbs inequality [10], $KL_\varphi(P_1 \| P_2) := \int \varphi\, p_1 \log\frac{p_1}{p_2}\, d\nu \ge 0$.
Theorem 6 (Weighted Pinsker's inequality).
$$\frac{1}{2}\int \varphi\, |p_1 - p_2|\, d\nu \le \sqrt{\frac{1}{2}\, KL_\varphi(P_1 \| P_2) \int \varphi\, dP_1}.$$
Proof. Define the function $G(x) = x\log x - x + 1$. The following bound holds:
$$G(x) = x\log x - x + 1 \ge \frac{3}{2}\,\frac{(x - 1)^2}{x + 2}, \quad x > 0. \qquad (15)$$
Indeed, both sides of (15) coincide at $x = 1$, their first derivatives coincide at $x = 1$, and the inequality between the second derivatives, $G''(x) = \frac{1}{x} \ge \frac{27}{(x + 2)^3}$, proves the result. Now, by the Cauchy - Schwarz inequality combined with (15),
$$\left(\frac{1}{2}\int \varphi\,|p_1 - p_2|\, d\nu\right)^2 = \frac{1}{4}\left(\int \varphi\, p_2\, \Big|\frac{p_1}{p_2} - 1\Big|\, d\nu\right)^2 \le \frac{1}{4}\int \varphi\, p_2\, G\Big(\frac{p_1}{p_2}\Big)\, d\nu \cdot \int \varphi\, p_2\, \frac{2}{3}\Big(\frac{p_1}{p_2} + 2\Big)\, d\nu =$$
$$= \frac{1}{4}\Big(KL_\varphi(P_1\|P_2) - \int \varphi\, dP_1 + \int \varphi\, dP_2\Big)\,\frac{2}{3}\Big(\int \varphi\, dP_1 + 2\int \varphi\, dP_2\Big) \le \frac{1}{2}\, KL_\varphi(P_1\|P_2)\int \varphi\, dP_1,$$
where the last step uses $\int \varphi\, dP_2 \le \int \varphi\, dP_1$. □
Theorem 7 (Weighted Le Cam's inequality).
$$\tau_\varphi(P_1, P_2) \ge \eta_\varphi(P_1, P_2)^2.$$
Proof. In view of the inequality
$$\frac{1}{2}|p_1 - p_2| = \frac{1}{2}p_1 + \frac{1}{2}p_2 - \min[p_1, p_2] \ge \frac{1}{2}p_1 + \frac{1}{2}p_2 - \sqrt{p_1 p_2},$$
one gets
$$\tau_\varphi(P_1, P_2) \ge \frac{1}{2}\int \varphi\, p_1\, d\nu + \frac{1}{2}\int \varphi\, p_2\, d\nu - \int \varphi\, \sqrt{p_1 p_2}\, d\nu = \eta_\varphi(P_1, P_2)^2.$$
□
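Both weighted inequalities admit a quick numerical sanity check on a grid; the sketch below uses the illustrative weight $\varphi(x) = 1 + x^2$ and two Gaussian densities chosen so that $\int \varphi\, dP_1 \ge \int \varphi\, dP_2$, as assumed for Theorem 6:

```python
# Numerical check of the weighted Pinsker (Theorem 6) and Le Cam (Theorem 7) bounds.
import numpy as np
from scipy.stats import norm

x = np.linspace(-12, 12, 200001); dx = x[1] - x[0]
phi = 1.0 + x**2                                   # illustrative weight
p1, p2 = norm.pdf(x, 0.6, 1.3), norm.pdf(x, 0.0, 1.0)   # int phi dP1 >= int phi dP2 here

tau  = 0.5 * np.sum(phi * np.abs(p1 - p2)) * dx                    # tau_phi, formula (14)
eta2 = 0.5 * np.sum(phi * (np.sqrt(p1) - np.sqrt(p2))**2) * dx     # eta_phi^2
kl   = np.sum(phi * p1 * np.log(p1 / p2)) * dx                     # KL_phi(P1 || P2)
m1   = np.sum(phi * p1) * dx                                       # int phi dP_1

print(tau, np.sqrt(0.5 * kl * m1))   # weighted Pinsker: tau <= sqrt(KL_phi * int(phi dP1) / 2)
print(tau, eta2)                     # weighted Le Cam:  tau >= eta_phi^2
```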
Next, we relate the TWV distance to the sum of the weighted (sensitive) errors of both types in statistical hypothesis testing. Let $C$ be the critical domain for testing the hypothesis $H_1 : P_1$ versus the alternative $H_2 : P_2$. Define by $\alpha_\varphi = \int_C \varphi\, dP_1$ and $\beta_\varphi = \int_{\mathcal{X}\setminus C} \varphi\, dP_2$ the weighted error probabilities of types I and II.
Lemma 2. Let $d = d_C$ be the decision rule with the critical domain $C$. Then
$$\inf_C\, [\alpha_\varphi + \beta_\varphi] = \frac{1}{2}\left[\int \varphi\, dP_1 + \int \varphi\, dP_2\right] - \tau_\varphi(P_1, P_2).$$
Proof. Denote $C^* = \{x : p_2(x) > p_1(x)\}$. Then the result follows from the equality, valid for all $C$,
$$\int_C \varphi\, dP_1 + \int_{\mathcal{X}\setminus C} \varphi\, dP_2 = \frac{1}{2}\left[\int \varphi\, dP_1 + \int \varphi\, dP_2\right] - \tau_\varphi(P_1, P_2) + \int \varphi\, |p_1 - p_2|\,\big[\mathbf{1}(x \in C \setminus C^*) + \mathbf{1}(x \in C^* \setminus C)\big]\, d\nu,$$
since the last integral is non-negative and vanishes for $C = C^*$.
□
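Lemma 2 can be illustrated on the same kind of grid by comparing the weighted error sum at the likelihood-ratio region $C^*$ with the right-hand side:

```python
# Lemma 2: the minimal weighted error sum equals (int phi dP1 + int phi dP2)/2 - tau_phi.
import numpy as np
from scipy.stats import norm

x = np.linspace(-12, 12, 200001); dx = x[1] - x[0]
phi = 1.0 + x**2
p1, p2 = norm.pdf(x, 0.0, 1.0), norm.pdf(x, 1.0, 1.5)

C_star = p2 > p1                                   # optimal critical region {p2 > p1}
alpha = np.sum(phi[C_star] * p1[C_star]) * dx      # weighted type-I error
beta  = np.sum(phi[~C_star] * p2[~C_star]) * dx    # weighted type-II error

tau = 0.5 * np.sum(phi * np.abs(p1 - p2)) * dx
rhs = 0.5 * (np.sum(phi * p1) + np.sum(phi * p2)) * dx - tau
print(alpha + beta, rhs)                           # both sides coincide
```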
Theorem 8 (Weighted Fano's inequality). Let $P_1, \ldots, P_M$, $M \ge 2$, be probability distributions such that $P_j \ll P_k$ for all $j, k$. Then
$$\inf_d \max_{1 \le j \le M} \int \varphi(x)\,\mathbf{1}(d(x) \ne j)\, dP_j(x) \ge \frac{1}{M}\sum_{j=1}^M \int \varphi\, dP_j - \frac{\dfrac{1}{M^2}\displaystyle\sum_{j,k} KL_\varphi(P_j \| P_k) + \log 2\,\dfrac{1}{M}\displaystyle\sum_{j=1}^M \int \varphi\, dP_j}{\log(M - 1)}, \qquad (16)$$
where the infimum is taken over all tests $d$ with values in $\{1, \ldots, M\}$.
Proof. Let $Z \in \{1, \ldots, M\}$ be a random variable such that $P(Z = j) = \frac{1}{M}$, and let $X \sim P_Z$. Note that the distribution $P_X$ of $X$ is a mixture, so that for any measure $\nu$ with $P_X \ll \nu$ we have $\frac{dP_X}{d\nu} = \frac{1}{M}\sum_{k=1}^M \frac{dP_k}{d\nu}$, and so
$$P(Z = j \mid X = x) = \frac{dP_j}{d\nu}(x) \Big/ \sum_{k=1}^M \frac{dP_k}{d\nu}(x).$$
By Jensen's inequality applied to the convex function $-\log x$, this implies
$$\int \varphi(x) \sum_{j=1}^M P(Z = j \mid X = x)\, \log P(Z = j \mid X = x)\, dP_X(x) \le \frac{1}{M^2}\sum_{j,k=1}^M KL_\varphi(P_j \| P_k) - \log(M)\,\frac{1}{M}\sum_{j=1}^M \int \varphi\, dP_j. \qquad (17)$$
On the other hand, for $j \ne d(x)$ denote $q_j = \dfrac{P(Z = j \mid X = x)}{P(Z \ne d(x) \mid X = x)}$ and set $h(u) = u\log u + (1 - u)\log(1 - u)$. Note that $h(u) \ge -\log 2$ and, by Jensen's inequality, $\sum_{j \ne d(x)} q_j \log q_j \ge -\log(M - 1)$. The following inequality holds:
$$\sum_{j=1}^M P(Z = j \mid X)\, \log P(Z = j \mid X) = h\big(P(Z = d(X) \mid X)\big) + P(Z \ne d(X) \mid X) \sum_{j \ne d(X)} q_j \log q_j \ge$$
$$\ge -\log 2 - \log(M - 1)\, P(Z \ne d(X) \mid X). \qquad (18)$$
Integration of (18) against $\varphi\, dP_X$ yields
$$\int \varphi(x) \sum_{j=1}^M P(Z = j \mid X = x)\, \log P(Z = j \mid X = x)\, dP_X(x) \ge$$
$$\ge -\log 2\,\frac{1}{M}\sum_{j=1}^M \int \varphi\, dP_j - \log(M - 1)\, \max_{1 \le j \le M} \int \varphi(x)\,\mathbf{1}(d(x) \ne j)\, dP_j(x). \qquad (19)$$
Combining (17) and (19), and using $\log M \ge \log(M - 1)$, proves (16). □
References
1. Vallander S. S. Calculation of the Wasserstein distance between probability distributions on the line. Theory of Probability & Its Applications, 1974, vol. 18, iss. 4, pp. 784-786. https://doi.org/10.1137/1118101
2. Rachev S. T. The Monge - Kantorovich mass transference problem and its stochastic applications. Theory of Probability & Its Applications, 1985, vol. 29, iss. 4, pp. 647-676. https://doi.org/10.1137/1129093
3. Givens C. R., Shortt R. M. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 1984, vol. 31, iss. 2, pp. 231-240. https://doi.org/10.1307/mmj/1029003026
4. Olkin I., Pukelsheim F. The distances between two random vectors with given dispersion matrices. Linear Algebra and its Applications, 1982, vol. 48, pp. 257-263. https://doi.org/10.1016/0024-3795(82)90112-4
5. Dowson D. C., Landau B. V. The Frechet distance between multivariate Normal distributions. Journal of Multivariate Analysis, 1982, vol. 12, iss. 3, pp. 450-455. https://doi.org/10.1016/0047-259X(82)90077-X
6. Cai Y., Lim L.-H. Distances between probability distributions of different dimensions. IEEE Transactions on Information Theory, 2022, vol. 68, iss. 6, pp. 4020-4031. https://doi.org/10.1109/TIT.2022.3148923
7. Dwivedi A., Wang S., Tajer A. Discriminant analysis under f -divergence measures. Entropy, 2022, vol. 24, iss. 2, art. 188, 26 p. https://doi.org/10.3390/e24020188
8. Devroye L., Mehrabian A., Reddad T. The total variation distance between high-dimensional Gaussians. arXiv preprint, 2020, arXiv:1810.08693v5, pp. 1-12.
9. Endres D. M., Schindelin J. E. A new metric for probability distributions. IEEE Transactions on Information Theory, 2003, vol. 49, iss. 7, pp. 1858-1860. https://doi.org/10.1109/TIT.2003.813506
10. Stuhl I., Suhov Y., Yasaei Sekeh S., Kelbert M. Basic inequalities for weighted entropies. Aequationes Mathematicae, 2016, vol. 90, iss. 4, pp. 817-848. https://doi.org/10.1007/s00010-015-0396-5
11. Stuhl I., Kelbert M., Suhov Y., Yasaei Sekeh S. Weighted Gaussian entropy and determinant inequalities. Aequationes Mathematicae, 2022, vol. 96, iss. 1, pp. 85-114. https://doi.org/10.1007/s00010-021-00861-3
Received 09.12.2022 / Accepted 25.12.2022 / Published 30.11.2023