A. V. Lapko
Institute of Computational Modelling, Russian Academy of Sciences, Siberian Branch, Russia, Krasnoyarsk
V. A. Lapko
Siberian State Aerospace University named after academician M. F. Reshetnev, Russia, Krasnoyarsk
THE ANALYSIS OF NONPARAMETRIC MIXTURE PROPERTIES WITH A PROBABILITY DENSITY OF A MULTIDIMENSIONAL RANDOM VARIABLE
The asymptotic properties of a mixture of nonparametric estimates of the probability density of a multidimensional random variable are studied in this article. They are compared with the properties of the traditional Rosenblatt-Parzen nonparametric probability density estimate, depending on the number of components in the mixture and the dimension of the random variable.
Keywords: mixture of probability densities, nonparametric estimation, large samples, asymptotic properties.
The application of nonparametric statistical methods based on Rosenblatt-Parzen type probability density estimates [1; 2] is a rapidly developing approach to modelling systems under a priori uncertainty. However, as the conditions under which a system is studied become more complex, traditional nonparametric algorithms and models run into methodological and computational difficulties; this is most apparent when large volumes of statistical data are processed.
A promising way around these problems is to decompose training samples by size and to apply parallel computing technology.
The purpose of this work is to justify the effectiveness of decomposition principles in processing large arrays of statistical data, based on an analysis of the asymptotic properties of a mixture of nonparametric probability density estimates.
Let $V = (x^i,\ i = \overline{1,n})$ be a sample of $n$ independent observations of a $k$-dimensional random variable $x = (x_v,\ v = \overline{1,k})$ with probability density $p(x)$. The form of $p(x)$ is a priori unknown.
Let us divide the sample $V$ into $T$ groups of observations $V_j = (x^i,\ i \in I_j)$, $j = \overline{1,T}$. The set of indices of the observations $x$ in the group with number $j$ is denoted by $I_j$, where $\bigcup_{j=1}^{T} I_j = I = (i = \overline{1,n})$. The number of elements in each sample $V_j = (x^i,\ i \in I_j)$ is the same and equals $\bar n = n/T$.
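For each sample $V_j$ let us construct a nonparametric estimate of the probability density of the multidimensional random variable $x$ [1]:
$$\bar p_j(x) = \frac{1}{\bar n \prod_{v=1}^{k} c_v} \sum_{i \in I_j} \prod_{v=1}^{k} \Phi\left(\frac{x_v - x_v^i}{c_v}\right), \quad j = \overline{1,T}. \quad (1)$$
In statistic (1) the kernel functions $\Phi(u_v)$ satisfy the conditions of normalization, positivity, and symmetry. The kernel parameters $c_v = c_v(n)$ decrease as $n$ increases.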
Let the ranges of values of the components $x_v$ of the vector $x$ be identical. Under these conditions it is reasonable to assume that the coefficients $c_v$ in the nonparametric probability density estimates $\bar p_j(x)$, $j = \overline{1,T}$, are identical and equal to $c$. Then estimate (1) takes the form:
$$\bar p_j(x) = \frac{1}{\bar n c^k} \sum_{i \in I_j} \prod_{v=1}^{k} \Phi\left(\frac{x_v - x_v^i}{c}\right), \quad j = \overline{1,T}. \quad (2)$$
As an estimate of $p(x)$ from the statistical sample $V$ we shall use a mixture of nonparametric probability density estimates of the form:
$$\bar p(x) = \frac{1}{T} \sum_{j=1}^{T} \bar p_j(x). \quad (3)$$
Statistic (3) allows parallel computing technology to be used when estimating the probability density from large samples.
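The construction of estimates (2) and their mixture (3) can be illustrated with a minimal sketch. It assumes a Gaussian kernel (which is symmetric, positive, normalized, and has unit second moment); the function names, the synthetic data, and the value c = 0.4 are illustrative choices and are not taken from the article.

```python
import numpy as np

def kernel_density(x, sample, c):
    """Estimate of form (2) at points x (shape (m, k)) from one sample (shape (n_bar, k))."""
    x = np.atleast_2d(x)
    n_bar, k = sample.shape
    # u[m, i, v] = (x_v - x_v^i) / c for every evaluation point and observation
    u = (x[:, None, :] - sample[None, :, :]) / c
    # Gaussian kernel: an assumption of this sketch, not prescribed by the article
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return phi.prod(axis=2).sum(axis=1) / (n_bar * c**k)

def mixture_density(x, sample, T, c):
    """Mixture (3): the average of T estimates (2) built on equal-size groups V_j."""
    n_bar = sample.shape[0] // T
    groups = [sample[j * n_bar:(j + 1) * n_bar] for j in range(T)]
    # Each term is independent of the others, so this loop could be run in parallel
    return np.mean([kernel_density(x, g, c) for g in groups], axis=0)

rng = np.random.default_rng(0)
V = rng.standard_normal((2000, 2))   # synthetic sample, k = 2
x0 = np.zeros((1, 2))
p_mix = mixture_density(x0, V, T=4, c=0.4)
p_full = kernel_density(x0, V, c=0.4)
print(p_mix, p_full)
```

When the same coefficient c is used and n is divisible by T, the mixture coincides with the estimate built on the whole sample. The two differ in practice because each component's coefficient is chosen for the group size n/T rather than for n.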
The asymptotic properties of $\bar p(x)$ are given by the following statement.
Theorem. Let $p(x)$ and its first two derivatives with respect to each component $x_v$, $v = \overline{1,k}$, be bounded and continuous; let the kernel functions $\Phi(u_v)$ satisfy the conditions
$$\Phi(u_v) = \Phi(-u_v), \quad 0 \le \Phi(u_v) < \infty, \quad \int \Phi(u_v)\,du_v = 1, \quad \int u_v^2\,\Phi(u_v)\,du_v = 1, \quad \int u_v^m\,\Phi(u_v)\,du_v < \infty, \quad 0 \le m < \infty; \quad v = \overline{1,k},$$
and let the sequence $c = c(n)$ of blur coefficients of the kernel functions be such that $c \to 0$ and $n c^k \to \infty$ as $n \to \infty$.
Then, for finite values of $T$, the nonparametric estimate (3) of the probability density $p(x)$ is asymptotically unbiased and consistent.
Hereinafter, infinite limits of integration are omitted.
Proof. 1. By definition:
$$M(\bar p(x)) = \frac{1}{T} \sum_{j=1}^{T} M(\bar p_j(x)) = \frac{1}{T} \sum_{j=1}^{T} \frac{1}{\bar n c^k} \sum_{i \in I_j} \int \ldots \int \prod_{v=1}^{k} \Phi\left(\frac{x_v - t_v}{c}\right) p(t_1, \ldots, t_k)\, dt_1 \ldots dt_k =$$
$$= \frac{1}{c^k} \int \ldots \int \prod_{v=1}^{k} \Phi\left(\frac{x_v - t_v}{c}\right) p(t_1, \ldots, t_k)\, dt_1 \ldots dt_k = \int \ldots \int \prod_{v=1}^{k} \Phi(u_v)\, p(x_1 - c u_1, \ldots, x_k - c u_k)\, du_1 \ldots du_k,$$
where $M$ denotes the mathematical expectation. The transformation takes into account that the elements of the statistical samples $V_j$, $j = \overline{1,T}$, are values of the same random variable $t$ with probability density $p(t_1, \ldots, t_k)$.
Expanding $p(x_1 - c u_1, \ldots, x_k - c u_k)$ in a Taylor series at the point $x = (x_1, \ldots, x_k)$ and retaining the first two nonvanishing terms of the series, we get:
$$W_1 = M(\bar p(x) - p(x)) \sim \frac{c^2}{2} \sum_{v=1}^{k} p_v^{(2)}(x), \quad (4)$$
where $p_v^{(2)}(x)$ is the second derivative of the probability density $p(x)$ with respect to the component $x_v$.
Hence, provided that $c \to 0$ as $n \to \infty$, the mixture of nonparametric probability density estimates (3) is asymptotically unbiased.
2. To prove the convergence of $\bar p(x)$ in mean square, consider the following expression:
$$M \int \ldots \int (p(x) - \bar p(x))^2\, dx_1 \ldots dx_k = M \int \ldots \int \left[\frac{1}{T} \sum_{j=1}^{T} (p(x) - \bar p_j(x))\right]^2 dx_1 \ldots dx_k =$$
$$= \frac{1}{T^2} M \left[\sum_{j=1}^{T} \int \ldots \int (p(x) - \bar p_j(x))^2\, dx_1 \ldots dx_k + \sum_{j=1}^{T} \sum_{\substack{t=1 \\ t \ne j}}^{T} \int \ldots \int (p(x) - \bar p_j(x))(p(x) - \bar p_t(x))\, dx_1 \ldots dx_k\right]. \quad (5)$$
Let us find the asymptotic expression for a component of the second part of expression (5):
$$M \int \ldots \int (p(x) - \bar p_j(x))(p(x) - \bar p_t(x))\, dx_1 \ldots dx_k = \int \ldots \int p^2(x)\, dx_1 \ldots dx_k - M \int \ldots \int \bar p_t(x)\, p(x)\, dx_1 \ldots dx_k -$$
$$- M \int \ldots \int \bar p_j(x)\, p(x)\, dx_1 \ldots dx_k + M \int \ldots \int \bar p_j(x)\, \bar p_t(x)\, dx_1 \ldots dx_k. \quad (6)$$
Let us transform its last term:
$$M \int \ldots \int \bar p_j(x)\, \bar p_t(x)\, dx_1 \ldots dx_k = \int \ldots \int M(\bar p_j(x))\, M(\bar p_t(x))\, dx_1 \ldots dx_k,$$
which, for sufficiently large volumes of statistical data and taking expression (4) into account, is represented as:
$$\int \ldots \int \left[p(x) + \frac{c^2}{2} \sum_{v=1}^{k} p_v^{(2)}(x)\right]^2 dx_1 \ldots dx_k. \quad (7)$$
Note that the asymptotic expression of a statistic of the form
$$M \int \ldots \int \bar p_t(x)\, p(x)\, dx_1 \ldots dx_k$$
corresponds to:
$$\int \ldots \int \left[p(x) + \frac{c^2}{2} \sum_{v=1}^{k} p_v^{(2)}(x)\right] p(x)\, dx_1 \ldots dx_k. \quad (8)$$
Substituting expressions (7) and (8) into (6), after a series of simple transformations, gives:
$$M \int \ldots \int (p(x) - \bar p_j(x))(p(x) - \bar p_t(x))\, dx_1 \ldots dx_k \sim \frac{c^4}{4} \int \ldots \int \left(\sum_{v=1}^{k} p_v^{(2)}(x)\right)^2 dx_1 \ldots dx_k = \frac{c^4}{4} B. \quad (9)$$
In the work of V. A. Epanechnikov [2], an asymptotic expression was obtained for the mean square deviation of the nonparametric probability density estimate $\bar p_j(x)$, which forms the first part of expression (5):
$$M \int \ldots \int (p(x) - \bar p_j(x))^2\, dx_1 \ldots dx_k \sim \frac{\prod_{v=1}^{k} \int \Phi^2(u_v)\, du_v}{\bar n c^k} + \frac{c^4}{4} B. \quad (10)$$
Taking (9) and (10) into account, expression (5) for sufficiently large values of $n$ is represented as:
$$M \int \ldots \int (p(x) - \bar p(x))^2\, dx_1 \ldots dx_k \sim \frac{\prod_{v=1}^{k} \int \Phi^2(u_v)\, du_v}{T \bar n c^k} + \frac{c^4}{4} B. \quad (11)$$
It is easy to see that if $c \to 0$ and $n c^k \to \infty$ as $n \to \infty$, the mixture of probability density estimates (3) converges in mean square to $p(x)$; taking into account its asymptotic unbiasedness, it is therefore consistent.
At $T = 1$ the result (11) coincides with Epanechnikov's theorem [2], which confirms the correctness of the transformations performed.
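The passage from (5), (9), and (10) to (11) can be written out explicitly. The following is a short sketch of that arithmetic; the shorthand $A$ for the product of kernel integrals is introduced here only for brevity and is not the authors' notation.

```latex
% A := \prod_{v=1}^{k} \int \Phi^2(u_v)\,du_v   (shorthand assumed for this sketch)
% Expression (5) contains T squared-deviation terms of form (10)
% and T(T-1) cross terms of form (9):
\frac{1}{T^2}\left[\,T\left(\frac{A}{\bar n c^k} + \frac{c^4}{4}B\right)
  + T(T-1)\,\frac{c^4}{4}B\right]
  = \frac{A}{T \bar n c^k} + \frac{c^4}{4}B .
```

Since $T \bar n = n$, setting $T = 1$ on the right-hand side recovers the single-sample expression (10).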
Analysis of the approximating properties of the statistic $\bar p(x)$. To analyse the efficiency of the mixture (3) of nonparametric probability density estimates relative to the Rosenblatt-Parzen probability density estimate
$$\tilde p(x) = \frac{1}{n c^k} \sum_{i=1}^{n} \prod_{v=1}^{k} \Phi\left(\frac{x_v - x_v^i}{c}\right),$$
let us consider the ratios of the asymptotic expressions for their mean square deviations at the optimal values of the blur coefficients of the kernel functions.
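Let us determine the minimum value $W_2$ of expression (11) at the optimal values $c^*$ of the blur coefficients of the nonparametric estimates $\bar p_j(x)$ composing the mixture of probability densities. Under the accepted assumptions,
$$c^* = \left[\frac{k \prod_{v=1}^{k} \int \Phi^2(u_v)\, du_v}{\bar n B}\right]^{\frac{1}{k+4}}.$$
Then:
$$W_2 = \left[\frac{\left(\prod_{v=1}^{k} \int \Phi^2(u_v)\, du_v\right)^4 B^k}{\bar n^4}\right]^{\frac{1}{k+4}} \frac{4 + Tk}{4\, T\, k^{\frac{k}{k+4}}}. \quad (12)$$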
If $k = 1$, then $W_2$ coincides with the minimal asymptotic expression of the mean square deviation for the mixture of nonparametric probability density estimates obtained in [3].
At $T = 1$ and $\bar n = n$, expression (12) corresponds to the minimal asymptotic expression $W_2'$ of the mean square deviation of the Rosenblatt-Parzen probability density estimate [2].
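The optimal coefficient and expression (12) follow from (10) and (11) by a short calculation, sketched below. The sketch assumes that $c^*$ minimizes the per-component expression (10), which matches the dependence of $c^*$ on $\bar n$; the shorthand $A$ is introduced here for brevity and is not the authors' notation.

```latex
% A := \prod_{v=1}^{k} \int \Phi^2(u_v)\,du_v   (shorthand assumed for this sketch)
% Minimizing (10) with respect to c:
\frac{d}{dc}\left(\frac{A}{\bar n c^k} + \frac{c^4}{4}B\right)
  = -\frac{kA}{\bar n c^{k+1}} + c^3 B = 0
  \;\Longrightarrow\;
  (c^*)^{k+4} = \frac{kA}{\bar n B}.
% Substituting c^* into (11):
\frac{A}{T \bar n (c^*)^k} + \frac{(c^*)^4}{4}B
  = \left[\frac{A^4 B^k}{\bar n^4}\right]^{\frac{1}{k+4}}
    \left(\frac{1}{T k^{\frac{k}{k+4}}} + \frac{k^{\frac{4}{k+4}}}{4}\right)
  = \left[\frac{A^4 B^k}{\bar n^4}\right]^{\frac{1}{k+4}}
    \frac{4 + Tk}{4\,T\,k^{\frac{k}{k+4}}}.
```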
After simple transformations we get:
$$R_2 = \frac{W_2}{W_2'} = \frac{4 + Tk}{(4 + k)\, T^{\frac{k}{k+4}}}.$$
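By analogy, we calculate the ratio of the minimal values of the main variance components of the statistics $\bar p(x)$ and $\tilde p(x)$:
$$W_3 = \frac{1}{T\, k^{\frac{k}{k+4}}} \left[\frac{\left(\prod_{v=1}^{k} \int \Phi^2(u_v)\, du_v\right)^4 B^k}{\bar n^4}\right]^{\frac{1}{k+4}}, \qquad W_3' = \frac{1}{k^{\frac{k}{k+4}}} \left[\frac{\left(\prod_{v=1}^{k} \int \Phi^2(u_v)\, du_v\right)^4 B^k}{n^4}\right]^{\frac{1}{k+4}}.$$
Their ratio is:
$$R_3 = \frac{W_3'}{W_3} = T^{\frac{k}{k+4}}.$$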
It is easy to verify that the ratio of the asymptotic bias expressions $W_1$, $W_1'$ of the probability density estimates $\bar p(x)$ and $\tilde p(x)$ at the optimal blur coefficients of the kernel functions equals:
$$R_1 = \frac{W_1}{W_1'} = T^{\frac{2}{k+4}}.$$
Figure. Dependences of the ratios $R_2$ (a), $R_3$ (b), $R_1$ (c) on the dimension $k$ of the random variable $x = (x_v,\ v = \overline{1,k})$ and on the number $T = 1$-$10$ (curves 1, ..., 10) of nonparametric estimates composing the mixture (3) of the probability density $p(x)$
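The dependences shown in the figure can be tabulated directly from the closed-form ratios. The sketch below assumes $R_2 = (4 + Tk)/((4 + k)\,T^{k/(k+4)})$ and $R_3 = T^{k/(k+4)}$ as given in the text, and $R_1 = T^{2/(k+4)}$, where the exponent is inferred from (4) and the dependence of $c^*$ on the sample size because the printed formula is partly illegible. The function name and the chosen grid of k and T are illustrative.

```python
def ratios(k, T):
    """Return (R1, R2, R3) for dimension k and number of mixture components T."""
    # Bias ratio W1 / W1'; the exponent 2/(k+4) is an assumption inferred from (4)
    r1 = T ** (2.0 / (k + 4))
    # Mean square deviation ratio W2 / W2'
    r2 = (4 + T * k) / ((4 + k) * T ** (k / (k + 4)))
    # Ratio of the variance components W3' / W3
    r3 = T ** (k / (k + 4))
    return r1, r2, r3

for k in (1, 2, 5, 10):
    for T in (1, 2, 5, 10):
        r1, r2, r3 = ratios(k, T)
        print(f"k={k:2d} T={T:2d}  R1={r1:.3f}  R2={r2:.3f}  R3={r3:.3f}")
```

By the weighted inequality between arithmetic and geometric means, $R_2 \ge 1$ for all $T \ge 1$, and $R_2$ tends to 1 as $k$ grows.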
As the number $T$ of components in the mixture of nonparametric probability density estimates grows, the ratios $R_2 > 1$ (figure, a) and $R_1 > 1$ (figure, c) increase. The observed deterioration of the approximating properties of the mixture $\bar p(x)$ compared with the traditional nonparametric probability density estimate $\tilde p(x)$ (12) is explained by the decrease in the sample sizes used to estimate the components $\bar p_j(x)$.
This effect is characteristic of small dimensions $k$ of the random variable. As the probability density estimation problem becomes more complex with growing $k$, the efficiency of both nonparametric estimates $\bar p(x)$ and $\tilde p(x)$ decreases. The corresponding criteria $W_2$, $W_2'$ and $W_1$, $W_1'$ become commensurable, which shows in the decreasing values of the ratios $R_2$ and $R_1$.
The proposed mixture $\bar p(x)$ of probability density estimates has a smaller variance than the nonparametric estimate $\tilde p(x)$, which is explained by its structure, since the statistic $\bar p(x)$ is synthesized by means of an averaging operator (figure, b). This advantage grows with the number $T$ of nonparametric estimates composing the mixture $\bar p(x)$ and with the dimension $k$ of the random variable.
Based on an analysis of the asymptotic properties of mixtures of nonparametric estimates of the probability density of a multidimensional random variable, the possibility of decomposing the initial statistical data when synthesizing nonparametric statistics from large samples is justified. Compared with the traditional Rosenblatt-Parzen nonparametric estimate, the studied statistic has a considerably smaller variance and permits the use of parallel computing technologies.
References
1. Parzen E. On estimation of a probability density function and mode // Ann. Math. Statistic. 1962. Vol. 33. P. 1065-1076.
2. Epanechnikov V. A. Nonparametric estimation of a many-dimensional probability density // Teoriya veroyatnosti i ee primeneniya, 1969. Vol. 14. № 1. P. 156-161.
3. Lapko V. A., Varochkin S. S., Egorochkin I. A. Development and research of a nonparametric estimation of the probability density grounded on a principle of decomposition of learning sample on its size // Vestnik SibSAU. 2009. Vol. 1 (22). P. 45-49.
© Lapko A. V., Lapko V. A., 2010
D. V. Lichargin
Siberian Federal University, Russia, Krasnoyarsk
GENERATION OF THE STATE TREE BASED ON GENERATIVE GRAMMAR OVER TREES OF STRINGS
The article considers the principle of generating state trees on the basis of generative grammars over trees of strings for such objects as natural-language sentences and two- and three-dimensional images. The image of an object is treated as a forest that includes trees of different layouts of the object, for the purpose of modelling complex systems.
Keywords: natural language generation, generative grammars, semantics.
The problem of generating natural-language sentences is one of the key issues in computer science and formal grammar theory. The generation of meaningful speech belongs to the area of semantics and computer science [1-7]. State tree generation is studied well enough in computer science and in system analysis. With respect to generating trees of meaningful phrases, the problem is connected first of all with the method of sentence generation by means of Chomsky's generative grammars. Generative grammars are successfully applied in software such as machine translation systems, expert systems, spell checkers, etc.
The focus of the article is the analysis of the prospects for using generative grammars not over strings, but over trees of strings. This approach makes it possible to generate grammatically and semantically meaningful speech more effectively and to increase the efficiency of various aspects of image analysis and synthesis.
The importance of effectively generating meaningful language constructions and two- or three-dimensional images is generally recognized and is connected with the demands of linguistic and other software.
The purpose of this research is to apply generative grammars over trees, where necessary, as a means of generating meaningful speech connected with a broader heterogeneous context.
The novelty of the work is in the application of generative grammars not over strings but over trees of strings.