
UDC 519.2, 004.93

Foundational Aspects of Theory of Statistical Function Estimation and Pattern Recognition

E. Fokoue

Department of Mathematics, Kettering University, 1700 West Third Avenue, Flint, MI, USA, 48504

This paper provides a gentle introduction to the foundational ideas, concepts and results in the field of science dedicated to the theory of statistical function estimation and pattern recognition. The so-called VC Theory of Vapnik and Chervonenkis is introduced and explored gradually. The emphasis is placed on helping the reader appreciate the importance of the extension of the classical law of large numbers to function spaces, and the key role that "new" concepts such as Empirical Risk Minimization (ERM) principle, ERM consistency, VC-dimension, and complexity control play in constructing algorithms that yield function estimators with optimal properties. As much as possible, each key concept is introduced via a tangible example, with the hope of helping the reader grasp the essential core of the foundational concept under exploration.

Key words and phrases: Statistical Learning Theory, Law of Large Numbers, Consistency, VC Theory, Regularization, Complexity Control, Bounds on generalization, Generalization.

1. Introduction

Let X and Y be two sets, and consider their Cartesian product Z = X × Y. Now, define Z^n = Z × Z × ... × Z to be the n-fold Cartesian product of Z. Assume that Z is equipped with a probability measure μ and let

z ∈ Z^n with z = ((x_1, y_1), (x_2, y_2), ..., (x_n, y_n))

denote the realization of a random sample of n examples, where each example z_i = (x_i, y_i) is independently drawn according to the above probability measure μ on the product space Z = X × Y. Throughout this paper, we consider the following problem: Given a random sample z = ((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)) and assuming that the probability measure μ is unknown, find the function f : X → Y that best captures the dependencies between the x_i's and the y_i's. We shall refer to X as the input space, and to Y as the output space. To help clarify the key concepts, ideas and results of interest, we shall consider a special case: X ⊂ ℝ² and Y = {-1, +1} or Y = {0, 1}, corresponding to binary classification in the plane (two-dimensional space). Despite its relatively straightforward nature, this special case provides most of the ingredients needed to address the key foundational issues underlying the theory of statistical function estimation and pattern recognition. Because the binary classification problem can be studied with rather basic mathematical tools, the intuition underlying the theory can be followed without much effort; our emphasis here is placed on the clarity of the results rather than on their technical details. The rest of this paper is organized as follows: section 2 provides the main definitions and concepts of statistical function estimation. Section 3 explores the concepts introduced in section 2 for the specific case of binary classification in the two-dimensional Euclidean space. Section 4 presents the foundational theorems with as much intuitive guidance and as many examples as possible; it also touches on some advanced concepts of statistical function estimation such as the growth function and the VC-dimension. Section 5 is dedicated to the conclusion and discussion.

2. Loss Functions, Risk Functionals and Optimal Prediction

Def 1 (Loss function and risk functional). Let f denote any generic function mapping an element x of X to its corresponding image f(x) in Y. Each time a pair (x, y) is drawn from μ(x, y), the disagreement between the image f(x) and the true image y is called the loss, denoted by L(y, f(x)). The expected value of this loss function with respect to the distribution μ(x, y) is called the risk functional of f. We shall denote the risk functional of f by R(f), so that

R(f) = E[L(Y, f(X))] = ∫ L(y, f(x)) dμ(x, y).

Best predictor (universal): The best function f* over the space Y^X of all measurable functions from X to Y is therefore

f* = arg inf_{f ∈ Y^X} R(f).

The risk R* corresponding to f * is then defined as

R* = R(f*) = inf_{f ∈ Y^X} R(f).

Best predictor in a function class F ⊂ Y^X: It turns out in practice that, because μ is unknown, it is hard (almost impossible) to obtain an expression for f*. One therefore needs to select a function space F ⊂ Y^X, and then choose the best estimator f̃ from F, i.e.,

f̃ = arg inf_{f ∈ F} R(f).

The risk R̃ associated with the best function in class F is then defined as

R̃ = R(f̃) = inf_{f ∈ F} R(f).

Empirical Risk Minimization Principle: Since the distribution μ(x, y) that generates the observations is unknown in practice, the risk functional R(f), which is our criterion for choosing the "best" function f from the function class F, cannot be computed. The theoretical risk functional R(f) is therefore replaced by the so-called empirical risk functional

R_n^z(f) = (1/n) Σ_{i=1}^n L(y_i, f(x_i))     (1)

based on the random sample z = ((x_1, y_1), ..., (x_n, y_n)). The Empirical Risk Minimization (ERM) principle then consists of finding

f_n^z = arg min_{f ∈ F} R_n^z(f).

The slightly complex notation f_n^z is used to emphasize the fact that the estimator obtained via the empirical risk is a random function based on a sample z of size n. The index n in this case helps to create the sequence of functions needed when studying properties such as convergence and rates of convergence. Indeed, a natural question that arises upon realizing that one is dealing with three different types of functions, namely f*, f̃ and f_n^z, is: What is the relationship between these three functions? For one thing, is it possible to quantify the difference between f* and f̃, i.e., for some

norm ‖·‖, what is the value of ‖f̃ - f*‖? Since f_n^z is considered as an estimator of f̃, the natural statistical question is then: Is f_n^z a consistent estimator of f̃? And if so, what is the rate of convergence of f_n^z to f̃? One may even be more ambitious and ask instead: Is f_n^z a consistent estimator of f*? And if so, what is the rate of convergence of f_n^z to f*? In other words, what can be said about

lim_{n→∞} Prob_{z ∈ Z^n} { ‖f_n^z - f̃‖ < ε }

or, for that matter,

lim_{n→∞} Prob_{z ∈ Z^n} { ‖f_n^z - f*‖ < ε } ?

It turns out that addressing the comparison between f_n^z and f̃ or f* directly is hard, partly because finding the appropriate norm is not easy, but also because constructing bounds is not straightforward even if one finds such a norm. Fortunately, since all these functions are defined through the risk functional, a more manageable approach to comparing the functions is to compare their corresponding risk functionals. For instance, given a fixed function f, how does R(f) compare to R_n^z(f)? Or more formally, for a fixed function f, and for all ε > 0, what is the value of

lim_{n→∞} Prob_{z ∈ Z^n} { |R_n^z(f) - R(f)| < ε } ?

In other words, for a fixed given function f, does the empirical risk R_n^z(f) converge to the theoretical risk R(f)? This convergence of risk functionals for a fixed function f will be referred to later as pointwise convergence. Unlike the direct comparison of the functions, which ran into problems as mentioned earlier, this comparison via risk functionals has the added advantage that one can then make confidence statements about the unknown value of R(f). For instance, for a given 0 < δ < 1, one can make statements of the form

|R_n^z(f) - R(f)| ≤ φ(n, δ)

with a probability of at least 1 - δ, for some bound φ(n, δ). Indeed, it turns out that pointwise convergence of the empirical risk to the true risk for a fixed function can be established by a rather straightforward application of Chebyshev's inequality.

Theorem 1 (Chebyshev's inequality). Let ξ be a random variable with finite mean E[ξ] and finite variance σ²(ξ) = V(ξ). Then, ∀ε > 0,

Prob{ |ξ - E(ξ)| > ε } ≤ σ²(ξ)/ε².

A very natural application of Chebyshev's inequality is its use in the study of sums of independent random variables. Indeed, if ξ is a random variable on a probability space Z with finite mean μ = E[ξ] and finite variance σ²(ξ) = V(ξ) = σ², then ∀ε > 0,

Prob_{z ∈ Z^n} { | (1/n) Σ_{i=1}^n ξ(z_i) - μ | > ε } ≤ σ²/(nε²).

An immediate (direct) consequence of Chebyshev's inequality is the fact that the empirical mean (1/n) Σ_{i=1}^n ξ(z_i) converges in probability to the theoretical mean μ in the limit of very large samples (as n → ∞), i.e.,

(1/n) Σ_{i=1}^n ξ(z_i) →_P μ.
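As a quick numerical illustration of this law-of-large-numbers behaviour, the following minimal Python sketch (a hypothetical example, with a Bernoulli(0.3) random variable standing in for ξ) compares the observed probability of a large deviation of the empirical mean from μ with the Chebyshev bound σ²/(nε²):

```python
import numpy as np

rng = np.random.default_rng(0)
p, eps = 0.3, 0.05            # Bernoulli parameter (so mu = p, sigma^2 = p(1-p)) and tolerance eps
sigma2 = p * (1 - p)

for n in (100, 1000, 10000):
    # fraction of 2000 repeated samples whose empirical mean deviates from mu by more than eps
    means = rng.binomial(n, p, size=2000) / n
    observed = np.mean(np.abs(means - p) > eps)
    chebyshev = min(sigma2 / (n * eps**2), 1.0)   # Chebyshev upper bound on that probability
    print(f"n={n:6d}  observed={observed:.4f}  Chebyshev bound={chebyshev:.4f}")
```

The observed deviation probability shrinks much faster than the Chebyshev bound, which anticipates the tighter exponential (Hoeffding-type) bounds discussed below.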

Application to risk functionals: Consider a fixed function f ∈ F, and let the random variable of interest be ξ with ξ(z) = L(y, f(x)) for z = (x, y). Then R_n^z(f) = (1/n) Σ_{i=1}^n ξ(z_i) and R(f) = E(ξ). Besides, it can easily be shown that R(f) = E[R_n^z(f)]. As a result, by Chebyshev's inequality, for a fixed function f, ∀ε > 0,

Prob_{z ∈ Z^n} { |R_n^z(f) - R(f)| > ε } ≤ σ²/(nε²),

so that, for a fixed function f ∈ F,

R_n^z(f) →_P R(f).

Another, and perhaps even more important, application of Chebyshev's inequality is its use in deriving confidence statements about the difference between the empirical quantities of interest and their theoretical counterparts. In other words, while it is important that the convergence occurs, it is crucial in learning theory to know how fast the convergence is in terms of (a) the number of examples observed (the sample size); (b) the desired confidence level (1 - δ); and (c) a characteristic of the function class under consideration, for instance through quantities like variances and other moments. For instance, the above Chebyshev's inequality on sums of random variables can be rewritten as

Prob_{z ∈ Z^n} { | (1/n) Σ_{i=1}^n ξ(z_i) - μ | ≤ √( σ²/(nδ) ) } ≥ 1 - δ,

by simply setting ε = √( σ²/(nδ) ) for any 0 < δ < 1. Therefore, one could assert that, with confidence at least 1 - δ,

| (1/n) Σ_{i=1}^n ξ(z_i) - μ | ≤ √( σ²/(nδ) ).

Bounds and rates for a fixed f: Applying the above to risk functionals yields, for a fixed f ∈ F,

|R_n^z(f) - R(f)| ≤ √( σ²/(nδ) ).

As it turns out, extensions (improvements) of Chebyshev's inequality yield faster rates of convergence of the empirical quantities R_n^z(f) to their theoretical counterparts R(f). One such improvement is provided by Hoeffding's inequality.

Theorem 2 (Hoeffding's inequality). Let ξ(z_1), ξ(z_2), ..., ξ(z_n) be a collection of i.i.d. random variables with ξ(z_i) ∈ [a, b]. Then, ∀ε > 0,

Prob{ | (1/n) Σ_{i=1}^n ξ(z_i) - E(ξ) | > ε } ≤ 2 exp( -2nε² / (b - a)² ).

For all δ ∈ (0,1), Hoeffding's inequality allows one to write

Prob{ | (1/n) Σ_{i=1}^n ξ(z_i) - E(ξ) | > (b - a) √( ln(2/δ)/(2n) ) } ≤ δ,

so that, with probability at least 1 - δ,

| (1/n) Σ_{i=1}^n ξ(z_i) - E(ξ) | ≤ (b - a) √( ln(2/δ)/(2n) ).

Assuming the loss function is bounded, i.e. L(y, f(x)) ∈ [a, b], a direct application of Hoeffding's inequality to risk functionals for a fixed f ∈ F yields

|R_n^z(f) - R(f)| ≤ (b - a) √( ln(2/δ)/(2n) ),

with probability at least 1 - δ.


By rewriting the Chebyshev bound as

|R_n^z(f) - R(f)| ≤ √( (2/δ) (σ²/(2n)) ),     (2)

it appears clearly that Hoeffding's inequality provides a faster rate of convergence, since the term in δ changes from 2/δ to ln(2/δ), and instead of the variance σ² of ξ we now have (b - a). Most of the bounds encountered here will involve a function of 2/δ, a function of 1/(2n), and a function of the variance. Indeed, 1 - δ confidence bounds will typically be of the form

θ √( ln(2/δ)/(2n) ),

with 0 < θ < 1 depending on the variance of the random variable ξ.
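The following short Python sketch (an illustration under the 0-1 loss assumption, where σ² ≤ 1/4 and b - a = 1) tabulates the two confidence-bound widths √(σ²/(nδ)) and √(ln(2/δ)/(2n)) side by side, in the spirit of the comparison plotted in Figures 2 and 3 below:

```python
import math

delta = 0.01
sigma2 = 0.25                      # worst-case variance of a {0,1}-valued loss: R(f)(1 - R(f)) <= 1/4

for n in (100, 1000, 10000):
    chebyshev = math.sqrt(sigma2 / (n * delta))           # confidence width from Chebyshev's inequality
    hoeffding = math.sqrt(math.log(2 / delta) / (2 * n))  # confidence width from Hoeffding (b - a = 1)
    print(f"n={n:6d}  Chebyshev width={chebyshev:.4f}  Hoeffding width={hoeffding:.4f}")
```

For δ = 0.01 and n = 1000, for instance, the Chebyshev width is about 0.158 while the Hoeffding width is about 0.051, so the exponential bound is roughly three times tighter at the same confidence level.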

For all the good insight they help gain, the above bounds, by virtue of their pointwise nature, suffer from the following limitation: Given two distinct functions f and g, both from F, the set Z_f of samples for which the probabilistic inequality holds for the fixed f may differ from the set Z_g of samples for which the probabilistic inequality holds for the fixed g. In other words, crucially needed comparisons of the type

|R_n^z(f) - R(g)|  or  |R_n^z(g) - R(f)|

are not taken into account in the study of pointwise convergence. Clearly, more important than pointwise convergence is uniform convergence, which helps compare the risk functionals not just for a fixed function, but for functions across the whole space of functions under consideration. For instance, it is crucial to quantify, or at least bound, the difference between the empirical risk R_n^z(f_n^z) and the smallest risk R̃ = R(f̃) in the function class F under consideration, but also the difference between R_n^z(f̃) and R̃ = R(f̃). In other words, one needs to study random sequences like

|R(f_n^z) - inf_{f ∈ F} R(f)|  and  |R_n^z(f_n^z) - inf_{f ∈ F} R(f)|

to gain insight into the quality of the function estimation provided by f_n^z through the ERM principle. More generally, it boils down to investigating various aspects of

lim_{n→∞} Prob_{z ∈ Z^n} { sup_{f ∈ F} |R_n^z(f) - R(f)| < ε }.

A reasoning on error decomposition and consistency of estimators, along with rates, bounds and algorithms, applies to function spaces: indeed, the difference between the true risk R(f_n^z) associated with f_n^z and the overall minimum risk R* can be decomposed to explore in greater detail the sources of error in the function estimation process:

R(f_n^z) - R* = [R(f_n^z) - R(f̃)] + [R(f̃) - R*],     (3)

where the first bracketed term is the estimation error and the second is the approximation error.

Theorem 3 (Consistency of the Empirical Risk Minimization principle). The ERM principle is consistent if it provides a sequence of functions f_n^z, n = 1, 2, ..., for which both the expected risk R(f_n^z) and the empirical risk R_n^z(f_n^z) converge to the minimal possible value of the risk R(f̃) in the function class under consideration, i.e.,

R(f_n^z) →_P inf_{f ∈ F} R(f) = R(f̃)   and   R_n^z(f_n^z) →_P inf_{f ∈ F} R(f) = R(f̃).

It turns out that the above theorem on the consistency of the empirical risk minimization principle constitutes one of the four pillars of statistical learning theory as formulated and presented by authors such as [1]. When constructing function estimators, the least one should do is assess the convergence of the empirical quantities of interest to their theoretical counterparts. In [1], Vapnik provides the following four questions as the keys to statistical learning theory: (a) What are the necessary and sufficient conditions for the consistency of a learning process based on the ERM principle? This first question suggests the need for a theory of consistency of learning processes. (b) How fast is the rate of convergence of the learning process? This question opens the door to a nonasymptotic theory of the rate of convergence of learning processes, as opposed to the traditional, sometimes unrealistic, asymptotic theory. (c) How can one control the rate of convergence (the generalization ability) of the learning process? Here, the implication is the need to develop tools and a theory for controlling the generalization ability of learning processes. And finally, (d) How can one construct algorithms that can control the generalization ability of the learning process? This last question, clearly of interest to practitioners, leads statistical learning theory to seek tools, along with a theory, for constructing learning algorithms, with the aim of consolidating all four pillars. Algorithms constructed this way are expected to focus on the problem at hand, with all its aspects taken into account as thoroughly as possible. In [1], Vapnik discusses the details of this theorem at length, and extends the exploration to the difference between what he calls trivial consistency and non-trivial consistency. To better understand consistency in function spaces, consider the sequence of random variables

ξ_n = sup_{f ∈ F} |R(f) - R_n^z(f)|,     (4)

and consider studying

lim_{n→∞} Prob{ sup_{f ∈ F} |R(f) - R_n^z(f)| > ε } = 0, ∀ε > 0.

Vapnik shows that the sequence of the means of the random variables ξ_n converges to zero as the number n of observations increases [1]. He also remarks that the sequence of random variables ξ_n converges in probability to zero if the set of functions F contains a finite number N of elements. We will show this later in the case of pattern recognition. It then remains to describe the properties of the set of functions F and of the probability measure μ(x, y) under which the sequence of random variables ξ_n converges in probability to zero, i.e., under which

lim_{n→∞} Prob{ sup_{f ∈ F} [R(f) - R_n^z(f)] > ε   or   sup_{f ∈ F} [R_n^z(f) - R(f)] > ε } = 0.

3. Statistical Theory of Pattern Recognition


Instead of exploring the details of the above theorem in pure abstraction, one of the most commonly encountered statistical tasks will now be explored within the framework of statistical learning theory, and the details of the theorem will be clarified along the way: that task is statistical pattern recognition, and in this case binary classification in a two-dimensional space will be studied. For this problem, the so-called 0-1 loss function defined below is used. More specifically,

L(y, f(x)) = I(y ≠ f(x)) = { 1 if y ≠ f(x),  0 if y = f(x) }.     (5)

Now, with the zero-one loss function in binary classification, our corresponding true risk (also known as theoretical risk or generalization error or true error) is given by

R(f) = ∫ L(y, f(x)) dμ(x, y) = E[I(Y ≠ f(X))] = Prob_{(X,Y)∼μ}[Y ≠ f(X)].     (6)

The true error R(f) of a classifier f is therefore the probability that f misclassifies an arbitrary observation randomly drawn from the population of interest according to the distribution μ. It is important to note from the definition that R(f) can also be interpreted as the expected disagreement between the classifier f and the truth about the label y of x.

How is the Bayes' classifier obtained? Consider a pattern x from the input space, and a class label y. Let p(x|y) denote the class conditional density of x in class y, and let Prob[Y = y] denote the prior probability of class membership. The posterior probability of class membership is defined as

Prob[Y = y | x] = Prob[Y = y] p(x|y) / p(x).

Given a pattern x to be classified, the Bayes classification strategy consists of

Assigning x to the class with maximum posterior probability.

More formally, with f* denoting the function from X to {-1, +1} that implements the Bayes classifier, we have

f*(x) = +1, if Prob[Y = +1 | x] = max_{y ∈ {-1,+1}} Prob[Y = y | x],
f*(x) = -1, if Prob[Y = -1 | x] = max_{y ∈ {-1,+1}} Prob[Y = y | x].

For simplicity, we shall write

f*(x) = arg max_{c ∈ {-1,+1}} Prob(Y = c | x).     (7)

Theorem 4. The minimizer of the 0-1 risk functional over all possible classifiers is the Bayes classifier f* defined above. Therefore, the Bayes classifier f* is such that

f* = arg inf_f Prob_{(X,Y)∼μ}[Y ≠ f(X)] = arg inf_f E[I(Y ≠ f(X))].

Proof. Given x, the conditional risk (risk given x) can be broken down as follows: the conditional risk of assigning x to class +1 is

R(f(x) = +1) = L(f(x) = +1, Y = +1) Prob[Y = +1 | x] + L(f(x) = +1, Y = -1) Prob[Y = -1 | x]
= Prob[Y = -1 | x] = 1 - Prob[Y = +1 | x].

From the above, minimizing the risk R(f (x) = +1) of deciding to assign x to class +1 under the 0-1 loss is equivalent to maximizing the posterior probability Prob[Y = +1|x] of x being in class +1. Therefore, the function f that minimizes R(f (x) = +1)

is the same function that assigns x to class +1 precisely when Prob[Y = +1 | x] is the maximum posterior probability. With all that, if

g = arg min_f R(f(x) = +1),

then

g(x) = arg max_{c ∈ {-1,+1}} Prob(Y = c | x) = f*(x).

The same reasoning applies to R(f(x) = -1). The Bayes classifier is indeed the universal minimizer of the 0-1 risk functional. □
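To make the construction concrete, here is a small Python sketch (a hypothetical example, not taken from the paper, assuming NumPy and SciPy are available, with two univariate Gaussian class-conditional densities and equal priors assumed known) that implements the Bayes rule f*(x) = arg max_c Prob(Y = c | x) and estimates its risk R(f*) by Monte Carlo:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
priors = {+1: 0.5, -1: 0.5}                     # assumed prior probabilities Prob[Y = y]
densities = {+1: norm(loc=1.0, scale=1.0),      # assumed class-conditional densities p(x | y)
             -1: norm(loc=-1.0, scale=1.0)}

def bayes_classifier(x):
    # assign x to the class with maximum posterior, proportional to Prob[Y = y] * p(x | y)
    return max(priors, key=lambda y: priors[y] * densities[y].pdf(x))

# Monte Carlo estimate of the Bayes risk R(f*) = Prob[Y != f*(X)]
n = 20_000
ys = rng.choice([+1, -1], size=n, p=[priors[+1], priors[-1]])
xs = np.where(ys == +1, rng.normal(1.0, 1.0, n), rng.normal(-1.0, 1.0, n))
error_rate = np.mean([bayes_classifier(x) != y for x, y in zip(xs, ys)])
print(f"estimated Bayes risk: {error_rate:.3f}")   # close to 0.159 for this particular setup
```

Since the construction uses the true p(x|y) and Prob[Y = y], it is only available when the generating distribution is known, which is precisely what the next remark points out.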

Note that the definition and therefore the construction of the Bayes classifier requires the knowledge of the probability density p(x,y) = p(x|y)p(y) which in practice is unknown. As stated in section 1, one then has to select a space of functions that one assumes contains a good classifier of the data.

A Simple Two Dimensional Classification Task

Consider the classification task of Figure 1 as an illustrative example. In order to construct the Bayes classifier for this task, one needs to know the probability measure according to which the points are generated.

Figure 1. Binary classification task in a two-dimensional space (the two classes plotted in the (x1, x2) plane)

One way to circumvent the fact that μ(x, y) is unknown is to estimate the density p(x, y) and then construct the approximate Bayes classifier. However, density estimation, as warned by Vapnik [1] and many other authors, is not only a hard problem in its own right, but also an ill-posed one. In fact, a standard piece of wisdom promoted by Vapnik in statistical learning theory is to solve the classification problem as directly as possible and avoid complicating the task with intermediary and often harder problems. Instead of trying to construct the overall best classifier for the task, one should consider restricting the function search to a class of classifiers. In this case, the scatter of Figure 1 seems to suggest that linear separation might be a decent strategy. In other words, one may consider finding the best linear separating hyperplane, i.e.

F = { f : X → {-1, +1} : ∃ a_0 ∈ ℝ, a = (a_1, ..., a_p)^T ∈ ℝ^p such that f(x) = sign(a^T x + a_0), ∀x ∈ X }.

It is important to note that although the function spaces in the above examples are driven by parameters that are components of a vector, a need not be a vector for more


general function spaces. In fact, a is allowed to be any abstract set of parameters, so that any arbitrary set of functions can be defined and handled within this framework. Now, since the true risk of the best function in this class F cannot be computed, because μ(x, y) is unknown, one has to turn to the empirical risk. The empirical risk, or empirical error, corresponding to the true risk of equation (6), is given by

R_n(f) = (1/n) Σ_{i=1}^n I(y_i ≠ f(x_i)) = Prob_{(X,Y)∼S}[Y ≠ f(X)].     (8)

Here I is the indicator function, and Prob_{(X,Y)∼S}[·] is a probability taken with respect to the uniform (empirical) distribution over the sample S. R_n(f) is therefore the realized disagreement between the classifier f and the truth about the label y of x, based on the information contained in the sample S. In the case of the example of Figure 1, the function

f_n = arg min_{f ∈ F} R_n(f)

is the straight line that separates the two classes with the minimum number of misclassifications. It is crucial to mention that, in the presence of the data of Figure 1, seeking to minimize the empirical risk without restricting the function class does lead to overfitting. Indeed, one could construct a classifier that achieves zero empirical risk, but such a classifier is not only too complex, it also does not generalize well, in the sense that future observations are likely to be misclassified. The control of complexity mentioned earlier as one of the pillars of statistical learning theory comes into play in exactly such cases. Statistical learning theory deals with this issue by way of the so-called structural risk minimization principle [1]. For now, the focus is on the convergence of the ERM principle in pattern recognition. It is easy to see that, for a given (fixed) function (classifier) f,

E[Rn(f)] = R(f). (9)
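As an illustration of the ERM principle over the linear class F defined above, the following Python sketch (a hypothetical example with simulated Gaussian data standing in for the sample of Figure 1, and a crude random search over (a_0, a) standing in for a proper optimizer) picks the hyperplane with the smallest empirical 0-1 risk on the sample:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated two-class sample in the plane, in the spirit of Figure 1 (assumed, not the paper's data)
n = 200
y = rng.choice([-1, +1], size=n)
X = rng.normal(0.0, 1.0, size=(n, 2)) + 2.0 * y[:, None]   # class means at (-2, -2) and (+2, +2)

def empirical_risk(a0, a, X, y):
    # R_n(f) = (1/n) sum_i I(y_i != sign(a^T x_i + a_0)), the empirical 0-1 risk of equation (8)
    return np.mean(np.sign(X @ a + a0) != y)

# Crude ERM over the linear class F: random search over the parameters (a_0, a)
best_risk, best_params = 1.0, None
for _ in range(5000):
    a0, a = rng.normal(), rng.normal(size=2)
    risk = empirical_risk(a0, a, X, y)
    if risk < best_risk:
        best_risk, best_params = risk, (a0, a)

print(f"smallest empirical risk found: {best_risk:.3f}")
```

With well-separated classes the search typically attains zero empirical risk; as the text warns, a zero empirical risk by itself says nothing about the true risk unless the complexity of F is controlled.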

Remember that the goal of statistical function estimation is to devise a technique (strategy) that chooses, from the function class F, the one function whose true risk is as close as possible to the lowest risk in the class F. The question arises: since one cannot calculate the true error, how can one devise a learning strategy for choosing classifiers based on it? Tentative answer: at least devise strategies that yield functions for which the upper bound on the theoretical risk is as tight as possible, so that one can make confidence statements of the form:

With probability 1 - δ over an i.i.d. draw of some sample according to the distribution μ, the expected future error rate of some classifier is bounded by some function of δ and of the error rate on the sample.

If one resorts to Chebyshev's inequality encountered earlier, it is easy to see that

Prob_{z ∈ Z^n}{ |R_n(f) - R(f)| > ε } ≤ R(f)(1 - R(f)) / (nε²)

for a given classifier f. Since R(f)(1 - R(f)) ≤ 1/4, we have

√( R(f)(1 - R(f)) / (nδ) ) ≤ √( 1/(4nδ) ).

Based on Chebyshev's inequality, for a given classifier f, with a probability of at least 1 - δ, the bound on the difference between the true risk R(f) and the empirical risk R_n(f) is given by

|R_n(f) - R(f)| ≤ √( 1/(4nδ) ).

A tighter bound is derived from Hoeffding's inequality mentioned earlier. More specifically, for a fixed function f and for any δ ∈ (0,1),

R(f) ≤ R_n(f) + √( ln(2/δ)/(2n) )

with probability at least 1 - δ.

Fact 1. The bound yielded by Hoeffding's inequality is tighter than the one derived from Chebyshev's inequality.

Proof. Clearly, we need to find out which of ln(2/δ) or 1/(2δ) is larger. This is the same as comparing exp(1/(2δ)) and 2/δ, which in turn means comparing a^(2/δ) and 2/δ, where a = exp(1/4). For the small values of δ typically used (for instance δ ≤ 0.05), a^(2/δ) > 2/δ, so that Hoeffding's bound is tighter. The graphs below confirm this. □

Figure 2. Chernoff vs Chebyshev bounds for proportions: δ = 0.01 (bound width plotted against the sample size n)

Figure 3. Chernoff vs Chebyshev bounds for proportions: δ = 0.05 (bound width plotted against the sample size n)

In all the above, we only addressed pointwise convergence of R_n(f) to R(f), i.e., for a fixed machine f ∈ F, we studied the convergence of

Prob{ |R_n(f) - R(f)| > ε }  to 0.

Needless to say, pointwise convergence is of very little use here, for reasons discussed in great detail earlier. Indeed, when one writes, for a fixed function f and for any δ ∈ (0,1),

R(f) ≤ R_n(f) + √( ln(2/δ)/(2n) )

with probability at least 1 - δ, one actually means the following: if we choose a fixed function f and then collect many different samples z ∈ Z^n, then the corresponding empirical risks for most of those samples, a proportion of at least 1 - δ of them, will be close to the true risk. However, for a fixed sample z ∈ Z^n, one can find a function for which the difference between the empirical risk and the theoretical risk is very large, especially if the function class F is large enough. Therefore, for bounds to be useful, they have to apply to all functions in F, not just pointwise. The more interesting issue to address is uniform convergence. That is, for all machines f ∈ F, determine the necessary and sufficient conditions for the convergence of

Prob{ sup_{f ∈ F} |R_n(f) - R(f)| > ε }  to 0.

4. Derivation of Bounds on the Generalization Error

4.1. Bounds for Finite Function Classes

Suppose that the function class F has m functions in it, i.e., F = {f_1, f_2, ..., f_m}. One seeks to derive bounds that apply at once to all the functions in F. In other words, instead of stopping at the inequality

Prob{ |R_n(f) - R(f)| > ε } ≤ 2 exp(-2nε²),

one needs to compute the probability

Prob{ sup_{f ∈ F} |R_n(f) - R(f)| > ε } = Prob{ ∃f ∈ F : |R_n(f) - R(f)| > ε }.

The supremum of the deviations is greater than ε if there exists at least one function whose deviation is greater than ε.

Lemma 1. If the function class F has m functions in it, i.e., F = {f_1, f_2, ..., f_m}, then Hoeffding's inequality extends to the supremum, so that, ∀ε > 0,

Prob{ sup_{f ∈ F} |R_n(f) - R(f)| > ε } ≤ 2m exp(-2nε²).

Proof.

Prob_{z ∈ Z^n}{ sup_{f ∈ F} |R_n(f) - R(f)| > ε } = Prob_{z ∈ Z^n}{ ∃f ∈ F : |R_n(f) - R(f)| > ε }
= Prob{ ∪_{i=1}^m { |R_n(f_i) - R(f_i)| > ε } } ≤ Σ_{i=1}^m Prob{ |R_n(f_i) - R(f_i)| > ε } ≤ 2m exp(-2nε²). □
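The union bound can also be checked empirically. The Python sketch below (a hypothetical setup with a small finite class of threshold classifiers on a noiseless one-dimensional problem, chosen only for illustration) estimates the probability of a large uniform deviation and compares it with 2m exp(-2nε²):

```python
import numpy as np

rng = np.random.default_rng(3)

# Finite class of threshold classifiers f_t(x) = sign(x - t) on a simple 1-D problem:
# X ~ Uniform(0, 1) and Y = +1 if X > 0.5 else -1, so the true risk of f_t is |t - 0.5|.
thresholds = np.linspace(0.0, 1.0, 21)          # m = 21 functions
true_risks = np.abs(thresholds - 0.5)

n, eps, trials = 500, 0.07, 2000
exceed = 0
for _ in range(trials):
    x = rng.uniform(0, 1, n)
    y = np.where(x > 0.5, 1, -1)
    emp_risks = np.array([np.mean(np.where(x > t, 1, -1) != y) for t in thresholds])
    exceed += np.max(np.abs(emp_risks - true_risks)) > eps

union_bound = 2 * len(thresholds) * np.exp(-2 * n * eps**2)
print(f"observed frequency: {exceed / trials:.4f}   union bound: {union_bound:.4f}")
```

As expected, the observed frequency of a uniform deviation larger than ε sits well below the (conservative) union bound.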

Proposition 1. If the function class F is finite, i.e. F = {f_1, f_2, ..., f_m}, where m = |F| is the number of functions in the class F, then, for all f ∈ F,

R(f) ≤ R_n(f) + ( (ln m + ln(2/δ)) / (2n) )^(1/2)

with probability at least 1 - δ, ∀δ > 0.

Proof. It is obvious that for all f ∈ F,

|R_n(f) - R(f)| ≤ sup_{f ∈ F} |R_n(f) - R(f)|.

Therefore, by virtue of the above lemma, ∀f ∈ F,

Prob{ |R_n(f) - R(f)| > ε } ≤ 2m exp(-2nε²). For all δ ∈ (0,1), setting δ = 2m exp(-2nε²) and solving for ε yields

∀f ∈ F,   |R_n(f) - R(f)| ≤ ( (ln m + ln(2/δ)) / (2n) )^(1/2)

with probability at least 1 - δ, as required. □

Since the above result applies to all the functions in F, it therefore applies to

f_n = arg min_{f ∈ F} R_n(f),

which is the function constructed by the algorithm to classify the data. One can therefore safely write,

R(f_n) ≤ R_n(f_n) + ( (ln m + ln(2/δ)) / (2n) )^(1/2),

thereby providing a bound on the true error of the classifier at hand. The term ln m reflects the fact that the bound on the generalization error must hold simultaneously for all the functions in the class F.
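For concreteness, here is a two-line Python computation of this finite-class bound (with illustrative values of m, n and δ chosen arbitrarily):

```python
import math

def finite_class_bound(m, n, delta):
    # deviation term ((ln m + ln(2/delta)) / (2n))**0.5 from Proposition 1
    return math.sqrt((math.log(m) + math.log(2 / delta)) / (2 * n))

# e.g. m = 1000 candidate classifiers, n = 5000 examples, confidence 1 - delta = 0.95
print(finite_class_bound(m=1000, n=5000, delta=0.05))   # about 0.033
```

So with 1000 candidate functions and 5000 examples, the empirical error of any f in the class, including the ERM solution f_n, is within roughly three percentage points of its true error with probability at least 0.95.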

While the above results help compare the empirical risk to the theoretical risk across the function class, there remains the need to compare the true risk of the constructed function R(fn) to the smallest risk in the class or the smallest risk overall. The following theorem helps do just that.

Theorem 5. If R_n(f) and R(f) are close for all f ∈ F, i.e., if for some ε > 0, sup_{f ∈ F} |R_n(f) - R(f)| ≤ ε, then R(f_n) - R(f̃) ≤ 2ε.

Proof. Recall that f_n was defined as the minimizer of the empirical risk R_n(f) over the function class F, so that R_n(f_n) is the smallest empirical risk over F. Therefore, with f̃ being the function with the smallest true risk in class F, we always have

R_n(f̃) - R_n(f_n) ≥ 0.

As a result,

R(f_n) = R(f_n) - R(f̃) + R(f̃) ≤ [R_n(f̃) - R_n(f_n)] + [R(f_n) - R(f̃)] + R(f̃)
= [R_n(f̃) - R(f̃)] + [R(f_n) - R_n(f_n)] + R(f̃) ≤ 2 sup_{f ∈ F} |R(f) - R_n(f)| + R(f̃).

Consequently,

R(f_n) - R(f̃) ≤ 2 sup_{f ∈ F} |R(f) - R_n(f)|,

as required. □

Corollary 1. A direct consequence of the above theorem is the following:

R(f_n) ≤ R(f̃) + 2 ( (ln m + ln(2/δ)) / (2n) )^(1/2)     (10)

with probability at least 1 - δ, ∀δ > 0, where, as before,

f̃ = arg inf_{f ∈ F} R(f)   and   f_n = arg min_{f ∈ F} R_n(f).

Equation (10) is of foundational importance, because it reveals clearly that the size of the function class controls the uniform bound on the crucial generalization error: indeed, if the size m of the function class F increases, then the best-in-class risk R(f̃) decreases while the complexity term of the bound increases, so that the trade-off between the two is controlled by the size m of the function class. This raises the natural question as to what happens when the function class is infinite dimensional. Indeed, for infinite dimensional function spaces, one will need to introduce such concepts as the capacity of the function space, measured through devices such as the VC-dimension and covering numbers.

4.2. Statistical Theory for Infinite Dimensional Spaces


When the function class F is uncountable, one way to define its "size" is by way of the samples on which the functions from that class operate. In the case of binary classification, for instance, the number of labellings (-1, +1) produced by separating hyperplanes on a given sample z ∈ Z^n of size n is finite, even though the number of hyperplanes itself is infinite. One could define

F_z = { (f(z_1), f(z_2), ..., f(z_n)) : f ∈ F }

to be the set of ways in which z = (z1, z2,..., zn) can be classified.

Def 2. The growth function is the maximum number of ways in which n points can be classified by a function class. More specifically,

S_F(n) = sup_{z ∈ Z^n} |F_z|.

For the binary classification task, there are 2^n possible ways to classify a sample of size n into the two classes {-1, +1}. Therefore, for binary classification, S_F(n) ≤ 2^n for any function class F. The following theorem by Vapnik and Chervonenkis plays a foundational role in statistical learning theory.

Theorem 6 (Vapnik-Chervonenkis). For any δ ∈ (0,1),

∀f ∈ F,   R(f) ≤ R_n(f) + 2 √( 2 (ln S_F(2n) + ln(2/δ)) / n )

with probability at least 1 - δ.

It is therefore crucial to be able to compute the growth function S_F(n) of a given function class F. The concept of VC-dimension provides a way to manipulate the growth function.

Def 3 (VC-dimension). The VC-dimension h of a function class F is the largest n such that

S_F(n) = 2^n.

It is shown in [1] that, if F is the class of separating hyperplanes in a p-dimensional space, then

VC dim(F) = h = p + 1.

The VC-dimension, and the growth function for that matter, can be viewed as measures of the effective size of a function class. In a sense, by "projecting" the function class onto a finite sample, one avoids simply counting the number of functions in that class; instead, one captures the geometry of the function class and can therefore compute finite quantities that measure the size of that class relative to the finite sample.

The question remains: how does the VC-dimension help provide bounds on the generalization error when the function class is infinite dimensional? The answer requires the following lemma.

Lemma 2 (Vapnik and Chervonenkis, Sauer, Shelah). Let F be a function class with finite VC-dimension h. Then

S_F(n) ≤ Σ_{i=0}^{h} C(n, i), where C(n, i) denotes the binomial coefficient, and, for all n ≥ h, S_F(n) ≤ (en/h)^h.

From the above lemma, the following result is derived immediately: if a function class F has finite VC-dimension h, then for all δ ∈ (0,1),

∀f ∈ F,   R(f) ≤ R_n(f) + 2 √( 2 ( h(ln(2n/h) + 1) + ln(2/δ) ) / n )     (11)

with probability at least 1 - δ. As a consequence of the above result, the difference between the empirical risk and the true risk is of order at most √( (h ln n)/n ). A finite VC-dimension therefore ensures that the empirical risk converges uniformly over the class to the true risk.
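As a numerical illustration, the short Python sketch below (assuming, say, the class of separating hyperplanes in the plane, so h = p + 1 = 3) evaluates the complexity term of inequality (11) for a few sample sizes:

```python
import math

def vc_complexity_term(h, n, delta):
    # 2 * sqrt(2 * (h * (ln(2n/h) + 1) + ln(2/delta)) / n), the complexity term of inequality (11)
    return 2 * math.sqrt(2 * (h * (math.log(2 * n / h) + 1) + math.log(2 / delta)) / n)

h, delta = 3, 0.05    # VC-dimension of separating hyperplanes in the plane: h = p + 1 = 3
for n in (100, 1000, 10000, 100000):
    print(f"n={n:7d}   R(f) <= R_n(f) + {vc_complexity_term(h, n, delta):.3f}")
```

For small n the bound is vacuous (larger than 1), which is typical of worst-case VC bounds; the √((h ln n)/n) decay only pays off once n is large relative to h.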

5. Conclusion and Discussion


This paper has provided a general introduction to the theory underlying statistical function estimation and pattern recognition. As mentioned in section 1, there are indeed four pillars of statistical learning theory, of which we have touched on the first two, namely, (a) the necessary and sufficient conditions for the consistency of the Empirical Risk Minimization (ERM) principle, and (b) the derivation of non-asymptotic rates of convergence thereof. The remaining two aspects of the foundation, namely (c) the control of complexity and (d) the construction of learning algorithms, require a substantial amount of space. The introduction provided here sheds enough light on the usefulness and the challenges of this field. Clearly, while it is good to build classifiers, it is crucial to study their theoretical properties, and that is what this field provides.

The field is vast, and the topics are varied and many. A more detailed account of the subject, with applications and more advanced theoretical developments, can be found in the cited references.

References

1. Vapnik V. N. The Nature of Statistical Learning Theory. — Springer, 2000.

2. Bousquet O. Statistical Learning Theory. Machine Learning Summer School. — Tübingen, Germany, 2003. — http://www.kyb.mpg.de/publication.html?user=bousquet.

3. Bousquet O., Boucheron S., Lugosi G. Introduction to Statistical Learning Theory. Advanced Lectures on Machine Learning // Lecture Notes in Artificial Intelligence. — Vol. 3176. — 2004. — Pp. 169-207.

4. Tipping M. E. Sparse Bayesian Learning and the Relevance Vector Machine // Journal of Machine Learning Research. — Vol. 1. — 2001. — Pp. 211-244.

5. Cucker F., Smale S. On The Mathematical Foundations of Learning // Bulletin of the American Mathematical Society. — Vol. 39, No 1. — 2001. — Pp. 1-49.

6. Evgeniou T., Pontil M., Poggio T. Statistical Learning Theory: A Primer // International Journal of Computer Vision. — Vol. 38, No 1. — 2000. — Pp. 9-13.

7. Heisele B., Verri A., Poggio T. Learning and Vision Machines // Proceedings of the IEEE. — Vol. 90, No 7. — 2002. — Pp. 1164-75.

UDC 519.2, 004.93

Foundational Aspects of the Theory of Statistical Function Estimation and Pattern Recognition

E. Fokoue

Kettering University, Flint, MI, USA, 48504

This article provides a brief overview of the fundamental ideas, concepts and results of the theory of statistical function estimation and pattern recognition. The material is based on the Vapnik-Chervonenkis theory. Particular attention is paid to helping the reader appreciate the importance of extending the classical law of large numbers to function spaces, and the key role played by such new notions as the empirical risk minimization principle and the consistency of estimators in constructing algorithms that yield function estimators with optimal properties.
