
Mathematical Structures and Modeling 2018. N. 2(46). PP. 102-106

UDC 004.85 DOI: 10.25513/2222-8772.2018.2.102-106

WHY DEEP LEARNING METHODS USE KL DIVERGENCE INSTEAD OF LEAST SQUARES: A POSSIBLE PEDAGOGICAL EXPLANATION

Olga Kosheleva

Ph.D. (Phys.-Math.), Associate Professor, e-mail: olgak@utep.edu

Vladik Kreinovich

Ph.D. (Phys.-Math.), Professor, e-mail: vladik@utep.edu

University of Texas at El Paso, El Paso, Texas 79968, USA

Abstract. In most applications of data processing, we select the parameters that minimize the mean square approximation error. The same Least Squares approach has been used in the traditional neural networks. However, for deep learning, it turns out that an alternative idea works better — namely, minimizing the Kullback-Leibler (KL) divergence. The use of KL divergence is justified if we predict probabilities, but the use of this divergence has been successful in other situations as well. In this paper, we provide a possible explanation for this empirical success. Namely, the Least Squares approach is optimal when the approximation error is normally distributed — and can lead to wrong results when the actual distribution is different from normal. The need to have a robust criterion, i.e., a criterion that does not depend on the corresponding distribution, naturally leads to the KL divergence.

Keywords: Deep learning, Kullback-Leibler divergence.

1. Formulation of the Problem

Machine learning: reminder. The main problem of machine learning is:

• given input-output patterns (x(k), y(k)),

• to come up with a function f(x) for which f(x(k)) ≈ y(k).

This function can then be used to predict the output y for other inputs x. In each model of machine learning:

• we have a function f(x, c) depending on some parameters c = (c1, c2, ...), and we need to find the values of these parameters for which the resulting values z(k) = f(x(k), c) are approximately equal to the given value y(k):

$$z^{(k)} \approx y^{(k)}.$$

How to describe this approximate equality: traditional approach. Traditionally, in machine learning, the Least Squares approach was used to describe the desired approximate equality of:

• the result z(k) of applying the model f (x,c) to the input x(k) and


• the given outputs y(k);

see, e.g., [1]. Specifically, most traditional methods minimize the sum

$$\sum_{k}\left(z^{(k)} - y^{(k)}\right)^2. \qquad (1)$$
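As an illustration only (this sketch is not part of the paper), the criterion (1) is easy to write down in code; the linear model and the toy data below are made-up assumptions, used just to show how the sum of squared deviations is computed and minimized:

```python
import numpy as np

# Toy input-output patterns (made-up data, for illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])

def model(x, c):
    """A simple linear model f(x, c) = c[0] + c[1] * x (an assumed example)."""
    return c[0] + c[1] * x

def least_squares_objective(c):
    """Criterion (1): the sum of squared deviations z(k) - y(k)."""
    z = model(x, c)
    return np.sum((z - y) ** 2)

# Minimize (1) by a brute-force grid search, to keep the sketch dependency-free.
c0_grid = np.linspace(-2.0, 2.0, 201)
c1_grid = np.linspace(0.0, 4.0, 201)
best = min(((least_squares_objective((c0, c1)), (c0, c1))
            for c0 in c0_grid for c1 in c1_grid), key=lambda t: t[0])
print("best parameters:", best[1], "objective:", best[0])
```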

Deep learning techniques use KL divergence instead. In deep learning, it turned out that better results are obtained if, instead of the least squares technique (1), we use the Kullback-Leibler (KL) divergence; see, e.g., [2,3]. Specifically, we re-scale the values y(k) and z(k) so that these values are always between 0 and 1, and then minimize the following objective function:

$$-\sum_{k=1}^{K}\left[y^{(k)} \cdot \log\left(z^{(k)}\right) + \left(1 - y^{(k)}\right) \cdot \log\left(1 - z^{(k)}\right)\right]. \qquad (2)$$
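For readers who prefer code, criterion (2) is the familiar cross-entropy between the rescaled targets y(k) and the model outputs z(k). The following sketch (made-up values, not from the paper) just evaluates it:

```python
import numpy as np

def kl_objective(y, z, eps=1e-12):
    """Criterion (2): cross-entropy between rescaled targets y(k) and outputs z(k).

    Both y and z are assumed to have been rescaled into (0, 1); eps guards
    against taking log(0) in this illustrative sketch.
    """
    z = np.clip(z, eps, 1.0 - eps)
    return -np.sum(y * np.log(z) + (1.0 - y) * np.log(1.0 - z))

# Made-up rescaled targets and model outputs, for illustration only.
y = np.array([0.10, 0.40, 0.75, 0.90])
z = np.array([0.15, 0.35, 0.70, 0.85])
print(kl_objective(y, z))
```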

Why? At first glance, the least squares is a reasonable criterion, the one most frequently used in statistical data processing; see, e.g., [4]. So why is the alternative approach working better?

For the case when the predicted values y(k) are probabilities, an explanation is given in [2], Section 5.5. However, the criterion (2) is also successfully used in many applications in which the predicted values y(k) are not probabilities. How can we explain this success?

What we do in this paper. In this paper, we extend the existing probability-case explanation to the general case, thus providing a possible explanation of why KL divergence works well.

2. Our Explanation

Why least squares: reminder. In order to explain why KL divergence is efficient, let us first recall why the Least Squares method is often used.

Ideally, the deviations z(k) - y(k) should be all 0s, but in reality, we can only attain an approximate equality. In different situations, we get different values of these deviations. It is therefore reasonable to view these deviations as random variables.

In practice, many random variables are normally distributed; see, e.g., [4]. It is therefore reasonable to assume that the deviations z(k) - y(k) are normally distributed, with 0 means and some standard deviation σ. The corresponding probability density function is thus equal to

$$\frac{1}{\sqrt{2\pi}\cdot\sigma}\cdot\exp\left(-\frac{\left(z^{(k)} - y^{(k)}\right)^2}{2\sigma^2}\right). \qquad (3)$$

Since we do not have any reason to believe that different deviations are positively or negatively correlated, it is reasonable to assume that different deviations are independent. In this case, for each tuple c, the probability (density) is equal to the product of the corresponding probabilities (3), i.e., equal to

$$\rho(c) = \prod_{k=1}^{K} \frac{1}{\sqrt{2\pi}\cdot\sigma}\cdot\exp\left(-\frac{\left(z^{(k)} - y^{(k)}\right)^2}{2\sigma^2}\right), \qquad (4)$$

where z(k) = f(x(k),c). It is reasonable to select the tuple c which is the most probable, i.e., for which the expression (4) is the largest possible; this natural idea is known as the Maximum Likelihood approach.
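A small numerical sketch (not from the paper; the constant model, the noise level, and the data are assumed for illustration) shows the Maximum Likelihood idea at work: for normally distributed deviations, the parameter value that maximizes the product (4) coincides with the one that minimizes the Least Squares sum (1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setting: the model is a single unknown constant c1,
# observed with normally distributed deviations (sigma = 1).
true_c1, sigma = 10.0, 1.0
data = true_c1 + sigma * rng.normal(size=100)

def log_likelihood(c1):
    """Logarithm of the product (4) for the constant model z(k) = c1."""
    return np.sum(-0.5 * ((data - c1) / sigma) ** 2
                  - np.log(np.sqrt(2 * np.pi) * sigma))

def sum_of_squares(c1):
    """The Least Squares criterion (1) for the same model."""
    return np.sum((data - c1) ** 2)

grid = np.linspace(5.0, 15.0, 10001)
c_ml = grid[np.argmax([log_likelihood(c) for c in grid])]
c_ls = grid[np.argmin([sum_of_squares(c) for c in grid])]
print(c_ml, c_ls)  # both are (numerically) the arithmetic mean of the data
```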

Maximizing the expression (4) is equivalent to minimizing its negative logarithm

$$-\ln(\rho(c)) = \text{const} + \sum_{k=1}^{K} \frac{\left(z^{(k)} - y^{(k)}\right)^2}{2\sigma^2},$$

and this minimization is equivalent to minimizing the Least Squares expression (1).

Need to go beyond the Least Squares. While many practical probability distributions are normal, there are also many cases when the probability distribution is different from normal; see, e.g., [4]. In such cases, the Least Squares method is not optimal — and it can be very far from optimal. For example, if we have a distribution with heavy tails, for which the probability of large deviations is high, the Least Squares method often leads to erroneous estimates.

This can be illustrated on the simple example when the model f(x, c) = c1 is simply an unknown constant c1. In this case, if we minimize the sum $\sum_{k=1}^{K}\left(x^{(k)} - c_1\right)^2$ — by differentiating with respect to c1 and equating the derivative to 0 — we get the estimate

$$c_1 = \frac{1}{K}\cdot\sum_{k=1}^{K} x^{(k)}.$$

If the actual value of c1 is, e.g., 10, and we get K = 100 values close to 10, then the arithmetic average is indeed close to 10. But if one of the values is an outlier, e.g., $x^{(1)} = 10^6$, then the arithmetic average is close to 10,000 — way beyond the actual value 10.
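These numbers are easy to reproduce; the following sketch (with assumed data) shows how a single outlier drags the least-squares estimate, i.e., the arithmetic mean, far away from the actual value:

```python
import numpy as np

rng = np.random.default_rng(1)

# 100 observations of an unknown constant c1 = 10, with small noise.
values = 10.0 + 0.1 * rng.normal(size=100)

print(np.mean(values))   # close to 10, as expected

# Replace one observation by a heavy-tailed outlier.
values[0] = 1e6
print(np.mean(values))   # roughly 10,000: the estimate is ruined
```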

To take this non-normality into account, we need to replace the Least Squares approach with one which is robust, in the sense that it does not depend on the probability distribution of the deviations.

Main idea. In the computer, any real value is represented as 0s and 1s. To transform a real-valued signal into a sequence of 0s and 1s, measuring instruments use analog-to-digital converters. These converters are usually based on comparing the actual value with some threshold values. For example, if the actual value is between 0 and 1, then, by comparing this value with 0.5, we can tell whether the first bit in its binary expansion is 0 or 1:

• if the actual value is smaller than 0.5, then the first bit is 0, and

• if the actual value is larger than 0.5, then the first bit is 1.

By selecting a second threshold to be 0.25 or 0.75, we can determine the second bit, etc.
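As a sketch of this analog-to-digital idea (illustration only, not from the paper), successive comparisons with mid-point thresholds recover the binary expansion of a value from [0, 1):

```python
def bits_by_thresholds(value, n_bits=8):
    """Recover the first n_bits of the binary expansion of value in [0, 1)
    by successive comparisons with mid-point thresholds (0.5, then 0.25 or
    0.75, etc.), as an analog-to-digital converter would."""
    bits, lo, hi = [], 0.0, 1.0
    for _ in range(n_bits):
        threshold = (lo + hi) / 2
        if value < threshold:     # bit is 0: the value lies in the lower half
            bits.append(0)
            hi = threshold
        else:                     # bit is 1: the value lies in the upper half
            bits.append(1)
            lo = threshold
    return bits

print(bits_by_thresholds(0.6875))  # [1, 0, 1, 1, 0, 0, 0, 0]
```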

The thresholds are not necessarily binary-rational numbers: often, other thresholds are used, and then the resulting number is recovered from the results of the corresponding comparisons.

We want to come up with a probabilistic interpretation; thus, it makes sense to select random thresholds. The simplest possible random number generator generates values uniformly distributed on the interval [0,1]. Such random number generators are included in most programming languages.

Resulting setting. So, to describe each value y(k), let us run this simplest random number generator a large number of times N, and store the N results of comparing y(k) with the corresponding random numbers r_i:

• we store 1 if r_i ≤ y(k), and

• we store 0 if r_i > y(k).

For a random number uniformly distributed on the interval [0,1], the probability to be in each interval is equal to the width of this interval. In particular, the probability to be smaller than or equal to y(k) — i.e., the probability to be in the interval [0, y(k)] — is equal to y(k). Thus, for large N, we have:

• approximately N · y(k) 1s, and

• approximately N · (1 - y(k)) 0s.

For each of K patterns, we have N 0-1 records, so overall, we have a long sequence of N · K records corresponding to all K patterns. This sequence corresponds to the observations.
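A short simulation (a sketch with made-up values y(k)) confirms that this randomized encoding indeed produces approximately N · y(k) ones per pattern:

```python
import numpy as np

rng = np.random.default_rng(2)

N = 100_000
y = np.array([0.10, 0.40, 0.75, 0.90])   # made-up rescaled values y(k)

# For each pattern k, compare y(k) with N uniform random thresholds r_i:
# the record is 1 when r_i <= y(k), and 0 otherwise.
records = rng.random((len(y), N)) <= y[:, None]

print(records.mean(axis=1))  # fractions of 1s are close to the values y(k)
```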

Derivation of KL divergence. We want to find the tuple c that best fits the above long sequence of observations.

For each tuple c and for each pattern k, the model f (x,c) returns the value z(k) = f (x(k),c). Thus:

• the probability to get 1 when we compare this value with a random value r_i is equal to z(k), and

• the probability to get 0 is equal to the remaining probability 1 - z(k).

So, for each pattern, we have:

• N · y(k) observations with probability z(k), and

• N · (1 - y(k)) observations with probability 1 - z(k).

Assuming — as before — that all observations are independent, we conclude that the probability of observing the given sequence of 0s and 1s is equal to the product of all these probabilities, i.e., to the value

$$\left(z^{(k)}\right)^{N\cdot y^{(k)}} \cdot \left(1 - z^{(k)}\right)^{N\cdot\left(1 - y^{(k)}\right)}.$$

The overall probability can be obtained by multiplying probabilities corresponding to all K patterns. Thus, we select a tuple c for which the following expression is the largest possible:

$$p \stackrel{\text{def}}{=} \prod_{k=1}^{K}\left(z^{(k)}\right)^{N\cdot y^{(k)}} \cdot \left(1 - z^{(k)}\right)^{N\cdot\left(1 - y^{(k)}\right)}.$$

Maximizing this expression p is equivalent to minimizing its negative logarithm

$$-\ln(p) = -\sum_{k=1}^{K}\left[N\cdot y^{(k)}\cdot\ln\left(z^{(k)}\right) + N\cdot\left(1 - y^{(k)}\right)\cdot\ln\left(1 - z^{(k)}\right)\right].$$

This expression is N times larger than the KL divergence (2). Thus, minimizing this expression is indeed equivalent to minimizing the KL divergence.
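This equivalence can be checked numerically; in the following sketch (made-up values y(k) and z(k), for illustration), the negative logarithm −ln(p) is exactly N times the criterion (2):

```python
import numpy as np

N = 1000
y = np.array([0.10, 0.40, 0.75, 0.90])   # made-up rescaled targets y(k)
z = np.array([0.15, 0.35, 0.70, 0.85])   # made-up model outputs z(k)

# Criterion (2): the KL-divergence-based objective.
kl_objective = -np.sum(y * np.log(z) + (1 - y) * np.log(1 - z))

# Negative log-likelihood of the 0-1 records: N*y(k) ones with probability
# z(k) and N*(1 - y(k)) zeros with probability 1 - z(k), for each pattern k.
neg_log_p = -np.sum(N * y * np.log(z) + N * (1 - y) * np.log(1 - z))

print(neg_log_p, N * kl_objective)   # the two numbers coincide
```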

So, we indeed get the desired explanation for minimizing the KL divergence.

Acknowledgments

This work was supported in part by the National Science Foundation grant HRD-1242122 (Cyber-ShARE Center of Excellence).

References

1. Bishop C.M. Pattern Recognition and Machine Learning. New York : Springer Verlag, 2006.

2. Goodfellow I., Bengio Y., Courville A. Deep Learning. Cambridge : MIT Press, 2016.


3. Liu P., Choo K.-K.R., Wang L., Huang F. SVM or deep learning? A comparative study on remote sensing image classification // Soft Computing. 2017. V. 21. P. 7053-7065.

4. Sheskin D.J. Handbook of Parametric and Nonparametric Statistical Procedures. Boca Raton : Chapman and Hall/CRC, 2011.


Received by the editorial office: 27.12.2017
