Scientific article on the topic 'Parameters Estimates on Samples with Contamination'

Parameters Estimates on Samples with Contamination. Text of a scientific article in the specialty "Mathematics"

CC BY
Keywords
confidence probability / sample with contamination / MCD algorithm

Abstract of the scientific article in mathematics; author: Yao Keyu

Parametric estimates based on contaminated samples are considered in the paper. The paper provides an overview of algorithms for estimating the mean and the variance for a one-dimensional sample, as well as estimating the mean vector and the covariance matrix for a multidimensional sample. The paper uses the Minimal Covariance Determinant (MCD) algorithm adapted for one-dimensional sample and the MCD algorithm for multidimensional sample. The parameters are estimated on a subsample, the size of which is determined by a given confidence probability. Examples for samples with different levels of contamination are considered. In both examples, the sample was a union of two subsamples. The first subsample, the main one, was generated by normal distribution laws. The second subsample, auxiliary, was generated by different distribution laws. The examples demonstrate the dependence of the estimation accuracy on the confidence level and contamination. The figures illustrate the operation of the MCD algorithm. The main idea of the paper is to show the robustness of the MCD algorithm.



ISSN 1026-2237 BULLETIN OF HIGHER EDUCATIONAL INSTITUTIONS. NORTH CAUCASUS REGION. NATURAL SCIENCE. 2024. No. 1

Original article UDC 519.2

doi: 10.18522/1026-2237-2024-1-56-62

PARAMETERS ESTIMATES ON SAMPLES WITH CONTAMINATION

Keyu Yao

Southern Federal University, Rostov-on-Don, Russia iao@sfedu.ru


Keywords: confidence probability, sample with contamination, MCD algorithm

For citation: Yao Keyu. Parameters Estimates on Samples with Contamination. Bulletin of Higher Educational Institutions. North Caucasus Region. Natural Science. 2024;(1):56-62. (In Russ.).

This is an open access article distributed under the terms of Creative Commons Attribution 4.0 International License (CC-BY 4.0).


© Yao Keyu, 2024


Introduction

Let us assume that a confidence probability $\alpha$ and a sample $V = \{p_i\}$ are given. In this case the subsample used to estimate the parameters will contain $L = [\alpha |V|] + 1$ elements of the sample, where $[\cdot]$ denotes the integer part. The choice of the parameter $L$ and of the subsample plays a key role. The paper considers various contamination models in which the main generator is a normal law with known parameters and, using several examples, establishes a connection between $L$ and the estimates of the parameters of the normal law obtained from a contaminated sample.
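As a small illustration of the subsample size rule, the sketch below computes the number of retained elements from the confidence probability and the sample size. The function name `subsample_size` is hypothetical, and the square brackets are read as the integer part, which is an assumption that agrees with the values of L listed in the tables (151, 211 and 271 for a sample of 300 elements).

```python
import math

def subsample_size(alpha: float, n: int) -> int:
    """Return L = [alpha * n] + 1, reading [.] as the integer part (assumption)."""
    # round() guards against binary floating-point error in alpha * n,
    # e.g. a product that comes out marginally below an integer.
    return math.floor(round(alpha * n, 9)) + 1

for alpha in (0.5, 0.7, 0.9):
    print(alpha, subsample_size(alpha, 300))
```

For a sample of 300 elements this prints 151, 211 and 271 for the three confidence probabilities.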

One-dimensional MCD algorithm

The problem is to choose a subsample of a given size on which the sample variance is minimal.

Algorithm

Step 0. Choose an arbitrary initial subsample $H$ of size $L$.

Step t. Compute the sample mean $m$ and the sample variance $\sigma^2$ of $H$. Sort the elements of the sample $V$ in ascending order of the values $(p_i - m)^2$. Construct the subsample $H_1$ consisting of the first $L$ elements of the sorted sample $V$. Compute the sample mean $m_1$ and the sample variance $\sigma_1^2$ of $H_1$.

Stopping condition. If $\sigma_1^2 < \sigma^2$, then set $H := H_1$ and return to step t; otherwise stop.

Theorem 1. The algorithm is monotone and converges in a finite number of steps to a local minimum of the objective function.
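The steps of the one-dimensional algorithm can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function name `mcd_1d`, the random choice of the initial subsample, the use of NumPy, and the 1/L normalisation of the sample variance are all assumptions. The subsample that is kept consists of the L elements closest to the current mean.

```python
import numpy as np

def mcd_1d(sample, alpha, seed=None):
    """One-dimensional MCD-style iteration (illustrative sketch)."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    # Subsample size L = [alpha * n] + 1; round() absorbs float error.
    L = int(np.floor(round(alpha * n, 9))) + 1

    # Step 0: arbitrary (here random) initial subsample of size L.
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=L, replace=False)
    m, var = x[idx].mean(), x[idx].var()  # var() uses the 1/L convention

    while True:
        # Step t: keep the L elements closest to the current mean.
        order = np.argsort((x - m) ** 2, kind="stable")
        new_idx = order[:L]
        m1, var1 = x[new_idx].mean(), x[new_idx].var()
        # Stopping condition: continue only while the variance decreases.
        if var1 < var:
            idx, m, var = new_idx, m1, var1
        else:
            break
    return m, var, np.sort(idx)
```

Because the variance strictly decreases on every accepted step and there are finitely many subsamples of size L, the loop terminates, which is the behaviour stated in Theorem 1.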

Example 1. Let us consider a sample $V$ of size $|V| = 300$ consisting of normal random variables with given variance $\sigma^2 = 1$ and mean $m = 0$. The sample is contaminated with independent random variables that have a uniform distribution on $[-1, 1]$. We calculate the sample mean and the sample variance (Table 1).

I. Contamination is 10 %. The sample mean is 0.080 and the sample variance is 0.935.

II. Contamination is 15 %. The sample mean is 0.103 and the sample variance is 0.904.

III. Contamination is 20 %. The sample mean is 0.140 and the sample variance is 0.871.
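The sample variances listed in cases I to III can be compared with a simple back-of-the-envelope value. The sketch below assumes that the contaminated sample behaves like a two-component mixture: a share of the elements equal to the contamination level comes from the uniform law on [-1, 1], and the rest comes from the standard normal law. Since both components have zero mean, the mixture variance is the weighted sum of the component variances. The function name `mixture_variance` and its parameters are hypothetical.

```python
def mixture_variance(eps, var_main=1.0, half_width=1.0):
    """Variance of a mixture of N(0, var_main) and U[-half_width, half_width].

    Both components have zero mean, so the mixture variance is the
    weighted sum of the component variances.
    """
    var_uniform = half_width ** 2 / 3.0  # variance of U[-a, a] is a^2 / 3
    return (1.0 - eps) * var_main + eps * var_uniform

for eps in (0.10, 0.15, 0.20):
    print(f"{eps:.2f} -> {mixture_variance(eps):.3f}")
```

Under this assumption the values are about 0.933, 0.900 and 0.867, which is close to the sample variances 0.935, 0.904 and 0.871 reported in cases I to III.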

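From Table 1 we can conclude that $m$ and $\sigma^2$ tend to the true parameter values when the confidence probability is close to the contamination level. In Table 1 the estimates are closest to the true values at $\alpha = 0.9$ for all three contamination levels.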


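Table 1

Dependence of $m$ and $\sigma^2$ on $\alpha$ (each cell gives $m$; $\sigma^2$)

| $\alpha$ | I (contamination 10 %) | II (contamination 15 %) | III (contamination 20 %) |
|---|---|---|---|
| 0.5 ($L = 151$) | 0.071; 1.003 | 0.073; 0.951 | 0.135; 1.032 |
| 0.7 ($L = 211$) | 0.012; 0.996 | 0.023; 0.959 | 0.129; 1.022 |
| 0.9 ($L = 271$) | 0.004; 1.001 | 0.007; 1.011 | 0.021; 0.983 |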
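A contaminated sample of the kind used in Example 1 could be generated along the following lines. The paper does not say how the contaminated elements are placed within the sample, so this sketch assumes that a fixed share of the elements is drawn from the uniform law on [-1, 1], the rest from the standard normal law, and that the order is shuffled. The function name `contaminated_sample` and the seed handling are hypothetical.

```python
import numpy as np

def contaminated_sample(n, eps, seed=None):
    """Return (sample, is_noise) for a 1-D contaminated sample (sketch).

    n   : total sample size
    eps : contamination level, e.g. 0.10 for 10 %
    """
    rng = np.random.default_rng(seed)
    n_bad = int(round(eps * n))  # number of contaminating elements
    main = rng.normal(loc=0.0, scale=1.0, size=n - n_bad)  # main generator
    noise = rng.uniform(-1.0, 1.0, size=n_bad)             # contamination
    sample = np.concatenate([main, noise])
    is_noise = np.concatenate(
        [np.zeros(n - n_bad, dtype=bool), np.ones(n_bad, dtype=bool)]
    )
    perm = rng.permutation(n)  # shuffle so the order carries no information
    return sample[perm], is_noise[perm]

x, mask = contaminated_sample(300, 0.10, seed=1)
print(x.mean(), x.var())  # plain (non-robust) estimates on the whole sample
```

The Boolean mask is returned only so that the contaminating elements can be marked in a plot or excluded in a check. An estimator would see the sample alone.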

Multidimensional MCD algorithm

The problem is to choose a subsample of a given size on which the determinant of the sample covariance matrix is minimal. An exact solution of this problem requires an exhaustive search over all subsamples, so the MCD algorithm is used [1-4].

Let us assume that a confidence probability $\alpha$ and a sample $V = \{p_i\}$ consisting of $M$-dimensional vectors are given.

Let $H_1$ be a subsample of the sample $V$ with $|H_1| = L$. Let us calculate the sample vector of means $m(H_1)$ and the sample covariance matrix $C(H_1)$ for this subsample. Let us form an ordered permutation $\pi$:

$$\pi(i) \prec \pi(j) \Leftrightarrow \left(C^{-1}(H_1)(p_{\pi(i)} - m(H_1)),\; p_{\pi(i)} - m(H_1)\right) < \left(C^{-1}(H_1)(p_{\pi(j)} - m(H_1)),\; p_{\pi(j)} - m(H_1)\right),$$

that is, the elements of $V$ are ordered by increasing value of the quadratic form.

Based on the subsample $H_1$, a subsample $H_2 = \{p_{\pi(i)} : i = 1, \ldots, L\}$ is formed.

MCD algorithm

1. Select an initial subsample $H_1$ with $|H_1| = L$.

2. Calculate the sample vector of means $m(H_1)$ and the sample covariance matrix $C(H_1)$. Find the ordered permutation $\pi$.

3. Select the subsample $H_2$.

4. If $\det C(H_1) > \det C(H_2)$, then set $H_1 := H_2$ and go to step 2; otherwise stop.

As a result we obtain a subsample $H_1$ with $|H_1| = L$, a sample vector of means $m(H_1)$ and a sample covariance matrix $C(H_1)$.
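The four steps above can be sketched in Python with NumPy. This is an illustrative reading of the steps, not the author's code. The function name `mcd_c_steps`, the random initial subsample, and the use of `np.cov` (with its 1/(L-1) normalisation) for the sample covariance matrix are assumptions, and the covariance matrix of every subsample is assumed to be nonsingular.

```python
import numpy as np

def mcd_c_steps(points, alpha, seed=None):
    """Multidimensional MCD-style iteration (illustrative sketch)."""
    X = np.asarray(points, dtype=float)  # shape (n, M)
    n = X.shape[0]
    L = int(np.floor(round(alpha * n, 9))) + 1

    # Step 1: arbitrary (here random) initial subsample of size L.
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=L, replace=False)
    mean = X[idx].mean(axis=0)
    cov = np.cov(X[idx], rowvar=False)
    det = np.linalg.det(cov)

    while True:
        # Step 2: quadratic form (C^-1 (p - m), p - m) for every element.
        diff = X - mean
        d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        # Step 3: the L elements with the smallest values form H2.
        new_idx = np.argsort(d2, kind="stable")[:L]
        new_mean = X[new_idx].mean(axis=0)
        new_cov = np.cov(X[new_idx], rowvar=False)
        new_det = np.linalg.det(new_cov)
        # Step 4: continue only while the determinant decreases.
        if new_det < det:
            idx, mean, cov, det = new_idx, new_mean, new_cov, new_det
        else:
            break
    return mean, cov, np.sort(idx)
```

The result depends on the initial subsample, since the iteration stops at the first subsample that a further step does not improve. In practice several random starts are often tried and the subsample with the smallest determinant is kept, as in the fast algorithm of [1].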

Example 2. Consider a sample $V$ of size $|V| = 300$ consisting of two-dimensional normal vectors with a given covariance matrix

$$C = \begin{pmatrix} 0.10 & 0.05 \\ 0.05 & 0.20 \end{pmatrix}$$

and a given vector of mean values

$$m = \begin{pmatrix} 0.20 \\ 0.15 \end{pmatrix}.$$

The sample is contaminated with two-dimensional vectors with independent components uniformly distributed on a segment. Below we use the Euclidean distance for vectors and the Frobenius distance for matrices.
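The two distances used in this example can be computed with NumPy as sketched below. The function names are hypothetical, and the estimate passed in the usage lines is a made-up illustration, not a value from the paper.

```python
import numpy as np

def euclidean_distance(m_true, m_est):
    """Euclidean distance between two mean vectors."""
    diff = np.asarray(m_true, dtype=float) - np.asarray(m_est, dtype=float)
    return float(np.linalg.norm(diff))

def frobenius_distance(c_true, c_est):
    """Frobenius distance between two covariance matrices."""
    diff = np.asarray(c_true, dtype=float) - np.asarray(c_est, dtype=float)
    return float(np.linalg.norm(diff, ord="fro"))

# True parameters from Example 2.
m = [0.20, 0.15]
C = [[0.10, 0.05], [0.05, 0.20]]

# Hypothetical estimates, used only to show the call pattern.
m_hat = [0.22, 0.14]
C_hat = [[0.11, 0.05], [0.05, 0.19]]
print(euclidean_distance(m, m_hat), frobenius_distance(C, C_hat))
```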

I. Contamination is 10 %. The distance between $m$ and the sample mean $\hat m$ is $d_e(m, \hat m) = 0.075$; the distance between the matrix $C$ and the sample covariance matrix $\hat C$ is $d_f(C, \hat C) = 0.081$.

II. Contamination is 15 %. The distance between $m$ and the sample mean $\hat m$ is $d_e(m, \hat m) = 0.085$; the distance between the matrix $C$ and the sample covariance matrix $\hat C$ is $d_f(C, \hat C) = 0.091$.

III. Contamination is 20 %. The distance between $m$ and the sample mean $\hat m$ is $d_e(m, \hat m) = 0.102$; the distance between the matrix $C$ and the sample covariance matrix $\hat C$ is $d_f(C, \hat C) = 0.108$.

The results are presented in Table 2.

From Table 2 we can conclude that the Euclidean and Frobenius distances between the true and estimated values of the parameters decrease when the confidence probability is close to the contamination level. In Table 2 both distances are smallest at $\alpha = 0.9$ for all three contamination levels.

Figures 1 and 2 illustrate the operation of the MCD algorithm for the sample contaminated by 10 %, for $\alpha = 0.5$, $\alpha = 0.7$ and $\alpha = 0.9$.

In Fig. 1 the contaminating sample elements are marked in red, and the original sample elements are marked in black.

Table 2

Dependence of distances on $\alpha$ (each cell gives $d_e$; $d_f$)

| $\alpha$ | I (contamination 10 %) | II (contamination 15 %) | III (contamination 20 %) |
|---|---|---|---|
| 0.5 ($L = 151$) | 0.051; 0.037 | 0.078; 0.057 | 0.084; 0.067 |
| 0.7 ($L = 211$) | 0.048; 0.011 | 0.056; 0.043 | 0.065; 0.053 |
| 0.9 ($L = 271$) | 0.016; 0.009 | 0.023; 0.013 | 0.037; 0.026 |

Fig. 1. Illustration of the sample contaminated by 10 %

Fig. 2. Illustration of the MCD algorithm (the sample is 10 % contaminated): a: $\alpha = 0.5$; b: $\alpha = 0.7$; c: $\alpha = 0.9$

Conclusion

In the paper, examples of robust estimation [5] of parameters on contaminated samples are presented. The main algorithm used in this work is the MCD algorithm. Other robust estimation methods can be found in [6, 7]. The MCD algorithm has shown stability with respect to the level of sample contamination and can be recommended as a means of solving problems of this kind. The tables above show the dependence of the error on the confidence level. The figures illustrate the operation of the MCD algorithm.

References

1. Rousseeuw P., Van Driessen K. A fast algorithm for the minimum covariance determinant estimator. Technometrics. 1999;41:212-223.

2. Hubert M., Debruyne M., Rousseeuw P.J. Minimum covariance determinant and extensions. arXiv. 2017:1709.07045v [stat.ME].

3. Boudt K., Rousseeuw P., Vanduffel S., Verdonck T. The minimum regularized covariance determinant estimator. Statistics and Computing. 2019;30:113-128.

4. Sun P., Freund R. Computation of minimum-volume covering ellipsoids. Operations Research. 2004;52(5):690-706.

5. Hampel F. A general qualitative definition of robustness. Annals of Mathematical Statistics. 1971;42:1887-1896.

6. Beliavsky G., Danilova N., Logunov A. Robust estimation of European and Asian options. Springer Proceedings in Mathematics and Statistics. 2021;357:101-117.

7. Danilova N., Yao K. The minimal ellipsoid and robust methods in the optimal portfolio problem. Engineering Letters. 2022;30(4):1465-1469.


Information about the author

K. Yao - Postgraduate Student, High Mathematics and Operations Research Department, Vorovich Institute of Mathematics, Mechanics and Computer Sciences.


The article was submitted 25.07.2023; approved after reviewing 15.08.2023; accepted for publication 19.02.2024.
