UDC: 681.518.3

METHODS OF ANALYSIS OF RANDOM DATA AND THEIR ALGORITHMIZATION

Tashpolatova B.B., student

Financial university under the Government of the Russian Federation, Moscow, Russia

tashpolatova_barno@mail.ru

Research supervisor: Daneev O.V., PhD in Economics, Associate Professor, Department of Data Analysis, Decision Making and Financial Technologies, Financial University under the Government of the Russian Federation, Moscow, Russia


Abstract. The development of information technology is accompanied by a simultaneous increase in the number and variety of methods used for information processing. Probabilistic methods are among the most effective tools for the analysis of information and information systems. They make it possible to adequately describe many informational, physical, and technological processes.

Keywords: probability, approach, information, standardization, optimization.

In the modern world, the flow and volume of information generated by science, which has become a direct productive force, is constantly growing. Characteristic features of the informatization of modern society are the following:

1) information has become an important resource of production and has reduced the need for material and labor resources;

2) information technology has brought new industries to life;

3) information has become a commodity;

4) information provides additional value to other resources, such as labor.

The reason for this is the chaotic behavior of many natural objects and technical systems. Probabilistic methods allow us to determine with sufficient accuracy to what extent the desired value will change, or with what probability we can expect a particular event. Probabilistic and statistical methods are successfully used wherever it is possible to build and justify a probabilistic model of the studied process or phenomenon.

DATA PROCESSING AS AN INFORMATION PROCESS

If we consider the production of an information product, we will see how the original information resource, in accordance with the task, undergoes various transformations in a certain sequence. The ongoing information processes reflect the dynamics of these transformations. Thus, it can be concluded that the information process is the process of converting information.

It can be said that the processing of information consists in obtaining some information objects from other information objects by performing certain algorithms; it is thus one of the main operations carried out on information and, accordingly, the main means of increasing its volume and diversity.

There are two types of information processing: numeric and non-numeric. Numerical processing uses objects such as variables, vectors, matrices, multidimensional arrays, constants, etc.

In the case of non-numeric processing, the objects can be files, records, fields, hierarchies, networks, relationships, etc. In what follows, we consider only the data processing stage, as well as the use of probabilistic methods of information processing at this stage.

STATISTICAL ANALYSIS OF DATA

There are two areas of statistical data processing.

The first includes methods of mathematical statistics, which provide for the possibility of probabilistic interpretation of the analyzed data and statistical conclusions.

The second direction combines statistical methods that do not initially rely on the probabilistic nature of the processed data. The second approach is applied when the conditions of the initial data collection do not fit into the scheme of a statistical ensemble, i.e. when there is no practical, or even fundamental, possibility of repeatedly reproducing in identical form the basic set of conditions under which the analyzed data were measured.

Consider the main stages of data processing and briefly describe each of them. To do this, we present a general logical scheme of statistical data analysis in the form of stages that can be implemented, including in a mode of iterative interaction.

At the first stage, a preliminary analysis of the system under study is carried out. At this stage, the following are determined at a non-formalized, substantive level: the main goals of the study; the set of units (objects) representing the subject of the statistical study; the set of parameters (features) used to describe the objects under study; the degree of formalization of the relevant records in the collection of data; and the formalized statement of the problem.

At the second stage, a plan for the collection of primary (baseline) information is developed. When drawing up a detailed plan for the collection of primary information, the full analysis scheme is taken into account. At this stage, it is determined what the sample should be; the scope and duration of the study; and, where possible, the scheme of an active experiment involving methods of experiment planning and regression analysis to determine some of the input variables.

The third stage involves the collection of initial data, their preparation and introduction into the computer for processing.

There are two ways to represent the source data:

an object-feature matrix: $x_i^{(k)}(t)$ — the values of the $k$-th feature characterizing the $i$-th object at time $t$, where $t = t_1, \dots, t_N$, $k = 1, \dots, K$, $i = 1, \dots, N$;

an object-object matrix: the characteristics of pairwise proximity of the $i$-th and $j$-th objects (or features) at time $t$.
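For illustration, a minimal numpy sketch of the two representations; the data values and the use of Euclidean distance as the proximity measure are hypothetical choices, not prescribed by the text:

```python
import numpy as np

# Hypothetical object-feature matrix for one time moment t:
# rows are objects (i = 1..4), columns are features (k = 1..3).
X = np.array([
    [1.2, 3.4, 0.5],
    [0.9, 2.8, 0.7],
    [1.5, 3.9, 0.4],
    [2.1, 1.0, 1.8],
])

# Corresponding object-object matrix: pairwise proximity of the i-th
# and j-th objects, here measured by Euclidean distance.
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
proximity = np.sqrt((diff ** 2).sum(axis=2))
print(proximity.round(2))
```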

The fourth stage is the initial (primary) statistical processing of the data. At this stage, the following problems are solved: mapping verbal variables onto a nominal or ordinal scale; statistical description of the initial sets with the determination of the limits of variation of the variables; analysis of sharply deviating observations (outliers); recovery of missing observation values; checking the statistical independence of the sequence of observations that make up the initial data array; unification of variable types; and experimental analysis of the distribution law of the studied population with parametrization of information about the nature of the studied distributions (this kind of primary statistical processing is sometimes called compiling a summary and grouping).
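A minimal sketch, on purely illustrative data, of two of these primary-processing operations (recovery of a missing value and flagging of outliers); the median-based rules are one possible choice, not prescribed by the article:

```python
import numpy as np

# Hypothetical raw observations with a missing value (np.nan)
# and one sharply deviating observation (12.7).
x = np.array([4.8, 5.1, np.nan, 4.9, 5.3, 12.7, 5.0])

# Recover the missing value with the median of the observed data
# (one simple imputation choice).
x_filled = np.where(np.isnan(x), np.nanmedian(x), x)

# Flag outliers with a robust rule: more than 3 scaled median
# absolute deviations away from the median.
med = np.median(x_filled)
mad = 1.4826 * np.median(np.abs(x_filled - med))
outliers = np.abs(x_filled - med) > 3 * mad
print(x_filled, outliers)
```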

At the fifth stage, the basic methods and algorithms of statistical data processing are chosen and a detailed plan for the computational analysis of the material is prepared. The thesaurus of meaningful concepts is replenished and refined. A block diagram of the analysis indicating the methods involved is drawn up.

At the sixth stage, the plan of computational analysis of the initial data is directly implemented.

At the seventh stage, a formal report on the study is prepared. The results of the statistical procedures (parameter estimation, hypothesis testing, mapping to a space of smaller dimension, classification) are interpreted. Simulation modeling methods can be used in the interpretation.

THE MAIN TYPES OF DEPENDENCIES BETWEEN QUANTITATIVE RANDOM VARIABLES

By the type of dependence between quantitative random variables we mean not the analytical form of the function $Y_{\mathrm{mean}}(X) = f(X, \theta)$, but the nature of the analyzed variables $(X, \eta)$ and, hence, the interpretation of the function $f(X, \theta)$.

Most often, two types of dependence are considered - regression and correlation.

In the first case, the regression dependence of a random resulting indicator $\eta$ on non-random predictive variables $X$ is considered.

At the same time, the analyzed connections can be dual in nature:

a) the measurements of the indicator $\eta$ are made with errors, while the non-random variable $X$ is measured without errors;

b) the indicator $\eta$ depends not only on $X$, so that for each fixed value $X = X^*$ the values $\eta(X^*)$ are subject to scatter.

In this instance, $X$ plays the role of a parameter on which the distribution of $\eta$ depends.

In mathematical form, this case is represented as follows:

$$\eta(X) = f(X) + \varepsilon(X), \qquad Y_{\mathrm{mean}}(X) = M[\eta(X)] = f(X), \qquad M[\varepsilon(X)] = 0.$$

We assume that the nature of the deviation $\varepsilon(X)$ and its distribution characteristics are not related to the structure of the function $f(X)$.

In the second case, the correlation and regression dependence between the random vectors $\eta$ (the resulting indicator) and $\xi$ (the explanatory variables) is considered.

It is assumed that in this case the components of the vectors $\eta$ and $\xi$ depend on many factors that cannot be controlled, i.e. these variables are random.

Represent $\eta$ in the form

$$\eta = f(\xi) + \varepsilon,$$

where $\varepsilon$ is the residual influence of unaccounted factors, with

$$M[\varepsilon^{(k)}] = 0, \qquad D[\varepsilon^{(k)}] = \sigma_k^2 < \infty, \qquad \mathrm{cov}\bigl(f^{(k)}(\xi), \varepsilon^{(k)}\bigr) = 0.$$

For the special case $m = 1$ and $f(\xi)$ a linear function, we have:

$$\eta = \theta_0 + \sum_{k=1}^{p} \theta_k \xi^{(k)} + \varepsilon, \qquad Y_{\mathrm{mean}}(x) = \theta_0 + \sum_{k=1}^{p} \theta_k x^{(k)}.$$
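As an illustration, a minimal numpy sketch of this linear model with an empirical check of the stated assumptions; the coefficient values, the dimension $p = 2$, and the noise level are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coefficients theta_0, theta_1, ..., theta_p with p = 2.
theta0, theta = 1.0, np.array([2.0, -0.5])

# Random explanatory vector xi and residual eps with zero mean.
n = 10_000
xi = rng.normal(size=(n, 2))
eps = rng.normal(scale=0.3, size=n)
eta = theta0 + xi @ theta + eps

# Empirical check of the assumptions M[eps] = 0 and cov(f(xi), eps) = 0,
# using the residual eta - f(xi).
f_xi = theta0 + xi @ theta
resid = eta - f_xi
print(resid.mean())                # close to 0
print(np.cov(f_xi, resid)[0, 1])   # close to 0
```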

If $\varepsilon = 0$, then the random variables are bound by a purely functional dependence $\eta = f(\xi)$; however, it should be distinguished from the functional dependence of non-random variables.

METHODS OF ANALYSIS OF RANDOM DATA

Consider the basic probabilistic methods of data analysis. The most common of them is the analysis of variance. There are several possible implementations of analysis of variance. Given the number of factors and the number of samples available from the population, a relatively simple option could be chosen.

For example, single-factor (one-way) analysis of variance can be used to test the hypothesis that the mean values of two or more samples belonging to the same population are equal. This method generalizes tests for two means (such as the t-test).

Two-factor variance analysis with repetitions is a complicated version of a single-factor analysis with multiple samples for each data group.

Two-factor variance analysis without repetition is a two-factor variance analysis that includes no more than one sample per group. This method of analysis can be used to test the hypothesis that the mean values of two or more samples are equal, that is, to confirm that the samples in question belong to the same population.
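A minimal sketch of the simplest of these variants, single-factor (one-way) analysis of variance, assuming scipy is available and using simulated samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Three hypothetical samples assumed to be drawn from the same population.
a = rng.normal(loc=10.0, scale=2.0, size=30)
b = rng.normal(loc=10.0, scale=2.0, size=30)
c = rng.normal(loc=10.0, scale=2.0, size=30)

# One-way ANOVA: tests the hypothesis that all group means are equal.
f_stat, p_value = stats.f_oneway(a, b, c)
print(f_stat, p_value)  # a large p-value gives no evidence against equal means
```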

Correlation analysis is a powerful mathematical apparatus for quantitatively determining the relationship between two data sets presented in dimensionless form. The sample correlation coefficient is the ratio of the covariance of two data sets to the product of their standard deviations.

This method makes it possible, for example, to establish whether two data sets are associated in magnitude: whether large values from one data set are associated with large values from the other (positive correlation), or, vice versa, small values of one data set are associated with large values of the other (negative correlation), or the data from the two ranges are not related at all (correlation close to zero).

Covariance analysis consists in determining the covariance, which is a measure of the relationship between two data ranges. The method can be used to calculate the average product of the deviations of data points from their respective means. It makes it possible to determine whether two data sets are associated in magnitude, that is, whether large values from one data set are related to large values of the other (positive covariance), or, vice versa, small values of one data set are related to large values of the other (negative covariance), or the data from the two ranges are not related in any way (covariance close to zero).
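A minimal sketch, on hypothetical data, of how the covariance and the sample correlation coefficient defined above are computed with numpy:

```python
import numpy as np

# Two hypothetical data sets of equal length.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Covariance: average product of deviations from the respective means.
cov_xy = np.cov(x, y, ddof=1)[0, 1]

# Sample correlation coefficient: covariance divided by the product
# of the standard deviations (dimensionless, between -1 and 1).
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
print(cov_xy, r, np.corrcoef(x, y)[0, 1])  # r matches np.corrcoef
```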

The two-sample F-test for variances is used to compare the variances of two general populations. For example, an F-test can be used to identify differences in the variances of a time characteristic calculated from two samples.
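A minimal sketch of a two-sample F-test for variances on simulated data, assuming scipy; the two-sided p-value is obtained from the F distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Two hypothetical samples whose variances are to be compared.
x = rng.normal(scale=1.0, size=25)
y = rng.normal(scale=1.5, size=30)

# F statistic: ratio of the sample variances, with (n_x - 1, n_y - 1)
# degrees of freedom; two-sided p-value from the F distribution.
f_stat = x.var(ddof=1) / y.var(ddof=1)
dfn, dfd = len(x) - 1, len(y) - 1
p_value = 2 * min(stats.f.sf(f_stat, dfn, dfd), stats.f.cdf(f_stat, dfn, dfd))
print(f_stat, p_value)
```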

Fourier analysis, based on the fast Fourier transform (FFT) algorithm, is used for solving problems in linear systems and for the analysis of periodic data sets.
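A minimal sketch of Fourier analysis of a periodic data set via the FFT, assuming numpy and a hypothetical 5 Hz signal:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical periodic signal: a 5 Hz sine sampled at 100 Hz with noise.
fs = 100.0
t = np.arange(0, 2.0, 1.0 / fs)
signal = np.sin(2 * np.pi * 5.0 * t) + 0.2 * rng.normal(size=t.size)

# Amplitude spectrum via the fast Fourier transform.
spectrum = np.abs(np.fft.rfft(signal)) / t.size
freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)

# The dominant non-zero frequency should be close to 5 Hz.
print(freqs[1:][np.argmax(spectrum[1:])])
```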

Linear regression analysis consists in constructing a least-squares line describing a set of observations. The regression analysis apparatus is used, in particular, to analyze the impact of one or more independent variables on a single dependent variable.
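A minimal least-squares sketch, assuming numpy and hypothetical data with two independent variables:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical observations: one dependent variable y influenced by
# two independent variables plus random noise.
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Least-squares fit: prepend a column of ones for the intercept and solve.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # estimates of the intercept and the two slopes
```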

The t-test can be used to test means for different types of general populations.

Student's two-sample t-test tests the hypothesis that two samples have equal means.
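A minimal sketch of Student's two-sample t-test on simulated samples, assuming scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Two hypothetical samples; the null hypothesis is that their means are equal.
a = rng.normal(loc=5.0, scale=1.0, size=40)
b = rng.normal(loc=5.2, scale=1.0, size=35)

# Student's two-sample t-test (equal variances assumed by default).
t_stat, p_value = stats.ttest_ind(a, b)
print(t_stat, p_value)
```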

The two-sample z-test for means with known variances can be used to test the hypothesis about the difference between the means of two general populations.
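A minimal sketch of a two-sample z-test for means with known variances; the data and the variance values are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical samples from populations whose variances are assumed known.
x = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 10.2])
y = np.array([10.6, 10.9, 10.5, 10.8, 10.7])
var_x, var_y = 0.09, 0.04  # assumed known population variances

# Two-sample z statistic for the difference of the means and its
# two-sided p-value from the standard normal distribution.
z = (x.mean() - y.mean()) / np.sqrt(var_x / x.size + var_y / y.size)
p_value = 2 * stats.norm.sf(abs(z))
print(z, p_value)
```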

Algorithms based on the above methods are included in most standard mathematical packages designed for computer processing of random data.

PROBABILISTIC MODELS OF OPEN INFORMATION SYSTEMS

Probabilistic methods are used not only in the processing of random data. So, the theory of cryptography is based on the use of probabilistic models of open systems.

Probabilistic models of open information systems are used to solve cryptography problems. The system is considered as a source of random sequences. Let us assume that the system generates text of finite or infinite length in a given alphabet $A_X$. In this case, we can assume that the source generates a finite or infinite sequence of random letters $x_0, x_1, x_2, \dots, x_i, \dots$ taking values in $A_X$. Define the probability of a random message $a_0, a_1, a_2, \dots, a_{n-1}$ as the probability of the sequence of events:

$$P(a_0, a_1, \dots, a_{n-1}) = P(x_0 = a_0, x_1 = a_1, \dots, x_{n-1} = a_{n-1}).$$

A set of random texts forms a probabilistic space if the following conditions are met:

1) $P(a_0, a_1, a_2, \dots, a_{n-1}) \ge 0$ for any random message $a_0, a_1, a_2, \dots, a_{n-1}$;

2) $\sum_{(a_0, a_1, a_2, \dots, a_{n-1})} P(a_0, a_1, a_2, \dots, a_{n-1}) = 1$;

3) for any random message $a_0, a_1, a_2, \dots, a_{n-1}$ and any $s > n$,

$$P(a_0, a_1, a_2, \dots, a_{n-1}) = \sum_{(a_n, \dots, a_{s-1})} P(a_0, a_1, a_2, \dots, a_{s-1}),$$

that is, the probability of a text of length $n$ is the sum of the probabilities of all its extensions to length $s$.

The text generated by such an open information system is a probabilistic analogue of the language.

By specifying a certain probability distribution on the set of open texts, we define the corresponding model of an open information system.
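A minimal sketch of one such model, a memoryless source over a hypothetical three-letter alphabet, with a numerical check of the consistency condition given above:

```python
import numpy as np
from itertools import product

# Hypothetical memoryless source over a three-letter alphabet A_X:
# the probability of a message is the product of its letter probabilities.
alphabet = ["a", "b", "c"]
p = {"a": 0.5, "b": 0.3, "c": 0.2}

def message_prob(msg):
    return float(np.prod([p[ch] for ch in msg]))

# Consistency check: the probability of a message of length n equals the
# sum of the probabilities of all its extensions to length s > n.
msg = ("a", "b")                                                 # n = 2
extensions = [msg + ext for ext in product(alphabet, repeat=2)]  # s = 4
print(message_prob(msg), sum(message_prob(m) for m in extensions))
```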

A distinction is made between stationary and non-stationary models of open information systems. For stationary models, it is typical that the probability of a letter ($k$-gram) does not depend on its place in the open text.

The considered model is convenient for practical use; at the same time, some properties of the model contradict the properties of the language. In particular, according to this model any $k$-gram has a nonzero probability of occurrence. This does not allow the model to be used for the decryption of a wide class of cryptosystems.

The probabilistic nature of the processes that take place in the world around us has led researchers to take an interest in probabilistic and statistical methods of analysis. Probabilistic methods make it possible to build a relatively simple model of a random process or phenomenon. The mathematical apparatus of these methods is well developed; moreover, algorithms based on probabilistic methods are included in most packages for mathematical data processing on computers. In this paper, two aspects of the use of probabilistic methods in information processing were considered. In the first case, statistical methods of processing random data were considered. The regression model and the correlation model were considered as possible models of the relationship between random variables. A description of the main statistical methods for describing random processes was given.

In the second case, the application of probabilistic models in the theory of cryptography was considered.

It is shown that the use of probabilistic and statistical methods gives good results both in the study of real processes and systems and in the design of information technologies, such as the creation of protected information channels. Processing of experimental data is carried out in order to extract useful information from them for the development and adoption of management decisions. Any statistical data processing is a transformation of the data into an easy-to-use form, or a translation of the answers of the studied system from the language of measurements into the language of a refined model. All methods of probabilistic information processing can be divided into three large groups. The first group includes the simplest calculations of mean values of a random variable over the observation period. The second group includes methods for calculating the sample characteristics of a random variable; these are needed to estimate the accuracy and reliability of the calculated means. The third group includes methods for determining the probability distribution of random variables.
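A minimal sketch, on simulated observations, of one representative computation from each of the three groups (the sample mean; a confidence interval for the mean as an accuracy and reliability estimate; a histogram estimate of the distribution), assuming numpy and scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(loc=3.0, scale=1.5, size=500)  # hypothetical observations

# Group 1: the mean value of the random variable over the observation period.
mean = x.mean()

# Group 2: sample characteristics with an accuracy/reliability estimate,
# here a 95% confidence interval for the mean based on the t distribution.
sem = x.std(ddof=1) / np.sqrt(x.size)
ci = stats.t.interval(0.95, df=x.size - 1, loc=mean, scale=sem)

# Group 3: an estimate of the probability distribution (histogram density).
density, edges = np.histogram(x, bins=20, density=True)
print(mean, ci, density[:3])
```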

Probabilistic methods of information processing are widespread in various spheres of life. Such methods have acquired particular importance in the economy, where the flow of information grows every day, hour, and minute, which inevitably creates an urgent need for data analysis and processing. In a world with such a high proportion of uncertainty, these methods bring a person at least a little closer to understanding certain facts and reduce errors in information processing to a possible minimum. This is where probabilistic methods of information processing have shown their particular value.


