James-Stein shrinkage estimator of Shannon entropy in wavelet-filtration systems of complex data

Scientific article in the specialty "Medical technologies"

CC BY
Keywords
SHANNON ENTROPY / JAMES-STEIN SHRINKAGE ESTIMATOR / WAVELET FILTERING / GENE EXPRESSION SEQUENCE / THRESHOLDING / FILTERING

Abstract of the scientific article on medical technologies. Authors: Babichev S.A., Lurie I.A., Voronenko M.A.

The paper presents a wavelet-filtering technology for complex data in which Shannon entropy, calculated with the James-Stein shrinkage estimator, is used as the criterion for evaluating the quality of information processing. A gene expression sequence obtained by microchip experiments was used as experimental data. An algorithm for wavelet filtering of the studied data has been developed, in which the level of wavelet decomposition and the type of wavelet are determined by the maximum Shannon entropy of the noise component removed from the data, and the thresholding coefficient value is determined by the minimum entropy of the filtered data.



UDK 004.048

S.A. BABICHEV

Jan Evangelista Purkyně University in Ústí nad Labem, Czech Republic

I.A. LURIE, M.A. VORONENKO

Kherson National Technical University, Ukraine

JAMES-STEIN SHRINKAGE ESTIMATOR OF SHANNON ENTROPY IN WAVELET-FILTRATION SYSTEMS OF COMPLEX DATA


Keywords: Shannon entropy, James-Stein shrinkage estimator, wavelet filtering, gene expression sequence, filtering, thresholding


Problem statement

Building models of gene regulatory networks from gene expression sequences obtained by DNA microchip experiments or by RNA sequencing methods is one of the topical directions of modern bioinformatics. The accuracy of the resulting model is determined by the quality of the experimental data preprocessing, one step of which is the filtration of gene expression sequences obtained by DNA microchip experiments. The scanning of DNA microchip data is accompanied by background noise. The noise component is partially corrected by background correction at the stage of gene expression estimation. However, it is currently not possible to remove this noise component completely while the gene expression array is being created. Hence there is a need to develop a technology for filtering complex high-dimensional data based on modern computational methods of information processing and estimation.

Analysis of recent research and publications

Papers [1,2] are devoted to the visualisation, estimation and preprocessing of high-dimensional data. The authors review common techniques for exploring and visualising high-dimensional data, using gene expression sequences and mass spectrometry protein data as examples. A technology for reducing the number of uninformative features in a gene expression array, based on statistical criteria for estimating the studied data, is presented in [3]. This technology reduces the dimension of the feature space by 5 to 10%, but it does not solve the problem of noise in the remaining data. Papers [4,5] discuss the filtration of complex data based on wavelet analysis; the authors propose a hybrid technology that combines wavelets with the Wiener filter. However, the problem of optimising the wavelet filter parameters on the basis of quantitative criteria of information processing quality has no final solution at present.

The unsolved part of the general problem is the absence of efficient methods for filtering complex high-dimensional data that combine modern methods of information processing with estimation of the quality of the obtained results.

The aim of the paper is to develop a technique for wavelet filtration of high-dimensional complex data in which the quality of data processing is estimated by the Shannon entropy criterion using the James-Stein shrinkage estimator.

Presentation of the basic material

A gene expression sequence is a vector whose components are the expression values of genes, which determine how the corresponding cells of a biological organism function. Three technologies are currently used to determine gene expression sequences. They are presented in Fig. 1.

[Figure 1: diagram. Methods to determine the gene expression of biological objects: DNA polymerases, DNA microarray, RNA sequencing. Technology of DNA microarray experiment data processing: background correction, normalization, PM correction, summarization, gene expression array.]

Fig. 1. Technologies to determine the gene expression sequences

DNA polymerase and RNA sequencing technologies are more accurate than DNA microarray technology. The gene expression sequences obtained by these technologies have a significantly lower level of noise, but the technologies are very expensive. DNA microarray technology allows the expression of tens of thousands of genes to be estimated concurrently. It is cheaper, but the gene expression data include a complex noise component, which is determined by the processes of creating the microchip and reading information from it. An example of a gene expression sequence of one of the studied objects, obtained by the DNA microchip method, is shown in Fig. 2.

Fig. 2. An example of gene expression sequence

The statistical characteristics of the studied vector are presented in Table 1.

Table 1. Statistical characteristics of the studied gene expression sequence

Minimum | 1st quartile | Median | Mean | 3rd quartile | Maximum
-36.19 | 2.18 | 11.10 | 236.91 | 22.63 | 17360.00

As can be seen, the gene expression sequence includes about seven thousand genes, whose expression varies in the range from -36.19 to 17360. At the same time, most of the gene expression values are low. Analysis of Fig. 2 and Table 1 also allows us to conclude that the range of variation of the noise component is significantly narrower than the range of variation of the gene expression sequence. Moreover, the frequency of the noise component is most probably higher than the frequency of the useful component of the studied vector. This allows us to use wavelet analysis to solve the problem of filtering the studied data. Wavelets are families of functions $\psi_{a,b}(t)$, which are generated from a mother wavelet $\psi(t)$ by choosing the parameters $a$ (scale parameter) and $b$ (shift parameter) [6,7]:

$$\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\,\psi\left(\frac{t - b}{a}\right). \quad (1)$$

The process of wavelet processing for the purpose of data filtration is presented in Fig. 3.

Fig. 3. The scheme of gene expression sequence wavelet processing

Approximation coefficients at level N and detail coefficients at levels 1 to N are calculated during wavelet decomposition. In most cases the noise component is contained in the detail coefficients, since these coefficients correspond to higher frequencies; therefore the detail coefficients are processed at the next step using a certain value of the thresholding coefficient. Soft thresholding was used to process the detail coefficients of the gene expression sequence. If $t$ is the thresholding coefficient value and $d$ is the detail coefficient value, then soft thresholding produces the processed coefficient $\tilde{d}$ by the formula:

$$\tilde{d} = \begin{cases} 0, & \text{if } |d| \le t, \\ \operatorname{sign}(d)\,(|d| - t), & \text{if } |d| > t. \end{cases} \quad (2)$$
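As an illustration of soft thresholding, the following minimal Python sketch applies it to a few detail coefficients. The function name `soft_threshold` and the sample values are assumptions made for this example, and the rule is applied to the magnitude of the coefficient so that negative coefficients are shrunk symmetrically.

```python
def soft_threshold(d, t):
    """Shrink one detail coefficient d towards zero by the threshold t >= 0."""
    if abs(d) <= t:
        return 0.0  # small coefficients are treated as noise and removed
    # Larger coefficients keep their sign; their magnitude is reduced by t.
    return d - t if d > 0 else d + t


# Hypothetical detail coefficients, processed with threshold t = 1.8.
details = [0.5, -0.2, 3.0, -4.5, 1.8]
print([soft_threshold(d, 1.8) for d in details])
```

Coefficients whose magnitude does not exceed the threshold become zero, while the remaining ones move towards zero by the threshold value.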

The signal is reconstructed from the approximation coefficients at level N and the processed detail coefficients at levels 1 to N. Analysis of Fig. 3 allows us to conclude that wavelet processing of the studied data involves the following:

- choice of the mother wavelet;

- determination of the wavelet decomposition level;

- choice of the type of wavelet from the family of the mother wavelet;

- determination of the thresholding coefficient value.

Each of these steps involves estimating the processing quality in order to determine the optimal parameters for processing the studied data. The Shannon entropy criterion based on the James-Stein shrinkage estimator was used to estimate the data processing quality. If $k$ is the number of cells with probabilities $p_1, p_2, ..., p_k$, where $p_i > 0$ and $\sum_{i=1}^{k} p_i = 1$, then Shannon entropy is defined as a quantitative measure of the uncertainty of the system state and is calculated as follows [8]:

$$H = -\sum_{i=1}^{k} p_i \log_2 p_i. \quad (3)$$
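Formula (3) can be illustrated with a short Python sketch. The function name `shannon_entropy` and the two example distributions are assumptions for this illustration only; cells with zero probability are skipped, which corresponds to the usual convention $0 \log_2 0 = 0$.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as a list."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]  # maximal uncertainty over 4 cells
peaked = [0.97, 0.01, 0.01, 0.01]   # almost all mass in one cell
print(shannon_entropy(uniform))     # 2.0
print(shannon_entropy(peaked))
```

The uniform distribution reaches the maximum value $\log_2 k$, while the peaked distribution gives a much smaller entropy.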

The James-Stein shrinkage estimator combines two different models: a high-dimensional model with low bias and high variance and a low-dimensional model with larger bias and lower variance [9]. The James-Stein shrinkage estimate of the probability in the $i$-th cell is calculated by the formula:

$$p_i^{Shrink} = \lambda t_i + (1 - \lambda)\, p_i^{ML}, \quad (4)$$

where $t_i = \frac{1}{k}$ is the target probability in the $i$-th cell (the uniform target of [9], for which all cells are equally probable); $p_i^{ML}$ is the probability in the $i$-th cell calculated by the maximum-likelihood method:

$$p_i^{ML} = \frac{n_i}{n}, \quad (5)$$

where $n_i$, $i = 1,...,k$, is the number of features that fall into the $i$-th cell and $n$ is their total number. $\lambda$ is the shrinkage intensity, which takes values from 0 (no shrinkage) to 1 (full shrinkage) and is calculated by the formula:

$$\lambda = \frac{1 - \sum_{i=1}^{k} \left(p_i^{ML}\right)^2}{(n - 1)\sum_{i=1}^{k} \left(t_i - p_i^{ML}\right)^2}, \quad (6)$$

where $n$ is the number of features of the studied vector. Taking the above into account, the formula for Shannon entropy using the James-Stein shrinkage estimator can be written as follows:

$$H^{Shrink} = -\sum_{i=1}^{k} p_i^{Shrink} \log_2 p_i^{Shrink}. \quad (7)$$
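Formulas (4) to (7) can be sketched in Python as shown below. This is an illustration under stated assumptions, not the authors' implementation: the paper does not say how the cells are formed, so the equal-width binning in `histogram_counts` (and its default of 10 bins) is an assumption; the target is taken to be uniform, $t_i = 1/k$, as in [9]; and the shrinkage intensity is clipped to the interval [0, 1]. All function names are hypothetical.

```python
import math


def histogram_counts(values, bins=10):
    """Assumed binning: count values in `bins` equal-width cells."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant data: everything falls into one cell
        return [len(values)] + [0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def shrinkage_probs(counts):
    """Return (shrinkage probabilities, lambda) for non-empty cell counts."""
    n, k = sum(counts), len(counts)
    target = 1.0 / k                      # uniform target t_i = 1/k
    p_ml = [c / n for c in counts]        # maximum-likelihood estimate, (5)
    denom = (n - 1) * sum((target - p) ** 2 for p in p_ml)
    if denom == 0:
        lam = 1.0                         # ML estimate already equals target
    else:
        lam = (1.0 - sum(p * p for p in p_ml)) / denom   # formula (6)
        lam = min(1.0, max(0.0, lam))     # keep lambda inside [0, 1]
    probs = [lam * target + (1.0 - lam) * p for p in p_ml]  # formula (4)
    return probs, lam


def shrinkage_entropy(counts):
    """Shannon entropy (7) in bits of the shrinkage probabilities."""
    probs, _ = shrinkage_probs(counts)
    return -sum(p * math.log2(p) for p in probs if p > 0)


counts = [4, 2, 3, 0, 2, 4, 0, 0, 2, 1, 1]  # hypothetical cell counts
probs, lam = shrinkage_probs(counts)
print(round(lam, 3), round(shrinkage_entropy(counts), 3))
```

For a real-valued vector such as the filtered data or the removed noise, the entropy would be obtained as `shrinkage_entropy(histogram_counts(values))`; the number of bins affects the result and would have to be fixed across all compared variants.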

A higher value of Shannon entropy corresponds to a smaller amount of useful information in the studied vector, and the maximum value of Shannon entropy corresponds to a white noise component. Thus, a lower Shannon entropy of the filtered vector, or a higher Shannon entropy of the removed noise component, corresponds to better quality of processing. The structural block diagram of the wavelet filtration process using the James-Stein shrinkage Shannon entropy estimator is presented in Fig. 4.

[Figure 4: block diagram. Blocks: initial data, DWT, CD1...N coefficient processing, IDWT, filtered data, differentiator, noise component, J-S shrinkage Shannon entropy estimator, result analysis.]

Fig. 4. Block diagram of the model to process the gene expression sequence using James-Stein shrinkage Shannon entropy estimator

Implementation of the process described above involves the following steps:

1. Choice of the mother wavelet from the list of available ones.

2. Determination of the optimal level of wavelet decomposition based on the maximum value of Shannon entropy of the removed noise component. At this step a randomly chosen type of wavelet from the family of the mother wavelet is used.

3. Determination of the type of wavelet from the mother wavelet family based on the maximum value of Shannon entropy of the removed noise component.

4. Determination of the thresholding coefficient value based on the minimum value of Shannon entropy of the filtered data.
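The decomposition, thresholding and reconstruction chain behind these steps can be sketched as follows. To stay self-contained the sketch uses the Haar wavelet (db1) implemented in plain Python as a stand-in for a wavelet library and for the higher-order Daubechies wavelets; the function names are hypothetical, and the signal length is assumed to be divisible by $2^{level}$. The noise component is taken as the difference between the initial and the filtered data.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal):
    """One Haar analysis step: returns (approximation, detail)."""
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in pairs]
    return approx, detail


def haar_idwt(approx, detail):
    """One Haar synthesis step, the exact inverse of haar_dwt."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def soft(d, t):
    """Soft thresholding of a single coefficient."""
    return 0.0 if abs(d) <= t else math.copysign(abs(d) - t, d)


def wavelet_filter(signal, level, t):
    """Decompose to `level`, soft-threshold all detail bands, reconstruct.

    Returns (filtered, noise), where noise = signal - filtered.
    """
    approx, bands = list(signal), []
    for _ in range(level):
        approx, detail = haar_dwt(approx)
        bands.append([soft(d, t) for d in detail])
    for detail in reversed(bands):  # deepest level is reconstructed first
        approx = haar_idwt(approx, detail)
    noise = [x - y for x, y in zip(signal, approx)]
    return approx, noise


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]  # hypothetical data
filtered, noise = wavelet_filter(signal, level=2, t=1.0)
print(filtered)
print(noise)
```

Steps 2 and 3 would call such a filter for each candidate level or wavelet type and keep the variant whose `noise` has the highest entropy, while step 4 would vary `t` and keep the value that gives the lowest entropy of `filtered`.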

The practical implementation of the presented technology was performed using the family of Daubechies wavelets. Wavelets db1 (Haar) to db45 were used during the simulation. The thresholding coefficient value was determined in two ways. The first way involves step-by-step removal of the noise component using a low constant value of the thresholding coefficient. Shannon entropy of the filtered data is estimated at each step of the filtration process, and the duration of the process is limited by the number of noise removal steps. The second way involves a step-by-step increase of the thresholding coefficient value from $\tau_{min}$ to $\tau_{max}$ with step $d\tau$, with Shannon entropy of the filtered data estimated at each step. In both cases the final decision about the optimal step of data processing is made based on the minimum value of Shannon entropy of the filtered data. The results of the simulation for gene expression sequence filtering are shown in Fig. 5. Fig. 6 shows the filtered sequence and the noise component removed from the sequence.
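The second way of choosing the thresholding coefficient can be sketched as a simple sweep. The helper `sweep_threshold` and its arguments are hypothetical; in the setting of the paper `filter_fn` would be the wavelet filter and `criterion_fn` the James-Stein shrinkage Shannon entropy of the filtered data, whereas the two toy functions below are placeholders that only make the example runnable.

```python
def sweep_threshold(data, filter_fn, criterion_fn, t_min, t_max, dt):
    """Try thresholds t_min, t_min + dt, ..., t_max and return
    (best_t, best_value, trace) for the minimum of the criterion."""
    steps = int(round((t_max - t_min) / dt))
    trace = []
    for i in range(steps + 1):
        t = t_min + i * dt  # computed from i to avoid accumulated drift
        trace.append((t, criterion_fn(filter_fn(data, t))))
    best_t, best_value = min(trace, key=lambda pair: pair[1])
    return best_t, best_value, trace


def toy_filter(data, t):
    """Placeholder filter: zero out values with magnitude up to t."""
    return [0.0 if abs(x) <= t else x for x in data]


def toy_criterion(filtered):
    """Placeholder criterion: number of distinct values (not an entropy)."""
    return len(set(filtered))


data = [0.1, -0.2, 0.3, 5.0, 5.0, -0.4]
best_t, best_value, trace = sweep_threshold(
    data, toy_filter, toy_criterion, 0.0, 1.0, 0.1)
print(best_t, best_value)
```

When several thresholds give the same minimum, the sketch returns the smallest of them, which removes the least from the data; a different tie-breaking rule may be preferable in practice.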

[Figure 5: four panels. a) Shannon entropy of the noise versus the wavelet decomposition level; b) Shannon entropy of the noise versus the type of the Daubechies wavelet; c) Shannon entropy of the data versus the step of noise removal; d) Shannon entropy of the data versus the thresholding coefficient value.]

Fig. 5. Charts of James-Stein shrinkage Shannon entropy versus: a) the level of wavelet decomposition; b) the type of the Daubechies wavelet; c) the step of noise component removal; d) the thresholding coefficient value

[Figure 6: two panels plotted against conditions (0 to 7000). a) Filtered data using the db5 Daubechies wavelet, Shannon entropy = 3.913; b) Removed noise, Shannon entropy = 6.77.]

Fig. 6. Results of the simulation: a) filtered gene expression sequence; b) removed noise component

Analysis of the obtained results allows us to conclude that, in terms of the Shannon entropy criterion, the optimal processing of the gene expression sequence uses the db5 wavelet with two levels of wavelet decomposition and a thresholding coefficient value of 1.8. Comparison of the charts in Fig. 5c and 5d also allows us to conclude that a step-by-step increase of the thresholding coefficient value with a small step is more effective than step-by-step removal of the noise component with a constant value of the thresholding coefficient: the local minimum of Shannon entropy in Fig. 5d is more pronounced than the local minimum in Fig. 5c.

Conclusion

The paper presents a technology for filtering a gene expression sequence based on the combined use of wavelet analysis and the James-Stein shrinkage estimator. This technology makes it possible to determine the optimal parameters of the wavelet filter in terms of a quantitative criterion of data processing quality. The family of Daubechies wavelets was used during the simulation. The James-Stein shrinkage estimator was chosen for calculating Shannon entropy because of the combined character of this method: it takes into account two very different models, a high-dimensional model with low bias and high variance and a low-dimensional model with larger bias and lower variance. This allows a more objective estimation of the data processing quality. The wavelet filtering of the gene expression sequence includes three stages. At the first stage the level of wavelet decomposition is determined. At the second stage the type of Daubechies wavelet is chosen. In both cases the decision is made based on the maximum Shannon entropy of the removed noise component. At the third stage the thresholding coefficient value is determined based on the minimum Shannon entropy of the filtered data. The thresholding coefficient value was determined in two ways. The first way involves step-by-step removal of the noise component using a low constant value of the thresholding coefficient; Shannon entropy of the filtered data is estimated at each step, and the duration of the filtration process is limited by the number of noise removal steps. The second way involves a step-by-step increase of the thresholding coefficient value from $\tau_{min}$ to $\tau_{max}$ with step $d\tau$.

The results of the simulation show that a step-by-step increase of the thresholding coefficient value with a small step is more effective than step-by-step removal of the noise component with a constant value of the thresholding coefficient. Moreover, in terms of the Shannon entropy criterion, the optimal processing of the gene expression sequence uses the db5 wavelet with two levels of wavelet decomposition and a thresholding coefficient value of 1.8. The authors' further research is aimed at creating a comprehensive technology for preprocessing gene expression sequences, in which data filtering will be one of the processing stages.

References

1. Wu Z. Exploration, visualization, and preprocessing of high-dimensional data / Z. Wu // Methods in Molecular Biology. - 2010. - Vol. 620. - P. 267-284.

2. Ozsolak F. RNA sequencing: advances, challenges and opportunities / F. Ozsolak, P.M. Milos // Nature Reviews Genetics. - 2011. - Vol. 12. - P. 87-98.

3. Babichev S. Filtration of DNA nucleotide gene expression profiles in the systems of biological objects clustering / S. Babichev, M.A. Taiff, V. Lytvynenko // International Frontier Science Letters. - 2016. - Vol. 8. - P. 1-8.

4. Joshi A. Analysis of Adaptive Wavelet Wiener Filtering for ECG Signals: Review / A. Joshi, H.S. Aravind // International Journal of Advanced Research in Electronics and Communication Engineering. - 2014. - Vol. 3. - Issue 4. - P. 395-398.

5. Chandu R. ECG Signal Filtering using an Improved Wavelet Wiener Filtering / R. Chandu, M. Venkateswarlu // International Journal of Advanced Technology and Innovative Research. - 2015. - Vol. 7. - Issue 7. - P. 1242-1247.

6. Daubechies I. The wavelet transform, time-frequency localization and signal analysis / I. Daubechies // IEEE Trans. Inform. Theory. - 1990. - Vol. 36. - P. 961-1005.

7. Coifman R.R. Wavelet Analysis and Signal Processing / R.R. Coifman, Y. Meyer, M.V. Wickerhauser // Wavelets and Their Applications. - Boston: Jones and Bartlett, 1992. - P. 153-178.

8. Shannon C.E. A mathematical theory of communication / C.E. Shannon // Bell System Technical Journal. - 1948. - Vol. 27. - P. 379-423, 623-656.

9. Hausser J. Entropy inference and James-Stein estimator, with application to nonlinear gene association networks / J. Hausser, K. Strimmer // Journal of Machine Learning Research. - 2009. - Vol. 10. - P. 1469-1484.
