Model of filtration system of DNA nucleotides gene expression profiles

Research article in the field of medical technologies. Author: S.A. Babichev. License: CC BY.

MATHEMATICAL MODELLING OF PHYSICAL AND TECHNOLOGICAL PROCESSES AND TECHNICAL SYSTEMS

UDC 004.048

S.A. BABICHEV

Jan Evangelista Purkyně University in Ústí nad Labem, Czech Republic

MODEL OF FILTRATION SYSTEM OF DNA NUCLEOTIDES GENE EXPRESSION PROFILES

The article presents research on optimizing the filtration of gene expression profiles of DNA nucleotides. Filtration was carried out according to the conditions under which the expression of the corresponding gene was detected; the variance of gene expression, the absolute value of expression and the Shannon entropy were used as criteria. The value of the thresholding coefficient was estimated on the basis of the average measure of proximity of objects within a homogeneous group and between groups. The quality of information processing was estimated by a comparative analysis of the clustering results for processed and unprocessed data.

Keywords: gene expression, filtration, DNA nucleotide, thresholding, clustering


Problem statement

Functional genomics is currently one of the most topical directions in bioinformatics. Its main task is to analyze and implement the mechanisms by which information recorded in the genome of biological cells is transferred from gene to feature. The state of the object is subsequently identified on this basis. RNA sequencing [1] and the analysis of DNA microarray data [2-5] are the basic methods for determining gene expression today. Each of these methods has its advantages and disadvantages. RNA sequencing provides direct information about the nucleotide sequence of the RNA molecules of the investigated genome, which in turn allows the absolute expression value of the corresponding gene to be determined. The high cost of the experiment is the main disadvantage of this technology. The advantages of DNA microarray technology are its low cost, the possibility of analyzing tens of thousands of genes simultaneously, and its availability for practical implementation. Its main disadvantage is the high error of the results, caused by the high level and specific character of the noise component that arises when the microarray is created and when information is read from it. The development of effective methods for preprocessing DNA microarray data on the basis of modern computer methods of information processing is therefore highly relevant.

Analysis of recent research and publications

The research papers [3-6] are devoted to issues of DNA microarray data processing. The authors carry out a detailed analysis of the various stages of creating a DNA microarray, reading information from it, and postprocessing in order to estimate the gene expression level of the studied objects. The article [7] presents the results of experiments on clustering cancer patient data with different clustering algorithms. The authors conducted research to choose the optimal group of data preprocessing methods, and the Shannon entropy was used as the main criterion in this case. However, it should be noted that these works do not pay considerable attention to data filtration that takes the specifics of the noise (non-specific hybridization) into account.

The unsolved part of the general problem is the absence of effective filtering algorithms for DNA microarray data that focus on removing nonspecifically hybridized genes as well as genes that do not carry essential information about the features of the analyzed objects.

The aim of the article is to develop a step-by-step technique for filtering the gene expression profiles of DNA microarray data, based on the combined use of different criteria for estimating the variation of gene expression in biological objects.

Presentation of the main material

The result of DNA microarray scanning is a matrix of light intensities of size $(n \times m)$: $A = \{x_{ij}\}$, $i = 1,\dots,n$, $j = 1,\dots,m$, where $n$ is the number of experiments carried out or objects investigated, and $m$ is the number of conditions under which the corresponding genes were expressed. It is obvious that the level of gene expression should vary significantly under the same conditions for radically different biological objects, because different biological processes occur in them. Therefore, genes whose expression does not satisfy this condition may be deleted from the data array as uninformative, which increases the resolution of the subsequent analysis. Filtration of DNA microarray data involves the following stages:

- removal of genes with missing expression values, which arise when some samples turn out to be unhybridized;

- removal of genes with low profile variance. Low variance indicates that the gene expression level changes insignificantly from one object to another, which does not contribute to high-quality subsequent information processing;

- removal of columns with low absolute values of gene expression. Poor hybridization is the cause of low absolute values of gene expression profiles;

- removal of genes with a high value of Shannon entropy [6]:

$$E_j = -\sum_{i=1}^{n} p(x_{ij}) \log_2 p(x_{ij}) \qquad (1)$$

where $E_j$ is the entropy of the $j$-th gene and $p(x_{ij})$ is the probability of realization of the state of the $j$-th gene for the $i$-th object. In accordance with the classical definition, entropy is a quantitative measure of the randomness of the structural elements in a system. A low entropy value corresponds to high informativity due to the high ordering of the gene expression level over the set of studied objects. A high entropy corresponds to a high level of disorder in the distribution of expression profiles over different objects, which can be interpreted as noise.
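The entropy criterion of formula (1) can be illustrated with a minimal sketch. The source does not state how $p(x_{ij})$ is estimated, so the code assumes that it is the share of object $i$ in the total expression of gene $j$ (non-negative data); the function name `gene_entropy` and the toy matrix are hypothetical.

```python
import numpy as np

def gene_entropy(X):
    """Shannon entropy (in bits) of every column (gene) of X.

    X has shape (n_objects, n_genes) and must be non-negative.
    Assumption: p(x_ij) is the share of object i in the total
    expression of gene j; the source does not define p explicitly.
    """
    X = np.asarray(X, dtype=float)
    totals = X.sum(axis=0, keepdims=True)
    # Genes with zero total expression get p = 0 everywhere (entropy 0).
    P = np.divide(X, totals, out=np.zeros_like(X), where=totals > 0)
    # Treat 0 * log2(0) as 0, the usual convention.
    logP = np.log2(P, out=np.zeros_like(P), where=P > 0)
    return -(P * logP).sum(axis=0)


# Toy data: gene 0 is expressed equally in all 4 objects,
# gene 1 is expressed in a single object only.
X = np.array([[1.0, 4.0],
              [1.0, 0.0],
              [1.0, 0.0],
              [1.0, 0.0]])
E = gene_entropy(X)
```

Under this assumption a uniformly expressed gene reaches the maximum entropy $\log_2 n$, while a gene expressed in a single object has zero entropy.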

The threshold value of the variation feature, which defines the boundary dividing the set of genes into informative and non-informative ones, is determined by the condition:

$$\forall e_j : f(e_j) < k \cdot \min(f(e_j)),\ k > 1 \quad \text{or} \quad \forall e_j : f(e_j) > k \cdot \max(f(e_j)),\ k < 1 \qquad (2)$$

where $j = 1,\dots,m$ is the number of the condition of gene expression determination, $f(e_j)$ is the value of the variation feature for the $j$-th gene, and $k$ is the thresholding coefficient, which is determined empirically in each case. The value of the thresholding coefficient was chosen by analyzing, as columns of the studied data were removed, the changes in the mean-square distance from objects to the mass center of a homogeneous group of objects (cluster):

$$D_s = \frac{1}{n_1} \sum_{i=1}^{n_1} d^2(x_i, C_s) \qquad (3)$$

and the average intercluster distance, which is determined for two clusters as the mean-square distance from the objects of cluster $S$ to the mass center of cluster $P$ and vice versa:

$$D_{out} = \frac{1}{2}\left(\frac{1}{n_1} \sum_{i=1}^{n_1} d^2(x_i, C_P) + \frac{1}{n_2} \sum_{i=1}^{n_2} d^2(x_i, C_S)\right) \qquad (4)$$

where $n_1$ and $n_2$ are the numbers of objects in clusters $S$ and $P$ respectively, and $C$ is the mass center of the corresponding cluster:

$$C = \frac{1}{n \cdot m} \sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij} \qquad (5)$$

where $n$ is the number of objects and $m$ is the number of features that characterize the objects. In the case of several clusters, the average intracluster distance was determined as the average of the intracluster distances of all $q$ clusters:

$$D_{in} = \frac{1}{q} \sum_{s=1}^{q} D_s, \qquad (6)$$

and the average intercluster distance was determined as the average of the intercluster distances over all pairs of clusters. However, the absolute values of the criteria defined by formulas (3) - (6) have a disadvantage. As the number of removed columns grows (that is, as the thresholding coefficient $k$ changes), the average density of objects within the clusters can decrease, which increases the average intracluster distance. At the same time, the average intercluster distance may increase faster, and such an increase indicates a better separation of objects. In this case it is appropriate to use a complex relative criterion whose extremum allows a well-founded choice of the thresholding coefficient:

$$R = \frac{D_{out}}{D_{in}} \qquad (7)$$
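The relative criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mass center of a cluster is taken as the per-feature mean vector of its objects, the intercluster value for a pair of clusters is the average of the two directional mean squared distances, and cluster labels are known in advance. The function names `mean_sq_dist` and `relative_criterion` are hypothetical.

```python
from itertools import combinations

import numpy as np

def mean_sq_dist(points, center):
    """Mean squared Euclidean distance from each row of points to center."""
    return float(((points - center) ** 2).sum(axis=1).mean())

def relative_criterion(X, labels):
    """Return (D_in, D_out, R) for data X and cluster labels.

    Assumptions: the cluster center is the per-feature mean vector,
    and at least two clusters are present.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = [X[labels == c] for c in np.unique(labels)]
    centers = [g.mean(axis=0) for g in groups]
    # Average intracluster distance over all clusters.
    d_in = np.mean([mean_sq_dist(g, c) for g, c in zip(groups, centers)])
    # Average intercluster distance over all pairs of clusters.
    pair_values = []
    for s, p in combinations(range(len(groups)), 2):
        pair_values.append(0.5 * (mean_sq_dist(groups[s], centers[p])
                                  + mean_sq_dist(groups[p], centers[s])))
    d_out = np.mean(pair_values)
    return d_in, d_out, d_out / d_in


# Toy data: two compact clusters far apart along the first feature.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
d_in, d_out, r = relative_criterion(X, labels)
```

Evaluating `relative_criterion` for a grid of thresholding coefficients and taking the value of $k$ with the largest ratio mirrors the selection procedure described in the text.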

As the experimental base we used the lung cancer patient dataset E-GEOD-68571 from the ArrayExpress database [7], which includes the gene expression profiles of 95 patients, ten of whom are healthy (Norm). The remaining 85 patients were divided by the degree of the disease into three groups: 23 patients in a good state (Well), 41 patients in a moderate state (Moderate, Md), and 21 patients in a poor state (Poor). The scanned DNA microarray image was processed as follows: background correction by the rma method, quantile normalization, PM correction by the mas method and summarization by the mas method. As a result, a (96x7129) matrix of gene expression profiles was obtained, where the number of rows corresponds to the number of experiments and the number of columns to the number of conditions under which gene expression was estimated. The obtained data matrix contained no missing values.

Results of the experiment are shown in Fig. 1-3.

Fig. 1. Experiment results of data filtration based on the gene expression variance criterion: a) the plot of average intracluster (Euclidean) distance against thresholding coefficient; b) the plot of average intercluster (Euclidean) distance against thresholding coefficient; c) the plot of relative distance against thresholding coefficient; d) distribution of studied object genes in case of low and high variance (against the number of objects)

Analysis of the plots in Fig. 1a and 1b shows that the absolute values of the intracluster and intercluster distances are ineffective in this case, because both criteria increase monotonically as the thresholding coefficient increases. However, Fig. 1c shows that the relative distance calculated by formula (7) reaches its maximum at the thresholding coefficient $k = 14$. In this case 283 columns were deleted, and the new gene expression matrix had the dimension (96x6846). Fig. 1d shows the distribution of gene expression for low and high variance. It follows from the figure that removing columns with low gene expression variance is reasonable, because these columns are not informative for identifying the studied objects.
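The thresholding rule of condition (2) can be sketched as follows, assuming positive per-gene scores stored in a NumPy array. The helper names `filter_low` and `filter_high` and the toy matrix are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def filter_low(X, scores, k):
    """Drop columns whose score is below k * min(scores), with k > 1."""
    keep = scores >= k * scores.min()
    return X[:, keep], keep

def filter_high(X, scores, k):
    """Drop columns whose score is above k * max(scores), with k < 1."""
    keep = scores <= k * scores.max()
    return X[:, keep], keep


# Toy data: 3 objects (rows) and 3 genes (columns);
# the first gene barely changes between objects.
X = np.array([[1.0, 0.0, 5.0],
              [1.2, 9.0, 1.0],
              [0.8, 3.0, 9.0]])
variances = X.var(axis=0)
X_filtered, kept = filter_low(X, variances, k=14)
```

`filter_low` corresponds to the variance and absolute-value criteria, where low scores mark uninformative genes, and `filter_high` to the entropy criterion, where high scores are treated as noise.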


Analysis of the plots in Fig. 2 shows that increasing the thresholding coefficient to 1.2 has no significant influence on the quality criteria of object grouping. At $k = 1.2$, 89 columns of data were removed from the matrix. A further increase of the thresholding coefficient led to a steep increase in the amount of deleted information, which carries the risk of reducing the informativity of the feature space. Therefore, the matrix of studied data has the size (96x6760) at this stage. An example of gene expression distribution for two objects with low and high absolute values of expression is shown in Fig. 2d.

Fig. 2. Experiment results of data filtration based on the gene expression absolute value criterion: a) the plot of average intracluster (Euclidean) distance against thresholding coefficient; b) the plot of average intercluster (Euclidean) distance against thresholding coefficient; c) the plot of relative distance against thresholding coefficient; d) distribution of studied object genes in case of low and high absolute values (against the number of objects)

Fig. 3 shows the results of the experiment with the entropy criterion. On the basis of the analysis of the experiment results, the thresholding coefficient was set to 0.999; 101 columns were deleted, and the new data matrix has the size (96x6659). An example of gene expression distribution for two objects with low and high Shannon entropy of expression is shown in Fig. 3d.

The effectiveness of the proposed method was estimated by clustering the objects of the initial set of normalized data, the principal components of the initial data set and the principal components of the filtered data. The simulation was carried out in the KNIME software using the SOTA clustering algorithm [10]. All meaningful principal components were taken for the component analysis, and the size of the array of studied objects decreased to (96x94). The results of the simulation are shown in Table 1.
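The dimension reduction step can be illustrated with a principal component projection computed through the singular value decomposition. This is a sketch only: the source does not say how "meaningful" components were selected, so the code assumes that components with numerically negligible variance are dropped, and the function name `principal_components` and the tolerance `tol` are hypothetical. The SOTA clustering performed in KNIME is not reproduced here.

```python
import numpy as np

def principal_components(X, tol=1e-10):
    """Project X onto principal components with non-negligible variance."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center every feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    keep = s > tol * s.max()                 # drop numerically null components
    return U[:, keep] * s[keep]              # scores: (n_objects, n_kept)


# Toy sizes: 6 objects and 50 features instead of 96 and several thousand.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))
scores = principal_components(X)
```

Because centering removes one degree of freedom, a data set with far more features than objects yields at most one component fewer than the number of objects.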

Table 1

Clustering error   Initial set   Denoise set   PC initial data   PC filtr. data
Well→Md            5             5             5                 4
Md→Well            3             3             3                 2
Poor→Md            5             5             5                 4
Md→Poor            8             7             8                 7
Poor→Well          1             1             1                 -

Analysis of the data in Table 1 allows us to conclude that the proposed method is effective, because the principal components of the filtered data give a smaller clustering error under the same clustering conditions. The system divided the studied objects into patients and non-patients in all cases; however, the results differ within the patient group. The intersection of the moderate cluster with the poor and with the well clusters is quite logical, since a moderate state can be either moderate-poor or moderate-well, but for the principal components of the filtered data this intersection is smaller than in all other cases. An intersection of the clusters with poor and well states, however, is inadmissible. This intersection was absent only in the analysis of the principal components of the filtered data.


Fig. 3. Experiment results of data filtration based on the Shannon entropy criterion of gene expression: a) the plot of average intracluster (Euclidean) distance against thresholding coefficient; b) the plot of average intercluster (Euclidean) distance against thresholding coefficient; c) the plot of relative distance against thresholding coefficient; d) distribution of studied object genes in case of low and high entropy (against the number of objects)

Conclusion

This article presents a method for the step-by-step filtration of DNA nucleotide data obtained from DNA microarray experiments. The variance of gene expression, the absolute value of expression and the Shannon entropy were used as criteria for estimating the informativity of the feature vectors of the studied objects. As the experimental base we used the lung cancer patient dataset E-GEOD-68571 from the ArrayExpress database, which includes the gene expression profiles of 95 patients, ten of whom are healthy (Norm), while 85 patients are divided by the degree of the disease into three groups: 23 patients in a good state (Well), 41 patients in a moderate state (Moderate, Md), and 21 patients in a poor state (Poor). The thresholding coefficient for removing non-informative columns was evaluated by calculating intergroup and intragroup distances, with the Euclidean distance used as the measure of proximity. During data filtration 470 columns were removed, and the dimension of the initial array of studied data changed from (96x7129) to (96x6659). To evaluate the effectiveness of the method, cluster analysis was carried out with the SOTA clustering algorithm; the principal components were calculated at the preliminary stage in order to reduce the dimension of the feature space, and the number of columns of the database was reduced to 94. The results of the experiment showed a higher clustering quality for the principal components of the filtered data, because only in this case was no intersection observed between the clusters of objects with poor and well states inside the patient group. Moreover, the use of the principal components of the filtered data gave the best separation of the studied objects during clustering.

References

1. Ozsolak F. RNA sequencing: advances, challenges and opportunities / F. Ozsolak, P.M. Milos // Nature Reviews Genetics. - 2011. - Vol.12. - P.87-98.

2. Schena M. Microarray biochip technology / M. Schena, R.W. Davis // Eaton Publishing. - 2000. - P. 1-18.

3. Baldi P. DNA Microarrays and gene expression: From experiments to data analysis modeling / P. Baldi, G.W. Hatfield // Cambridge University Press. - 2011. - 15 p.

4. Berthold M.R. Data Preparation, Guide to Intelligent Data Analysis / M.R. Berthold, C. Borgelt, F. Höppner, F. Klawonn // Springer-Verlag London Limited. - 2010. - 394 p.

5. Jianan W. A Novel Workflow for Microarray Data Analysis under Expression Level / W. Jianan, Z. Chunguang, L. Zhangxu, X. Xuefei, Z. You, L. Guixia // Information & Computational Science. - 2012. - Vol. 16(9). - P. 4745-4754.

6. Shannon C.E. A mathematical theory of communication / C.E. Shannon // Bell System Technical Journal. - 1948. - Vol. 27. - P. 379-423, 623-656.

7. Beer D.G. Gene-expression profiles predict survival of patients with lung adenocarcinoma / D.G. Beer, S.L. Kardia, C.C. Huang, T.J. Giordano, et al. // Nature Medicine. - 2002. - Vol. 8(8). - P. 816-824.
