Information-computation system for genetic data analysis based on a permutation test

Valeriy Viktorovich Kovalevsky; Marat Samatovich Khairetdinov; Aleksandr Aleksandrovich Yakimenko; Mikhail Vladimirovich Grishchenko

UDC 004.942

THE INFORMATION COMPUTATION SYSTEM FOR GENETIC DATA ANALYSIS ON THE BASIS OF A PERMUTATION TEST

Valeriy Viktorovich KOVALEVSKY¹, Marat Samatovich KHAIRETDINOV¹,², Aleksandr Aleksandrovich YAKIMENKO¹,², Mikhail Vladimirovich GRISHCHENKO²

1 Institute of Computational Mathematics and Mathematical Geophysics of SB RAS, 630090, Novosibirsk, Lavrent'ev ave., 6

2 Novosibirsk State Technical University, 630073, Novosibirsk, Karl Marx ave., 20

Genetic data analysis has been performed on a hybrid supercomputer. The architecture of an information-computation system for finding statistically significant overrepresented gene characteristics in a given set is presented. The capabilities of one-level parallelization are demonstrated, and ways of modernizing the developed software are proposed. Modern parallelization technologies are used.

Keywords: genetic hypotheses, multiple testing, permutation test, GPU, computer technology, information system.

Genetic analysis is the branch of genetics that deals with the mechanisms of genetic determination of traits. An important place in it is occupied by problems of formal genetics: the formalization of inheritance models and the testing of genetic hypotheses on specific empirical material. Resampling methods are widely used to solve such problems [1, 3]. These methods were chosen because they need no information about the distribution law of the data in the general population; instead, they examine the sample data in various combinations, as if viewing them from different angles. With resampling methods there is no need to correct the statistical significance level for the simultaneous testing of many statistical hypotheses, which in biological data analysis reflect the contribution of many factors to the formation of one hereditary trait. In most biological investigations, resampling methods are more correct than analytical methods, but they require considerable computing power.

To apply these methods, the authors developed a parallel computer information technology that searches for statistically overrepresented traits of genes under various external or internal conditions; it implements the permutation test algorithm [6] on graphics processors. However, the requirements on this earlier technology have grown. First of all, both the number of genetic hypotheses to be tested simultaneously and the required calculation efficiency have increased. New requirements have also emerged: remote access to the permutation test program, delivery of results to the user, and a medium for exchanging experience and bibliographical data on the subject. This stimulates further development of the technology on a hybrid supercomputer, which is the goal of this paper.

ARCHITECTURE OF THE COMPUTER INFORMATION SYSTEM

The general structure of the program is presented in Fig. 1. The program runs in three major stages: 1) reading the input file and forming data arrays in a form convenient for further calculations; 2) calculating the sums (or other quantities, for instance, means or variances) of the measured gene values for each of the gene properties (for instance, functional annotations, FAs), followed by the cycle that shuffles the array elements and obtains the quantities needed for the statistics; 3) calculating p-values and forming a file with the results.

Fig. 1. Program execution structure

Kovalevsky V.V. - Doctor of Technical Sciences, deputy director for science, head of the laboratory for geophysical informatics, e-mail: kovalevsky@sscc.ru

Khairetdinov M.S. - Doctor of Technical Sciences, professor, head of the chair for network information technologies, e-mail: marat@opg.sscc.ru

Yakimenko A.A. - Candidate of Technical Sciences, associate professor, e-mail: yakimenko@corp.nstu.ru

Grishchenko M.V. - student, e-mail: mikhail.grishch@gmail.com

At the first stage, the input consists of two text files: one with the parameters of the permutation test (the number of iterations and the number of permutations per iteration) and one with the input data. The input data file must be read into main memory and represented as data structures convenient for processing. The file format is shown in Fig. 2. The informative data here are the identifiers of functional annotations and the characteristics of a gene. Their representation plays an important role in increasing the program speed. It is therefore reasonable to store the gene information in two arrays (instead of a std::map structure): a two-dimensional array recording which FAs belong to each gene, and a one-dimensional array containing the measured characteristics of the genes. This approach reduces the amount of data transferred to GPU memory and makes it possible to use the standard cuBLAS library for matrix-vector multiplication.
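The two-array representation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name build_arrays and the in-memory record layout (gene identifier, list of FA identifiers, characteristic) are assumptions made for illustration, and the records are taken as already parsed from the input file.

```python
def build_arrays(genes):
    """Build a 0/1 gene-by-FA matrix and a vector of gene characteristics.

    genes: list of (gene_id, [fa_id, ...], characteristic) tuples.
    This record layout is a hypothetical stand-in for the parsed input file.
    """
    # Fix a column order for the functional annotations.
    fa_ids = sorted({fa for _, fas, _ in genes for fa in fas})
    column = {fa: j for j, fa in enumerate(fa_ids)}

    membership = [[0] * len(fa_ids) for _ in genes]  # rows: genes, columns: FAs
    values = []                                      # one characteristic per gene
    for i, (_, fas, value) in enumerate(genes):
        for fa in fas:
            membership[i][column[fa]] = 1
        values.append(float(value))
    return fa_ids, membership, values


genes = [
    ("1010", ["GO:0007067", "GO:0008270"], 0.5),
    ("1011", ["GO:0005515"], 1.5),
    ("1012", ["GO:0005515", "GO:0007067"], 2.0),
]
fa_ids, membership, values = build_arrays(genes)
```

Both results are flat numeric arrays, which is the kind of layout that can be copied to GPU memory in one transfer, unlike a tree-based map.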

Due to some peculiarities of calculations on the GPU [2, 5], a large part of the code cannot be parallelized. Therefore, only the algorithm for calculating the sums of the measured gene characteristics for each of the available FAs was implemented on the GPU. The algorithm is as follows. The set of functional annotations is represented as a two-dimensional array of zeros and ones, with the available genes in the rows and the FAs in the columns. An element is one if the corresponding FA is included in the gene, and zero otherwise. Multiplying the transpose of this MxN array (where M is the number of rows, i.e., genes, and N is the number of columns, i.e., FAs) by a one-dimensional array of length M containing the values of the gene characteristics then yields an array of length N that holds, for each FA, the sum of the characteristics of all genes that include it.
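In other words, if A is the 0/1 gene-by-FA array and x is the vector of gene characteristics, the j-th sum is s_j = A_1j x_1 + ... + A_Mj x_M. The following sketch computes this product on the CPU in plain Python; it only illustrates the arithmetic and is not the GPU code described in the paper. The function name fa_sums and the sample data are assumptions.

```python
def fa_sums(membership, values):
    """Multiply the transposed 0/1 membership matrix by the value vector.

    membership: M x N list of lists (rows: genes, columns: FAs).
    values: list of M gene characteristics.
    Returns a list of N per-FA sums.
    """
    n_fas = len(membership[0])
    sums = [0.0] * n_fas
    for row, value in zip(membership, values):
        for j, flag in enumerate(row):
            sums[j] += flag * value  # contributes only where flag == 1
    return sums


membership = [[0, 1, 1],
              [1, 0, 0],
              [1, 1, 0]]
values = [0.5, 1.5, 2.0]
sums = fa_sums(membership, values)
```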

FURTHER DEVELOPMENT OF THE SYSTEM

Several hypotheses with the same set of functional annotations may have to be verified. In this case, the values of gene characteristics are represented not by a vector but by an MxK matrix (where M is the number of genes and K is the number of hypotheses) in which every column corresponds to one hypothesis. The result is a two-dimensional NxK array, where N is the number of FAs [7]. The following changes were made to the existing algorithm in order to verify several hypotheses simultaneously:

Fig. 2. Representation of a gene in the input file. Each record contains a gene identifier (e.g., 1010), the functional annotation identifiers of the gene (e.g., GO:0007067; GO:0008270), a separator between functional annotations inside the gene (shown as a semicolon), and a gene characteristic (expression level, rate of evolution, etc.)

• Multiple testing of hypotheses was added by replacing matrix-vector multiplication with matrix-matrix multiplication (Fig. 3). The resulting matrix contains, for each hypothesis in question, an array with the per-FA sums over all genes.

• Element-by-element permutations of gene characteristics were replaced by permutations of whole arrays of gene characteristics. This is admissible because in the general case all hypotheses are independent, so separate independent permutations of the gene characteristic values for each hypothesis are not needed.
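The two changes can be sketched together in plain Python. This is a CPU illustration under stated assumptions, not the authors' GPU code: the function names are hypothetical, whole rows of the MxK matrix are shuffled so that one permutation serves all K hypotheses, and the p-value is estimated with the common one-sided formula (b + 1) / (n + 1), where b is the number of permutations whose sum is at least the observed one. The paper does not state which p-value formula it uses.

```python
import random


def fa_sums_multi(membership, values):
    """Return the N x K product of the transposed membership matrix and values.

    membership: M x N 0/1 matrix (genes x FAs); values: M x K matrix
    (genes x hypotheses).
    """
    n_fas, n_hyp = len(membership[0]), len(values[0])
    sums = [[0.0] * n_hyp for _ in range(n_fas)]
    for row, gene_values in zip(membership, values):
        for j, flag in enumerate(row):
            if flag:
                for k, v in enumerate(gene_values):
                    sums[j][k] += v
    return sums


def permutation_p_values(membership, values, n_permutations, seed=0):
    """Estimate one-sided p-values for every (FA, hypothesis) pair.

    Assumption: p = (b + 1) / (n + 1), upper tail.
    """
    rng = random.Random(seed)
    observed = fa_sums_multi(membership, values)
    n_fas, n_hyp = len(observed), len(observed[0])
    exceed = [[0] * n_hyp for _ in range(n_fas)]
    rows = [list(r) for r in values]
    for _ in range(n_permutations):
        rng.shuffle(rows)  # permute whole rows: one permutation for all hypotheses
        permuted = fa_sums_multi(membership, rows)
        for j in range(n_fas):
            for k in range(n_hyp):
                if permuted[j][k] >= observed[j][k]:
                    exceed[j][k] += 1
    return [[(exceed[j][k] + 1) / (n_permutations + 1) for k in range(n_hyp)]
            for j in range(n_fas)]


membership = [[1, 0], [1, 0], [0, 1], [0, 1]]                # 4 genes, 2 FAs
values = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]]    # 4 genes, 2 hypotheses
observed = fa_sums_multi(membership, values)
p_values = permutation_p_values(membership, values, n_permutations=200)
```

Shuffling rows rather than individual elements keeps the K values of a gene together, so a single pass over the permuted matrix updates the statistics of all hypotheses at once.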

Fig. 4 shows how introducing multiple testing of hypotheses affects the total execution time of the program. The thin line shows the execution time without multiple testing, using matrix-vector multiplication; it can be interpreted as running the program successively a certain number of times. The heavy line shows the time spent on simultaneous testing of the same number of hypotheses using matrix-matrix multiplication. For a single experiment, matrix-vector multiplication has an evident advantage, because the function for matrix-vector multiplication is somewhat simpler than that for matrix-matrix multiplication. As can be expected, the larger the number of hypotheses being tested, the greater the advantage of matrix-matrix multiplication.

In the scheme of Fig. 3, each element of the product matrix is computed by a separate GPU thread: $c_{ij} = a_{i0} b_{0j} + \dots + a_{im} b_{mj}$, from $c_{00}$ (Thread 1) to $c_{nk}$ (the last thread).

Fig. 3. Representation of matrix-matrix multiplication

Hypotheses   1        6        11        21
M*M          9,454    10,277   21,488    15,897
M*V          12,106   79,170   142,068   267,492

Fig. 4. Comparison of the execution times of single and multiple testing of hypotheses (M*M - matrix-matrix multiplication, M*V - matrix-vector multiplication)

The next step in the program development was to design the architecture of a computer information system (CIS) that organizes remote access to the permutation test program. The CIS includes a medium for exchanging experience and bibliographical data on the subject, organization of efficient remote calculations, and delivery of the results to the user. The system must be well protected against unauthorized access and based on modern technologies. It can be represented as a set of the following components:

• Web server;

• Internet portal;

• Database.

The web server is responsible for the CIS operation logic, that is, for acquisition and processing of users' personal data and compilation of a configuration file to start the permutation test program on the GPU.

The Internet portal allows the users to work with the permutation test in the self-service mode. For this, the site has a system of authorization and registration of users.

The Django framework for the Python programming language was chosen for the web server because it is modular, cross-platform, and freely distributed. Django uses an object-oriented approach to working with the database, which greatly simplifies data storage and processing.

Using Python makes it possible to simplify the processing of user data as much as possible, in particular, the formation of the configuration file for running the permutation test.
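Forming such a configuration file takes only a few lines, as the following sketch suggests. The key=value layout and the key names iterations and permutations are assumptions; the paper states only that the file holds the number of iterations and the number of permutations per iteration.

```python
import os
import tempfile
from pathlib import Path


def write_config(path, iterations, permutations_per_iteration):
    """Write the two permutation test parameters as key=value lines.

    The file layout is hypothetical; only the two parameters come from the text.
    """
    if iterations <= 0 or permutations_per_iteration <= 0:
        raise ValueError("both parameters must be positive integers")
    text = f"iterations={iterations}\npermutations={permutations_per_iteration}\n"
    Path(path).write_text(text, encoding="utf-8")
    return text


def read_config(path):
    """Read the key=value file back into a dictionary of integers."""
    params = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        key, _, value = line.partition("=")
        params[key.strip()] = int(value)
    return params


config_path = os.path.join(tempfile.mkdtemp(), "job.cfg")
write_config(config_path, iterations=1000, permutations_per_iteration=50)
params = read_config(config_path)
```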

To store data, it was decided to use the SQLite database, which is available from Python through the built-in sqlite3 module. Since SQLite is not a separately running process but a library, function calls (API) serve as the exchange protocol, which reduces overhead and response time and simplifies the program. SQLite stores the entire database in one file, which simplifies working with transactions: during the execution of a transaction the database file is simply locked. The structure of the developed database is shown in Fig. 5.
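Transactional behavior can be demonstrated with the sqlite3 module directly. This is a sketch only: the table job and its columns are hypothetical and do not reproduce the schema of Fig. 5, and the portal itself works with the database through Django rather than through raw SQL.

```python
import sqlite3

# An in-memory database stands in for the single database file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job ("
    "id INTEGER PRIMARY KEY, "
    "username TEXT NOT NULL, "
    "iterations INTEGER NOT NULL)"
)


def add_jobs(conn, jobs):
    """Insert all jobs in one transaction: commit on success, roll back on error."""
    with conn:
        conn.executemany(
            "INSERT INTO job (username, iterations) VALUES (?, ?)", jobs
        )


add_jobs(conn, [("alice", 1000), ("bob", 500000)])

try:
    # The second row violates NOT NULL, so the whole batch is rolled back.
    add_jobs(conn, [("carol", 2000), ("dave", None)])
except sqlite3.IntegrityError:
    pass

job_count = conn.execute("SELECT COUNT(*) FROM job").fetchone()[0]
```

Using the connection as a context manager keeps each batch atomic: either every row of the batch is stored or none is.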

When a new user registers, a directory named after the user is created on the server to store the configuration and job files.

To increase the efficiency of CIS operation, the server chooses how the permutation test is executed. If the number of iterations in the configuration file exceeds 100000, the job is sent to the hybrid supercomputer; otherwise it is executed by the server itself.
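The selection rule reduces to a single comparison. In the sketch below the threshold of 100000 iterations comes from the text, while the function name and the returned labels are illustrative assumptions.

```python
SUPERCOMPUTER_THRESHOLD = 100_000  # iterations; value taken from the text


def choose_executor(iterations):
    """Return where a job should run, given its number of iterations."""
    if iterations < 0:
        raise ValueError("the number of iterations cannot be negative")
    # Strictly more than the threshold goes to the hybrid supercomputer.
    return "supercomputer" if iterations > SUPERCOMPUTER_THRESHOLD else "server"


small_job = choose_executor(100_000)
large_job = choose_executor(100_001)
```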

Providing wide access to the computational resources raises the question of security. The simplest preventive measure is to hide the link to job execution from unauthorized users.

To protect the forms of the Internet portal, the cross-site request forgery (CSRF) protection built into Django is used. A unique secret token is generated on the server when a form is initialized and is embedded in the form; when the form data are submitted, the token is sent back to the server for verification. If the submitted token matches the one issued by the server, the request came from the portal's own form and the data can be processed; otherwise the server returns an error.
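The idea behind such a check can be shown with a generic token comparison. This sketch is not Django's implementation, which additionally masks the token and handles cookies or sessions; the function names are assumptions, and only standard library calls are used.

```python
import hmac
import secrets


def issue_token():
    """Generate an unpredictable token to embed in a form."""
    return secrets.token_urlsafe(32)


def is_valid(issued_token, submitted_token):
    """Compare tokens in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(issued_token.encode(), submitted_token.encode())


token = issue_token()
accepted = is_valid(token, token)         # the form returned the issued token
rejected = is_valid(token, token + "x")   # a forged or altered token
```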

To increase the efficiency, the program algorithm of the system must be optimized. This can be achieved in different ways.


Fig. 5. Database structure

Fig. 6. Parallelization of permutation cycle iterations

Primary processing of the input data is performed in several stages: 1) reading the file with the initial data; 2) isolating the groups: FA set, gene identifier, and gene characteristic; 3) forming a matrix of FA membership in the genes and a matrix of measured gene characteristics. Stages 2 and 3 can be parallelized on the CPU: since the genes are independent, the isolation of groups and the formation of both matrices can be done independently for each gene.
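Because each gene record is handled on its own, the per-gene work maps naturally onto a pool of workers. The sketch below assumes a hypothetical tab-separated record (gene identifier, semicolon-separated FA identifiers, characteristic), which is not the exact syntax of the input file. It uses a thread pool for simplicity; in CPython, CPU-bound parsing would need concurrent.futures.ProcessPoolExecutor, which offers the same map interface, to obtain a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_record(line):
    """Split one gene record into (gene_id, FA list, characteristic).

    Assumed layout: gene_id <TAB> FA; FA; ... <TAB> value
    """
    gene_id, fa_field, value = line.split("\t")
    fas = [fa.strip() for fa in fa_field.split(";") if fa.strip()]
    return gene_id, fas, float(value)


def parse_all(lines, workers=4):
    """Parse records independently; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_record, lines))


lines = [
    "1010\tGO:0007067; GO:0008270;\t0.5",
    "1011\tGO:0005515;\t1.5",
]
records = parse_all(lines)
```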

Iterations of the permutation test are also independent of each other. Therefore, they can be performed simultaneously on several GPUs, which will provide a severalfold increase in the performance of the described system (Fig. 6).
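One way to organize this is to give every device its own share of the iterations and to add up the per-device counters afterwards. The sketch shows only this bookkeeping, with no GPU code; the function names and the idea of merging exceedance counters are assumptions rather than details from the paper.

```python
def split_iterations(total_iterations, n_devices):
    """Divide iterations among devices as evenly as possible."""
    if n_devices <= 0:
        raise ValueError("at least one device is required")
    base, extra = divmod(total_iterations, n_devices)
    # The first `extra` devices take one additional iteration each.
    return [base + (1 if d < extra else 0) for d in range(n_devices)]


def merge_counts(per_device_counts):
    """Sum per-FA counters collected independently on each device."""
    return [sum(counts) for counts in zip(*per_device_counts)]


shares = split_iterations(10, 3)
merged = merge_counts([[1, 2], [3, 4], [0, 1]])
```

Since the iterations are independent, the merged counters should match what a single device would accumulate over the same set of permutations, provided each device draws its permutations independently.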

CONCLUSIONS

Some results of the further development of the computer information technology of a permutation test for genetic data analysis on a hybrid supercomputer are presented. They include multiple testing of genetic hypotheses using matrix-matrix multiplication on graphics processors and an increase in calculation speed. The methods of parsing the input data file and representing it in main memory have been improved. The requirements on the software for its integration into the information system are described.

An information system has been developed that includes a database and a web server forming the basis for the operation of the Internet portal. Within the portal, mechanisms for user registration and authorization have been implemented. A forum has been created for communication within the portal and for feedback between users and developers. Access to the permutation test program through the created information system has been provided.

This work was supported by RFBR grant 16-37-00240mol_a.

REFERENCES

1. Ashburner M., Ball C.A., Blake J.A. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium // Nat Genet. 2000. 25. 25-29.

2. Boreskov A.V., Kharlamov A.A., Markovskiy N.D. et al. Parallel computations on the GPU. CUDA architecture and software model. Moscow: Publ. House of Moscow Univ., 2012. 336 p.

3. Efron B. Nontraditional methods of statistical analysis. Moscow: Finansy i statistika, 1988. 263 p.

4. Good P. Permutation, parametric and bootstrap tests of hypotheses. New York: Springer Verlag, 2005. 315 p.

5. Van Hemert J.L., Dickerson J.A. Monte Carlo randomization tests for large-scale abundance datasets on the GPU // Comput Methods Programs Biomed. 2011. 101. 80-86.

6. Yakimenko A.A., Gunbin K.V., Khairetdinov M.S. Search for the overrepresented gene characteristics: The experience of implementation of permutation tests using GPU // Optoelectronics, Instrumentation and Data Processing. 2014. 50. (1). 123-129.

7. Yakimenko A.A., Khairetdinov M.S., Avrorov S.A. Development of a parallel program to perform a permutation test with the use of GPU // Actual problems of electronic instrument engineering APEIE-2014: abstracts of 12th Int. Conf., Novosibirsk, 2-4 Oct. 2014. Novosibirsk: NSTU, 2014. 1. 723-727.

