Scientific article on the topic 'Chemical information processing for QSAR/QSPR modeling'


CC BY
Keywords
CHEMOINFORMATICS ALGORITHMS / (QSAR/QSPR) / MATHEMATICAL TOOLS

Abstract of the scientific article in chemical sciences; authors: Kochev Nikolay, Paskaleva Vesselina, Jeliazkova Nina

We present a combination of chemoinformatics algorithms for the efficient processing of structural data used in Quantitative Structure-Activity/Property Relationships (QSAR/QSPR). QSAR/QSPR models use mathematical tools to correlate structural descriptors with biological activities (or other properties). The quality of the molecular descriptors is crucial for the performance of the resulting QSAR models, and the descriptor values depend on the preceding procedures for manipulating the structural information. We applied several chemoinformatics algorithms: substructure searching based on queries written in SMARTS linear notation, structure fingerprint calculation with the similarity principle applied on top of the fingerprints, and automatic generation of all possible tautomers. In this work we study the influence of tautomer information on the performance of these algorithms, the resulting molecular descriptors and the final QSAR/QSPR models.


Text of the scientific paper "Chemical information processing for QSAR/QSPR modeling"

Scientific research of the Union of Scientists in Bulgaria-Plovdiv, Series C. Natural Sciences and Humanities, Vol. XVI, ISSN 1311-9192, Union of Scientists, International Conference of Young Scientists, 13-15 June 2013, Plovdiv.

CHEMICAL INFORMATION PROCESSING FOR QSAR/QSPR MODELING

Nikolay Kochev (1), Vesselina Paskaleva (1), Nina Jeliazkova (2)

(1) University of Plovdiv, Department of Analytical Chemistry and Computer Chemistry; e-mail: nick@uni-plovdiv.net, vessy@uni-plovdiv.net

(2) Ideaconsult Ltd, 4 A. Kanchev str., Sofia 1000, Bulgaria; e-mail: jeliazkova.nina@gmail.com


Introduction

QSAR/QSPR modeling is widely used in medicinal chemistry and the pharmaceutical industry, as well as for preliminary testing in the regulation of chemical substances. High-quality QSAR models can be obtained only through the efficient application of classical and modern chemoinformatics tools [1]. Typically these tools carry out the structural information transition data → information → knowledge (the resulting models are regarded as formalized knowledge used for solving particular problems in chemistry). In this work we studied the influence of tautomer information on the major stages of this information transition. We also tested the influence of the generated tautomers on a QSAR model of Ames mutagenicity and on the XLogP QSPR model of lipophilicity.

Chemoinformatics strategy for QSAR/QSPR modeling

In this work we propose a strategy for QSAR/QSPR modeling (see Figure 1) based on a combination of several efficient algorithms for processing structural information. All software components used are open source, and the most critical tools were developed in our group. A molecule can be entered into the system via standard chemical formats such as SMILES [2], InChI [3], MOL/SDF files [4] and CML [5]. The internal structure representation, input, output and information processing are based on the CDK library [6]. 2D structure diagrams are generated with the CDK 2D generator. 3D structure generation is performed with the OpenBabel [7] software or other open source solutions. At this stage all tautomeric forms of the target molecule are generated using the incremental algorithm of AMBIT-TAUTOMER [8]. Ambit-Tautomer is part of the open source software package Ambit2 [9, 10] and implements efficient algorithms for automatic generation of all tautomeric forms of a given compound. The resulting tautomers are further used for fingerprint calculation and descriptor calculation as well as for similarity search (Figure 1).

[Figure 1: flow chart, only the box labels survive extraction. Structure input (methimazole, C1=CN(C(N1)=S)C; formats SMILES, InChI, *.mol, CML); CDK representation (connection table, CDK container); generate tautomers; generate 3D; fingerprints (bit vectors: hashed fingerprint, key-based fingerprint); calculation of 1D, 2D, 3D molecular descriptors (group counts, additive schemes; example values NA = 13, Z = 32, NH = 6, W = 40, MW = 114.03, ATSc1 = 0.14); similarity search in a chemical DB (list of most similar structures); QSPR models of physicochemical properties (LogP, BP, MP, MR, ...); QSAR models of biological activities (ADME, toxicity, mutagenicity, biodegradation, ...).]

Figure 1. Flow chart of a chemoinformatics strategy for QSAR/QSPR modeling.

AMBIT-SMARTS [11] is an efficient algorithm for substructure searching which is used for the calculation of fingerprints, group counts and some of the molecular descriptors. The molecular descriptors and fingerprints are calculated by PaDEL-Descriptor v.2.17 [12], an open source software based on the CDK library and the AMBIT-SMARTS package. The final QSAR/QSPR models are obtained using the data mining software Weka v.3.7.9 [13].
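The two fingerprint types named in Figure 1 (hashed and key-based) can be illustrated with a minimal sketch. It assumes that the structural fragments have already been found (for example by path enumeration or substructure matching) and are given as plain strings. The fragment strings, the key list and the use of CRC32 as the hash are hypothetical choices made for illustration; they are not the CDK, PubChem or PaDEL implementations.

```python
import zlib


def hashed_fingerprint(fragments, n_bits=1024):
    """Hashed fingerprint: each fragment sets the bit at crc32(fragment) mod n_bits."""
    bits = [0] * n_bits
    for frag in fragments:
        position = zlib.crc32(frag.encode("utf-8")) % n_bits
        bits[position] = 1  # different fragments may collide on the same bit
    return bits


def key_based_fingerprint(fragments, keys):
    """Key-based fingerprint: bit i is 1 if the i-th predefined key is present."""
    present = set(fragments)
    return [1 if key in present else 0 for key in keys]


# Hypothetical fragment strings for one structure (illustration only).
fragments = ["C=C", "C-N", "C=S", "N-C"]
# Hypothetical predefined key list (illustration only).
keys = ["C=S", "C=O", "C-N", "O-H"]

hashed = hashed_fingerprint(fragments, n_bits=64)
keyed = key_based_fingerprint(fragments, keys)
print(sum(hashed), "bits set in the hashed fingerprint")
print(keyed)
```

In the hashed variant the meaning of a bit is not fixed in advance and collisions are possible, whereas in the key-based variant every bit position corresponds to one predefined substructure.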

Results and Discussion

The similarity search was performed in the Ambit database [9, 10] (approximately 5.6 million compounds). As target structures we used the three generated tautomeric forms of methimazole. Table 1 clearly shows that the similarity search results are strongly influenced by the tautomeric form used as the target structure.

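Table 1. Similarity search results for three tautomers of methimazole

Columns: Structure; Hits obtained from similarity search in the Ambit2 database

[The table body consisted of structure diagrams and is not recoverable from the extracted text. Legible similarity scores: 0.58, 0.47, 0.45, 0.44, 0.44; three further scores are garbled (likely 0.62, 0.60, 0.59).]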
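A similarity search of this kind ranks database structures by comparing their fingerprints with the fingerprint of the target. The sketch below assumes the Tanimoto coefficient as the similarity measure, which is a common choice for bit-vector fingerprints; the text does not state which measure the Ambit database uses. The 8-bit fingerprints and the database entries are toy values for illustration only.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length bit vectors: |A and B| / |A or B|."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Convention chosen here: two all-zero fingerprints get similarity 0.0.
    return both / either if either else 0.0


def similarity_search(target_fp, database, threshold):
    """Return (name, score) pairs with score >= threshold, best hits first."""
    scored = ((name, tanimoto(target_fp, fp)) for name, fp in database.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Toy data (illustration only): one target tautomer and three database entries.
target = [1, 1, 0, 1, 0, 0, 1, 0]
database = {
    "A": [1, 1, 0, 1, 0, 0, 1, 0],
    "B": [1, 1, 0, 0, 0, 0, 1, 1],
    "C": [0, 0, 1, 0, 1, 1, 0, 0],
}

hits = similarity_search(target, database, threshold=0.5)
for name, score in hits:
    print(name, round(score, 2))
```

Because each tautomer of a compound generally has a different fingerprint, running the same search with another tautomer as the target can return a different hit list, which is the effect reported in Table 1.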

We also studied the influence of tautomer information on the process of fingerprint and descriptor calculation. Three molecules with practical applications (methimazole, violuric acid and pemoline) were chosen to test two groups of fingerprints: CDK Fingerprinter (1024 bits) and PubChem fingerprints (881 bits). Table 2 shows the number of fingerprint bits whose values were altered for at least one of the tautomeric forms of the corresponding test compound. For example, the tautomers of the pemoline molecule alter 65% of the CDK FP bits (666 out of 1024).

Table 2. The number of fingerprint bits altered by the tautomeric forms

Structure        CDK FP (1024)   PubChem (881)
methimazole      145             56
violuric acid    545             106
pemoline         666             132

Table 3. The number of PaDEL descriptors which have RSD greater than the threshold value

RSD threshold    methimazole     violuric acid   pemoline
10 %             180             217             239
30 %             124             151             168
50 %             99              108             138
100 %            71              80              113
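The count reported in Table 2 can be sketched as follows: a bit position is counted as altered when the tautomers of one compound do not all share the same value at that position. The function name and the 8-bit fingerprints are hypothetical and serve only as an illustration; real input would be the 1024-bit or 881-bit fingerprints of every generated tautomer.

```python
def count_altered_bits(tautomer_fps):
    """Count bit positions whose value differs for at least one tautomer."""
    n_bits = len(tautomer_fps[0])
    if any(len(fp) != n_bits for fp in tautomer_fps):
        raise ValueError("all fingerprints must have the same length")
    # zip(*...) walks the fingerprints column by column (one column per bit).
    return sum(1 for column in zip(*tautomer_fps) if len(set(column)) > 1)


# Toy fingerprints of three tautomers of one compound (illustration only).
tautomer_fps = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 0, 1, 0],
]

altered = count_altered_bits(tautomer_fps)
percent_altered = 100.0 * altered / len(tautomer_fps[0])
print(altered, "bits altered,", percent_altered, "%")

# The pemoline figure quoted in the text: 666 of 1024 CDK bits.
print(round(100.0 * 666 / 1024), "%")
```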

Similarly, a strong influence of the tautomer information on the molecular descriptor values is observed. We calculated 863 molecular descriptors (1D, 2D and 3D) for all tautomeric forms of the test compounds. For each descriptor, the relative standard deviation (RSD) due to tautomerism was determined. Table 3 shows the number of descriptors which exhibited an RSD greater than a particular threshold (10%, 30%, 50% and 100%). The variances of the molecular descriptor values are statistically significant, which means that one can expect a strong influence of the tautomer information on the final QSAR/QSPR models. We therefore also studied how tautomer information alters the results of two QSAR/QSPR models for violuric acid. We applied an Ames mutagenicity QSAR model that we developed on the basis of data for 6512 compounds [14]. Violuric acid has 15 tautomers, of which 2 were classified as non-mutagenic and 13 as mutagenic. The values of XLogP [15] varied in the range (-1.26, 1.23) with RSD = 16%. Both models showed that some of the tautomers drastically change their predicted properties (mutagenic/non-mutagenic and lipophilic/non-lipophilic).
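The RSD screening summarized in Table 3 can be sketched as below. The sketch assumes RSD = 100 * s / |mean|, where s is the sample standard deviation of one descriptor over all tautomers of a compound; the text does not specify whether the sample or the population standard deviation was used, nor how a zero mean was handled. The descriptor names and values are hypothetical.

```python
import statistics


def relative_std_dev(values):
    """RSD in percent: 100 * sample standard deviation / |mean|."""
    mean = statistics.mean(values)
    if mean == 0:
        # Assumption: undefined RSD is reported as infinite unless all values are 0.
        return float("inf") if any(v != 0 for v in values) else 0.0
    return 100.0 * statistics.stdev(values) / abs(mean)


def count_above_thresholds(descriptor_table, thresholds):
    """For each threshold, count descriptors whose RSD exceeds it."""
    rsd_values = [relative_std_dev(vals) for vals in descriptor_table.values()]
    return {t: sum(1 for rsd in rsd_values if rsd > t) for t in thresholds}


# Hypothetical descriptor values, one list per descriptor, one entry per tautomer.
descriptor_table = {
    "desc_a": [10.0, 10.0, 10.0],  # unaffected by tautomerism, RSD = 0 %
    "desc_b": [2.0, 3.0, 4.0],     # RSD of about 33 %
    "desc_c": [-1.0, 1.0, 3.0],    # RSD = 200 %
}

counts = count_above_thresholds(descriptor_table, thresholds=[10, 30, 50, 100])
print(counts)
```

Applied to the full set of 863 descriptors, the same counting yields one row of Table 3 per threshold for each compound.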

Acknowledgement

This work is supported by the Bulgarian National Fund for Scientific Research NFNI (project IO7/1).

References

[1] J. Gasteiger, in Chemoinformatics (Ed: Th. Engel), Wiley-VCH, Weinheim, 2003, ch. 1, pp. 291-318.

[2] D. Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, J. Chem. Inf. Comput. Sci., 28(1): 31-36, 1988

[3] http://www.iupac.org/home/publications/e-resources/inchi.html.

[4] A. Dalby, J. Nourse, W. Hounshell, A. Gushurst, D. Grier, B. Leland, J. Laufer, Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited, J. Chem. Inf. Comput. Sci., 32(3): 244-255, 1992

[5] P. Murray-Rust, H. Rzepa, CML: Evolution and Design, J. Cheminf., 3: 44, 2011

[6] C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann, E. Willighagen, The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics, J. Chem. Inf. Comput. Sci., 43: 493-500, 2003

[7] OpenBabel, http://openbabel.org, accessed June 01, 2013

[8] Kochev, N. T., Paskaleva, V. H. and Jeliazkova, N., Ambit-Tautomer: An Open Source Tool for Tautomer Generation. Mol. Inf., 32: 481-504, 2013

[9] N. Jeliazkova, J. Jaworska, A. P. Worth, Open Source Tools for Read-Across and Category Formation, in In Silico Toxicology: Principles and Applications (Issues in Toxicology) (Ed: Mark Cronin), Royal Society of Chemistry, London, pp. 408-443, 2010

[10] http://ambit.sourceforge.net/, accessed 01 June 2013

[11] N. Jeliazkova, N. Kochev, AMBIT-SMARTS: Efficient Searching of Chemical Structures and Fragments, Mol. Inf., 30: 707-720, 2011

[12] C. W. Yap, PaDEL-Descriptor: An open source software to calculate molecular descriptors and fingerprints, J. Comput. Chem., 32(7): 1466-1474, 2011

[13] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten, The WEKA Data Mining Software: An Update, SIGKDD Explorations, Volume 11, Issue 1, 2009

[14] Katja Hansen, Sebastian Mika, Timon Schroeter, Andreas Sutter, Antonius ter Laak, Thomas Steger-Hartmann, Nikolaus Heinrich and Klaus-Robert Müller, Benchmark Data Set for in Silico Prediction of Ames Mutagenicity, J. Chem. Inf. Model., 49: 2077-2081, 2009

[15] Renxiao Wang, Ying Gao and Luhua Lai, Calculating partition coefficient by atom-additive method, Perspectives in Drug Discovery and Design, 19: 47-66, 2000
