Scientific researches of the Union of Scientists in Bulgaria-Plovdiv, series B. Natural Sciences and the Humanities, Vol. XVIII, ISSN 1311-9192 (Print), ISSN 2534-9376 (On-line), 2018.
MODELING OF QUANTITATIVE STRUCTURE-PROPERTY RELATIONSHIPS BY MEANS OF GROUP CONTRIBUTION METHODS
Ognyan Pukalov (a), Vesselina Paskaleva (a), Nikolay Kochev (a), Nina Jeliazkova (b)
(a) Faculty of Chemistry, University of Plovdiv "P. Hilendarski", 24 Tsar Assen Str., Plovdiv 4000, Bulgaria
(b) Ideaconsult Ltd, 4 A. Kanchev Str., Sofia 1000, Bulgaria
Abstract
We present models for the theoretical calculation of three physicochemical properties of high importance in the drug discovery process. Several group contribution models were developed for the prediction of the octanol-water partition coefficient (logP), heat of formation (Hf), and molar refractivity (MR). We used a prototype version of GCM (Group Contribution Module), software developed in-house as part of the Ambit software platform.
Each target property was calculated as a sum of individual increments assigned to specific fragments present in the molecule. Zero-order atomic additive schemes and first-order bond-based schemes were studied, in which the atom class definitions were refined by varying local atomic descriptors such as atom type, number of H atoms, hybridization, etc. Additionally, some global topological descriptors were used as correction factors. Group contribution values were calculated by linear regression analysis applied to training data sets with experimental values, consisting of 13097 (logP), 165 (MR) and 464 (Hf) organic compounds. All structures were represented topologically in SMILES linear notation. Different combinations of the chosen descriptors were studied. The models were tested and statistically validated, and the test results are presented and discussed.
Key words: group contribution method, additive scheme, atom/bond additive scheme, QSPR, octanol/water partition coefficient, logP, molar refractivity, heat of formation
Introduction
One of the main challenges of chemoinformatics is to create models that predict particular chemical properties for a broad and diverse set of chemical compounds. Efficient tools for this challenge are the so-called additive modeling methods, also known as group contribution methods (Benson, 1969; Kolska, 2012). The target property value P is obtained additively by summing the contributions of each fragment, which can be expressed by the following equation:
$P = \sum_{i} n_{\mathrm{Frag}(i)} I_{\mathrm{Frag}(i)}$ (1)
where $I_{\mathrm{Frag}(i)}$ is the increment (contribution) of fragment Frag(i) and $n_{\mathrm{Frag}(i)}$ is the number of occurrences of Frag(i) within a particular compound. Typically, the group contribution estimate from eq. (1) efficiently models properties that depend on short-range intramolecular interactions, where the range is determined by the size of the fragments {Frag(i)}. This approach can be extended with correction factors that account for longer-range intramolecular interactions by means of specific structural features such as intramolecular hydrogen bonds, atom pairs, global molecular descriptors, etc. In this case eq. (1) must be rewritten as follows:
$P = \sum_{i} n_{\mathrm{Frag}(i)} I_{\mathrm{Frag}(i)} + \sum_{j} n_{\mathrm{Cor}(j)} I_{\mathrm{Cor}(j)}$ (2)
where $I_{\mathrm{Cor}(j)}$ is the increment of the correction factor of type Cor(j) and $n_{\mathrm{Cor}(j)}$ is the number of occurrences of Cor(j) within a particular compound.
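As an illustration of eq. (2), the following minimal Python sketch sums fragment and correction contributions. The function name `predict_property`, the group names and all numeric values are hypothetical and chosen only to show the arithmetic; they are not increments from the GCM models.

```python
def predict_property(fragment_counts, fragment_increments,
                     correction_counts=None, correction_increments=None):
    """Evaluate eq. (2): sum of n*I over fragments plus sum of n*I over corrections."""
    total = sum(n * fragment_increments[frag]
                for frag, n in fragment_counts.items())
    if correction_counts:
        total += sum(n * correction_increments[cor]
                     for cor, n in correction_counts.items())
    return total


# Hypothetical increments and occurrence counts (illustration only).
increments = {"CH3": 1.5, "CH2": 1.0, "OH": -0.5}
counts = {"CH3": 1, "CH2": 2, "OH": 1}
correction_increments = {"intramolecular_H_bond": 0.25}
correction_counts = {"intramolecular_H_bond": 1}

without_corrections = predict_property(counts, increments)           # eq. (1)
with_corrections = predict_property(counts, increments,
                                    correction_counts,
                                    correction_increments)           # eq. (2)
print(without_corrections, with_corrections)
```

In this sketch a fragment that has no increment raises a `KeyError`, which makes it visible when a molecule falls outside the set of groups the model was trained on.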
Based on equation (2), we have developed an in-house software system, GCM, for property prediction. It is described in the next section, followed by the results of test cases modeling three important physicochemical properties.
Model creation and software used
GCM (Group Contribution Module) was developed in our group and is based on the CDK library (Steinbeck, 2003). The software module is integrated within the open-source chemoinformatics platform Ambit (Jeliazkova, 2011). GCM provides an efficient environment for predicting molecular properties using additive schemes of zero or higher order. The first prototype of GCM accepts several standard molecular input formats (SMILES, InChI, MOL files), uses local and global descriptors, and includes filters for removing linear and collinear descriptors. The module calculates increment values for the additive scheme by linear regression analysis applied to a given training data set. GCM also supports correction factors and external molecular descriptors in order to account for additional specific interactions in the molecule, such as hydrogen bonds and long-range atom-atom interactions. The architecture of the GCM module is shown in figure 1.
Figure 1. GCM architecture and chemoinformatics flow chart for group contribution based QSPR.
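The regression step that turns a training set into increments can be sketched as an ordinary least-squares fit on a matrix of group occurrence counts. The sketch below is pure Python and solves the normal equations by Gaussian elimination; this solver choice, the function name `fit_increments` and the synthetic data are assumptions made for illustration and do not describe the actual GCM implementation.

```python
def fit_increments(count_rows, targets):
    """Least-squares increments b solving (X^T X) b = X^T y.

    count_rows: one row of group occurrence counts per training compound.
    targets:    the experimental property value of each compound.
    """
    n_groups = len(count_rows[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = []
    for a in range(n_groups):
        row = [sum(r[a] * r[b] for r in count_rows) for b in range(n_groups)]
        row.append(sum(r[a] * y for r, y in zip(count_rows, targets)))
        aug.append(row)
    # Forward elimination with partial pivoting.
    for col in range(n_groups):
        pivot = max(range(col, n_groups), key=lambda i: abs(aug[i][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("collinear group counts: increments are not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n_groups):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, n_groups + 1):
                aug[i][j] -= factor * aug[col][j]
    # Back substitution.
    solution = [0.0] * n_groups
    for i in reversed(range(n_groups)):
        rest = sum(aug[i][j] * solution[j] for j in range(i + 1, n_groups))
        solution[i] = (aug[i][n_groups] - rest) / aug[i][i]
    return solution


# Synthetic training set: columns are counts of two hypothetical groups
# (CH3, CH2); targets were generated from increments 1.0 and 2.0.
counts = [(2, 0), (2, 1), (2, 2), (2, 3)]
targets = [2.0, 4.0, 6.0, 8.0]
fitted = fit_increments(counts, targets)
print(fitted)
```

The `ValueError` branch shows why a filter for collinear descriptors is useful: when two count columns are proportional, the normal equations are singular and no unique set of increments exists.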
Use Cases for Group Contribution Modeling
Three important physicochemical properties were considered to evaluate the performance of the GCM software: 1) n-octanol/water partition coefficient (logP); 2) heat of formation (Hf); 3) molar refractivity (MR). Group contribution models were created for each target property. Both atom-based and bond-based additive schemes were studied. The atoms were described by different combinations of the following local descriptors: atom type (A), atom hybridization (Hyb), atom valence (Val), number of heavy (non-hydrogen) neighbors (HeN), number of hydrogen atoms (H), and formal charge (FC). An example of atom group coding is shown in figure 2. The increments for the fragments of different types were calculated by linear regression analysis applied to a given set of compounds. Additionally, a set of 1378 global 0D-2D descriptors was tested as correction factors and their influence was evaluated. These external descriptors were calculated using Dragon 7 software (Kode srl, 2017). For the calculated descriptors, variable selection was performed using genetic-algorithm-based and principal component analysis methods implemented in Weka software (Frank, 2016).
[Figure 2 image: structure of furaneol with atom group labels such as O<0,2,1> and O<1,3,1>.]
Figure 2. Coding of different atomic groups for the structure of furaneol by means of local atomic descriptor configuration: A<H,Hyb,HeN>.
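The atom group coding can be sketched as a small function that formats the chosen local descriptors into a label. The `Atom` class, the `atom_class` function and the numeric coding of hybridization (2 for sp2, 3 for sp3, inferred from the labels in the figures) are assumptions for illustration, and the atoms are entered by hand instead of being perceived from a structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    symbol: str             # atom type (A)
    h_count: int            # number of attached hydrogen atoms (H)
    hybridization: int      # Hyb, assumed coding: 2 = sp2, 3 = sp3
    heavy_neighbors: int    # number of non-hydrogen neighbors (HeN)
    formal_charge: int = 0  # FC
    valence: int = 0        # Val


# Maps the descriptor abbreviations used in the text to Atom fields.
FIELD_OF = {
    "H": "h_count",
    "Hyb": "hybridization",
    "HeN": "heavy_neighbors",
    "FC": "formal_charge",
    "Val": "valence",
}


def atom_class(atom, config):
    """Return a label such as O<0,2,1> for a configuration like ("H", "Hyb", "HeN")."""
    values = ",".join(str(getattr(atom, FIELD_OF[name])) for name in config)
    return f"{atom.symbol}<{values}>"


# Two oxygen atoms described by hand: a carbonyl O and a hydroxyl O.
carbonyl_o = Atom("O", h_count=0, hybridization=2, heavy_neighbors=1)
hydroxyl_o = Atom("O", h_count=1, hybridization=3, heavy_neighbors=1)

fine = [atom_class(a, ("H", "Hyb", "HeN")) for a in (carbonyl_o, hydroxyl_o)]
coarse = [atom_class(a, ("H",)) for a in (carbonyl_o, hydroxyl_o)]
print(fine)    # ['O<0,2,1>', 'O<1,3,1>']
print(coarse)  # ['O<0>', 'O<1>']
```

Changing the configuration tuple changes how finely atoms are split into classes, which is the mechanism behind the different descriptor combinations compared in the models.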
Results and discussion
The statistical results for the six best models are summarized in Table 1. The models were validated by leave-one-out (LOO) cross-validation, a Y-scrambling procedure with 1000 iterations (YS1000), and tests on the training data sets.
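The two validation procedures can be sketched on a toy one-descriptor model. This is a simplified illustration only: the real models are multi-descriptor regressions, the data below are hypothetical, and the formula Q2 = 1 - PRESS/SS_tot as well as the use of the mean R2 over the scrambled runs are common conventions assumed here, not definitions taken from the paper.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one descriptor."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def r_squared(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def loo_q2(xs, ys):
    """Leave-one-out: refit without compound i, then predict compound i."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return r_squared(ys, preds)


def y_scrambling_r2(xs, ys, iterations=1000, seed=0):
    """Mean R2 of models refitted to randomly permuted property values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        slope, intercept = fit_line(xs, shuffled)
        scores.append(r_squared(shuffled, [slope * x + intercept for x in xs]))
    return mean(scores)


# Hypothetical descriptor and property values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

slope, intercept = fit_line(xs, ys)
r2 = r_squared(ys, [slope * x + intercept for x in xs])
q2 = loo_q2(xs, ys)
scrambled_r2 = y_scrambling_r2(xs, ys)
print(r2, q2, scrambled_r2)
```

A model that captures a real relationship should keep Q2 close to the training R2, while the scrambled score should stay far below both.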
To build the atom-based molar refractivity model (MRa), we used 15 different atomic classes (groups) defined by the local descriptor configuration A<H,HeN,Hyb,FC,Val>. Atom types for the bond-based model (MRb) were configured without any information about the atom neighbors. The atom-based model for heat of formation (Hfa) was built using 9 atomic classes described by the local atom properties A<H,Hyb,Val> and 5 additional descriptors: nDB, H%, nCsp, nR07 and D/Dtr03. The bond-based Hfb model was built with 23 descriptors (atoms within bond groups were described with the configuration A<H>, together with the chosen external descriptors nDB, nR07, H%, nCsp). The bond-based model performs slightly better than the atom-based model Hfa in leave-one-out cross-validation (RMSE 58.871 vs. 64.56), although its training statistics are marginally lower. The atom-based logP model (LogPa) was built with 43 descriptors, of which 9 are external descriptors and the rest are atomic groups defined by the A<H> configuration. Its accuracy is poor even for the best atom-based model (training R2 = 0.693). Applying the higher-order, bond-based scheme (LogPb) gave much better statistical results (training R2 = 0.847), while the number of descriptors more than doubled (from 43 to 94). In this model atom types were described in the form A<H,FC> and no additional external descriptors were used. Detailed information about the group contributions (increments) for the six models in Table 1 can be found in the Zenodo repository: https://doi.org/10.5281/zenodo.1066277.
Table 1. Statistical characteristics of the created models.
Model Training YS1000 LOO
Nd R2 RMSE MAE R2 Rc2 Q2 RMSE MAE
MRa 15 0.991 0.618 0.235 0.085 0.989 0.989 0.989 0.698 0.262
MRb 7 0.955 1.41 1.03 -0.81 0.973 0.959 0.953 1.451 1.067
Hfa 14 0.931 52.229 28.541 0.026 0.895 0.895 0.893 64.56 32.959
Hfb 23 0.928 53.56 33.163 0.047 0.913 0.912 0.913 58.871 35.598
LogPa 43 0.693 1.081 0.788 0.001 0.691 0.669 0.691 1.086 0.791
LogPb 94 0.847 0.763 0.560 -0.08 0.845 0.842 0.844 0.770 0.565
Visual comparisons of the experimental vs. predicted property values for the atom-based (A) and bond-based (B) group contribution models for MR, Hf and logP are given in figures 3, 4 and 5, respectively.
Figure 3. Experimental vs. predicted values for MR applying (A) atom group contribution scheme with local atom descriptors A<H,HeN,Hyb,FC,Val> and (B) bond-based group contribution scheme.
Figure 4. Experimental vs. predicted values for Hf applying (A) atom group contribution scheme with local descriptors A<H,Hyb,Val> and (B) bond-based group contribution scheme.
Figure 5. Experimental vs. predicted values for logP applying (A) atom group contribution scheme with local atom descriptors A<H> and (B) bond-based scheme with descriptors A<H,FC>.
The distributions of the error values for the MR, Hf and logP models are shown in figures 6, 7 and 8, respectively. For the MR model, the bond-based scheme, whose local atom descriptors include no information about neighboring atoms, leads to worse predictions (see figure 6).
[Figure 6 image: histograms of MR error, panels A) and B).]
Figure 6. Distribution of the error values for MR atom group contribution model and bond-based group contribution model.
[Figure 7 image: histograms of Hf error, panels A) and B).]
Figure 7. Distribution of the error values for Hf atom group contribution model and bond-based group contribution model.
[Figure 8 image: histograms of logP error, panels A) and B).]
Figure 8. Distribution of the error values for logP atom-based group contribution model and bond-based group contribution model.
The prediction errors of the atom-based logP model lie in the range -3.9 to +3.9, except for 77 molecules with even larger errors. With the bond-based group contribution model, the error range narrows to (-2.5, +2.6), with only 37 chemical objects exhibiting errors outside this range.
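Error distributions and out-of-range counts of this kind can be produced with a few lines of Python. The sketch below uses hypothetical experimental and predicted values, and it assumes the error is defined as predicted minus experimental, a sign convention the text does not state.

```python
import math
from collections import Counter


def error_histogram(experimental, predicted, bin_width):
    """Return the errors and a histogram keyed by the lower edge of each bin."""
    errors = [p - e for e, p in zip(experimental, predicted)]
    bins = Counter(math.floor(err / bin_width) * bin_width for err in errors)
    return errors, dict(sorted(bins.items()))


def count_outside(errors, low, high):
    """Number of compounds whose error falls outside [low, high]."""
    return sum(1 for err in errors if not (low <= err <= high))


# Hypothetical values, for illustration only.
experimental = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.0, 3.0, 8.0]

errors, histogram = error_histogram(experimental, predicted, bin_width=1.0)
outliers = count_outside(errors, -3.9, 3.9)
print(histogram)  # {-1.0: 1, 0.0: 2, 4.0: 1}
print(outliers)   # 1
```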
Figure 9 shows the application of the atomic group contribution model to predict the molar refractivity of guaiacol, where the following equation is applied for the final MR calculation (see the guaiacol fragmentation in figure 9):
$MR_{atomic}(\mathrm{guaiacol}) = I_{C<0,3,2,0,4>} N_{C<0,3,2,0,4>} + I_{C<1,2,2,0,4>} N_{C<1,2,2,0,4>} + I_{C<3,1,3,0,4>} N_{C<3,1,3,0,4>} + I_{O<1,1,3,0,2>} N_{O<1,1,3,0,2>} + I_{O<0,2,3,0,2>} N_{O<0,2,3,0,2>} = 2 I_{C<0,3,2,0,4>} + 4 I_{C<1,2,2,0,4>} + I_{C<3,1,3,0,4>} + I_{O<1,1,3,0,2>} + I_{O<0,2,3,0,2>} = 34.58\ \mathrm{cm^3/mol}$
Out of the 15 local atom groups, only 5 are found in the molecule of guaiacol (3-Methoxyphenol). The predicted value is very close to the value theoretically predicted by ACDLabs software (http://www.chemspider.com/Chemical-Structure.8657.html).
[Figure 9 image: guaiacol structure with each heavy atom labeled by its atomic group.]
Group (x)         C<0,3,2,0,4>   C<1,2,2,0,4>   C<3,1,3,0,4>   O<1,1,3,0,2>   O<0,2,3,0,2>
Increment (Ix)    3.33           4.49           5.73           2.53           1.70
Occurrence (Nx)   2              4              1              1              1
Figure 9. Application of atomic additive scheme for prediction of molar refractivity of guaiacol.
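The calculation in Figure 9 can be reproduced with a short Python sketch: the heavy atoms of the molecule are listed by their group label, the occurrences are counted, and the increments from the figure are summed. The list of atom labels is entered by hand here (an assumption standing in for the automatic fragmentation done by GCM); the increments are those given in Figure 9.

```python
from collections import Counter

# Increments (Ix) from Figure 9, in cm^3/mol.
increments = {
    "C<0,3,2,0,4>": 3.33,
    "C<1,2,2,0,4>": 4.49,
    "C<3,1,3,0,4>": 5.73,
    "O<1,1,3,0,2>": 2.53,
    "O<0,2,3,0,2>": 1.70,
}

# Heavy atoms of the molecule, labeled as A<H,HeN,Hyb,FC,Val> (entered by hand).
atom_labels = (
    ["C<0,3,2,0,4>"] * 2      # substituted aromatic carbons
    + ["C<1,2,2,0,4>"] * 4    # aromatic CH carbons
    + ["C<3,1,3,0,4>"]        # methyl carbon
    + ["O<1,1,3,0,2>"]        # hydroxyl oxygen
    + ["O<0,2,3,0,2>"]        # ether oxygen
)

occurrences = Counter(atom_labels)
mr_atomic = sum(increments[group] * n for group, n in occurrences.items())
print(round(mr_atomic, 2))  # 34.58
```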
Figure 10 shows the bond-based fragmentation of the guaiacol molecule, which contains 3 of the 7 model groups. The theoretical calculation of molar refractivity in this case is as follows:
$MR_{bond}(\mathrm{guaiacol}) = I_{C-O} N_{C-O} + I_{C=C} N_{C=C} + I_{C-C} N_{C-C} = 3 I_{C-O} + 3 I_{C=C} + 3 I_{C-C} = 50.37\ \mathrm{cm^3/mol}$
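The bond-based sum can be checked the same way. In the sketch below the bonds between heavy atoms are listed by hand in Kekule form (three C=C and three C-C ring bonds, an assumption inferred from the occurrence counts in Figure 10), bonds to hydrogen are left out, and the increments are the values listed in Figure 10. The helper `bond_class` is hypothetical and only normalizes the label so that O-C and C-O count as the same group.

```python
from collections import Counter

BOND_SYMBOL = {1: "-", 2: "="}

# Increments (Ix) from Figure 10, in cm^3/mol.
bond_increments = {"C-O": 4.46, "C=C": 6.87, "C-C": 5.46}


def bond_class(symbol_a, symbol_b, order):
    """Label a bond by its sorted atom types and bond order, e.g. C-O."""
    first, second = sorted((symbol_a, symbol_b))
    return f"{first}{BOND_SYMBOL[order]}{second}"


# Bonds between heavy atoms, entered by hand (Kekule form of the ring).
bonds = (
    [("C", "C", 2), ("C", "C", 1)] * 3   # aromatic ring
    + [("C", "O", 1)]                    # ring carbon to hydroxyl oxygen
    + [("C", "O", 1)]                    # ring carbon to ether oxygen
    + [("O", "C", 1)]                    # ether oxygen to methyl carbon
)

bond_counts = Counter(bond_class(a, b, order) for a, b, order in bonds)
mr_bond = sum(bond_increments[group] * n for group, n in bond_counts.items())
print(round(mr_bond, 2))  # 50.37
```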
[Figure 10 image: guaiacol structure with the bond groups C=C, C-C and C-O marked.]
Group (x)         C-O    C=C    C-C
Increment (Ix)    4.46   6.87   5.46
Occurrence (Nx)   3      3      3
Figure 10. Application of bond-based scheme for prediction of molar refractivity of guaiacol.
Conclusions
The results show that the GCM software module can be used successfully for theoretical prediction of physicochemical properties of organic compounds using the group contribution approach. The models created for molar refractivity (MR), heat of formation (Hf) and partition coefficient (logP) contain diverse chemical groups and cover a large part of the chemical space. They can therefore be applied in various chemoinformatics tasks to a wide and diverse range of organic compounds.
Acknowledgements
We would like to thank the Plovdiv University Scientific Fund (project МУ17-ХФ-027) for supporting this scientific work.
References:
Benson S. W., Cruickshank F. R., Golden D. M., Haugen G. R., O'Neal H. E., Rodgers A. S., Shaw R., and Walsh R., Additivity rules for the estimation of thermochemical properties. Chem. Rev., 69(3):279-324, 1969
Kolska Z., Zabransky M., and Randova A., Group Contribution Methods for Estimation of Selected Physico-Chemical Properties of Organic Compounds, Chapter 6 in Thermodynamics -Fundamentals and Its Application in Science, Ricardo Morales-Rodriguez (Editor), InTech, 2012
Steinbeck Ch., Han Y., Kuhn S., Horlacher O., Luttmann E., and Willighagen E., The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics, J. Chem. Inf. Comput. Sci., 43(2):493-500, 2003
Jeliazkova N., Jeliazkov V., AMBIT RESTful web services: an implementation of the OpenTox application programming interface, J. Cheminform., 3:18, 2011
Kode srl, Dragon version 7.0.8 (software for molecular descriptor calculation), 2017, https://chm.kode-solutions.net
Frank E., Hall M. A., and Witten I. H., The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques", Fourth Edition, Morgan Kaufmann, 2016