Научная статья на тему 'Application of group contribution methods for prediction of heats of formation'

Application of group contribution methods for prediction of heats of formation Текст научной статьи по специальности «Медицинские технологии»

CC BY
153
18
i Надоели баннеры? Вы всегда можете отключить рекламу.
Ключевые слова
GROUP CONTRIBUTION METHOD / ADDITIVE SCHEME / HEAT OF FORMATION / QSPR / SOFTWARE MODELLING

Аннотация научной статьи по медицинским технологиям, автор научной работы — Paskaleva Vesselina, Pukalov Ognyan, Kochev Nikolay

We present a set of group contribution models for predicting heat of formation of organiccompounds. A dataset containing 1004 molecular structures from DIPPR database was split into a learning set and a test set further used for model training and validation. The model building workflow was performed with an in-house developed software Ambit-GCM. The fully customizable and flexible features of the software tool allow execution of exhaustive data exploratory analyses in the context of QSPR/QSAR modelling. A set of preliminary models were build according to various fragmentation schemes, with and without use of correction factors and external descriptors. Different orders of additive schemes were studied. Every model in the set was validated using leave-one-out procedure and Y-scrambling technique as well as model performances were tested using the external dataset. Best models were chosen as final model set and corresponding statistical characteristics are presented and discussed. The best model is available as a JSON at https://doi.org/10.5281/zenodo.1494281 file and can be used for theoretical prediction of heat of formation.

i Надоели баннеры? Вы всегда можете отключить рекламу.
iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.
i Надоели баннеры? Вы всегда можете отключить рекламу.

Текст научной работы на тему «Application of group contribution methods for prediction of heats of formation»

Научни трудове на Съюза на учените в България - Пловдив. Серия В. Техника и технологии. Том XVII, ISSN 1311 -9419 (Print); ISSN 2534-9384 (Online), 2019. Scientific Works of the Union of Scientists in Bulgaria - Plovdiv. Series C. Technics and Technologies. Vol. XVII., ISSN 1311 -9419 (Print); ISSN 2534-9384 (Online), 2019

APPLICATION OF GROUP CONTRIBUTION METHODS FOR PREDICTION OF HEATS OF FORMATION Vesselina Paskalevaa, Ognyan Pukalova, Nikolay Kocheva "Faculty of Ch emistry, University of Plovidv "P. Hilendarski", 24 Tsar Assen str., Plovdiv 4000, Bulgaria

Abstract

We present a set of group contribution models for predicting heat of formation of organic compounds. A dataset containing 1004 molecular structures from DIPPR database was split into a learning set and a test set further used for model training and validation. The model building workflow was performed with an in-house developed software Ambit-GCM. The fully customizable and flexible features of the software tool allow execution of exhaustive data exploratory analyses in the context of QSPR/QSAR modelling. A set of preliminary models were build according to various fragmentation schemes, with and without use of correction factors and external descriptors. Different orders of additive schemes were studied. Every model in the set was validated using leave-one-out procedure and Y-scrambling technique as well as model performances were tested using the external dataset. Best models were chosen as final model set and corresponding statistical characteristics are presented and discussed. The best model is available as a JSON at https://doi.org/10.5281/zenodo.1494281 file and can be used for theoretical prediction of heat of formation.

Key words: group contribution method, additive scheme, heat of formation, QSPR, software modelling

Introduction

Heat of formation (Hf) is an important fundamental physicochemical property of the molecules used actively in the process of discovery and development of new energetic compounds (Vatani, Mehrpooya and Gharagheizi, 2007) (Elioff, Hoy and Bumpus, 2016). Hf value is needed when thermal efficiency of the equipment for producing power or heat is considered (Gharagheizi, Mirkhani and Tofangchi Mahyari, 2011). Group contribution methods are widely used for theoretical calculation of heat of formation property (Vatani, Mehrpooya and Gharagheizi, 2007). These methods are one of the main techniques for QSAR/QSPR modelling and are also known as additive schemes or substructure-based methods. Property of interest for these methods is modelled by summing the contributions of structural fragments with different sizes. Structural fragments can be described using various local attributes (e.g. atom type, valence, number of hydrogen atoms, pi-systems, etc.).

Used software

Models for heat of formation prediction were built using open-source tool Ambit-GCM v.1.1 (Kochev et al., 2018), an in-house developed software that provides an environment for creating models based on zero, first and second order additive schemes as well as customized higher order schemes. Ambit-GCM (https://zenodo.org/record/1470792) is a Java-written software

based on chemoinformatics library CDK (Willighagen et al., 2017) and is part of the chemoinformatics platform Ambit (Jeliazkova and Jeliazkov, 2011). Molecular properties are calculated by an equation of the type:

P = nFrag(i)-1 Frag(i) + ^ nCF(J)-1 CF(J) + ^ dk-1 d(k) i j k where IFrag($ is the increment (contribution) of fragment Frag(i) and nFrag(^ is the number of occurrences of Frag(i) within particular compound; accordingly nCF(j) and ICF(j) are the occurrences and increment values for correction factor j; I d(k) is the increment for external descriptor dk. The group contributions are obtained using multivariable linear regression analysis. The software supports dynamic configuration for atomic descriptions, user defined correction factors and an option to use externally calculated descriptors (typically used as additional correction factors or global descriptors for the model). Ambit-GCM tool supports methods for model validation, e.g. leave-one-out cross-validation procedure and Y-scrambling techniques as well as testing with external datasets. Different statistical parameters are calculated during model building and validation.

Datasets description

The dataset used for this study contains 1004 molecular structures and their experimentally derived standard state enthalpy of formation measured at 298.15K and 1bar from the DIPPR database accessed from (Vatani, Mehrpooya and Gharagheizi, 2007). The dataset was split randomly in ratio 80:20, generating a learning set with 803 molecular structures and a test set containing 201 molecular structures. Distributions of the target property (Hf) for the learning and test sets are given in Figure 1.

3 u 3

V -D

E

3

■ learn set □ test set

(N m

ID r-v

LO

m ID

■ o

m m

|.

o r-v

o (N (N UO

O o

* ! ' O

(N m

(N m

I

LO

oo

LO

n Un ■■

o rn oo

oo uo cm en id m O eri io oo

r-v ID LO

(N (N ^ m

r-v (N (N

(N ^ (N (N m ID ID

Hf, kJ/mol

Figure 1 Distribution of the target property (Hf, kJ/mol) values for the learning and test sets

The molecular structures from learning and test sets are described according to their molecular weight (MW) (see Figure 2) and the number of heavy (non-H) atoms (Dragon descriptor nSK) in Figure 3. Most of the structures from both sets have MW in the range [60; 230]. Approximately 1% of the structures exhibit MW less than 60. Molecular weight over 400 is observed in 3% from the learning set structures and 1% from the test structures.

o

0:2

es

VD VO

(N

m m

■ learn set □ test set

m es o

(N m m

Kri c r

,J5 ^ <2

oc^oot^vouo^me^

^ m^vooooe^^vooooe^mmt^c^ <N (NtNtNtNcnmcnmcn^^^^^^

A

MW

Figure 2 Distribution of the dataset structures from the learning and test sets according to their molecular

weight

The learning set includes structures containing from 2 to 50 heavy atoms and respectively the test set structures span from 3 to 32 heavy atoms. Molecular structures containing more than 25 heavy atoms are 3.4% of the learning set and 1.5% of the test set.

7 (Nj (NtNtNtNcnmcnm

Figure 3 Distribution of the dataset structures from the learning and test sets according the number of

heavy atoms (nSK)

The statistics from figures 1, 2 and 3 show that the used datasets are well balanced in respect to the Hf values and structural diversity.

Descriptor calculation and selection

Molecular descriptors for all compounds in the dataset were calculated using Dragon7 v.7.0.8 software ('Kode srl, Dragon (software for molecular descriptor calculation)'). An initial set of 5270 descriptors was generated, which subsequently was reduced by excluding descriptors with constant values, near-constant values and the descriptors with at least one missing value. Thus, the number of descriptors left after filtration and clean-up procedures was reduced to 2016.

Afterwards, two methods for feature selection implemented in the Weka v.3.8.2 software (Witten et al., 2016) were used. The first method is the default Weka method based on evaluator CfsSubsetEval and search method BestFirst. Using this approach, 19 descriptors out of 2016 were

selected. The second method uses evaluator CorrelationAttributeEval and search based on Ranker, which calculated ranks for all of the descriptors.

During model building, different subsets from the descriptors were used as correction factors.

Exploratory analysis and model building results

Group contribution methods based on atomic and bond schemes were used during the exploratory data analysis stage. All possible variations for atom coding were tested for both types of additive schemes. The models were validated using leave-one-out cross validation and Y-Scrambling procedure. Model predictive ability was examined using external dataset test.

Initially the models were built without correction factors and external descriptors. As a result, a set of models were collected and compared according to their statistical results. The best atomic (model 1) and best bond model (model 4) statistical parameters are given in Table 1

Table 1 Statistical parameters for the best Heat of formation models

N LD coding FN ED CorrF Order Model Training

r2 R2 RMSE MAE

1 A<H,HeN> 29 - No Atomic 0.9943 0.9943 36.6 27.2

2 - Yes 0.9947 0.9947 35.1 25.99

3 1 Yes 0.8943 0.8880 161.9 84.4

4 A<H,Val> 74 - No Bond based 0.9732 0.9726 80.0 36.9

5 - Yes 0.9735 0.9729 79.6 36.1

YS- LOO validation

1000 R2 RC2 Q2 RMSE MAE

1 -0.05 0.9938 0.9937 0.9938 38.3 28.4

2 0.03 0.9942 0.9942 0.9942 37.0 27.3

3 -0.19 0.8757 0.8707 0.8584 181.99 87.8

4 -0.05 0.9687 0.9684 0.9681 86.4 39.2

5 -0.03 0.9686 0.9683 0.9679 86.6 38.6

N External model test

R2 Rc2 Q2 RMSE MAE

1 0.9868 0.9864 0.9866 35.3 26.3

2 0.9895 0.9893 0.9894 31.5 23.7

3 0.6732 0.6544 0.6559 178.9 86.8

4 0.9560 0.9555 0.9558 64.1 31.1

5 0.9583 0.9579 0.9580 62.5 29.3

Legend: N - model number LD - local descriptor FN - number of fragments ED - external descriptors CorrF - correction factor/s r2 - Pearson correlation coefficient R2 - Coefficient of determination Q - Prediction coefficient Rc2 - Concordance correlation coefficient RMSE - Root mean square error MAE - Mean absolute error

In the second stage, models 1 and 4 were rebuilt by adding correction factors defined by means of appropriate SMARTS (SMARTS) fragments as follows: model 2: [c;r], [#6,#7;r]; model 3: [#6;r], [#7;r]; model 5: [c;r6], [#7;r5], [#7;r6], [#8;r5,r6].

Correction factors definitions were made after expert analysis of the molecular structures for the compounds exhibiting largest differences between the predicted and experimental Hf values. As it can be seen, defined SMARTS correction factors are assigned to cyclic systems with different sizes. Statistical parameters for rebuilt models are also given in Table 1 (see models 2 and 5). Finally, we studied the influence of external descriptors.

iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.

Statistical results show that atom based model gives better statistical results in comparison with bond based model. It uses less number of fragments - 29 atomic types compared to 74 bond types. The full list of generated fragments (defined atomic and bond types) for Hf models is given in Table 2. The only external descriptor used in model 3 is SssssC (atom type (>C<) E-State sum non-hydrogen indices) calculated by Dragon v7.0 software. The addition of correction factors reduces slightly the RMSE and MAE of the models. However, the inclusion of external descriptors to the models leads to worse statistical results. Variety of combinations of external descriptors set were tested but none improved the model.

Table 2. List of local descriptors used in atomic and bond based models

LD A<H,HeN> Atomic (models 1, 2 and 3) LD A<H,Val> Bond based (models 4 and 5)

C22 S04 C24-N23 002=S04 S02-S02 C14=C24 Br01-C04

C21 S03 C14=002 C14-Si04 C24-C24 C04-001 C14-F01

C02 Si04 C04-C34 C04-F01 C14-002 C04-002 C04-S12

C04 Si03 C24-I01 002=S06 C34-N13 C14-012 C04=C14

F01 C31 C24-S02 C04-Cl01 C14-S02 C04=N03 C04-N34

C03 C13 Br01-C14 C24-S12 C24-Si04 C04-I01 C04-012

I01 C12 C24-002 N03-P05 C04#N03 C24-C34 C14-N03

N21 Cl01 002-S04 Br01-C24 C14=N03 002=P05 C04-C14

N01 N31 C04=C04 002-Si04 C14-Cl01 C14-S12 C34-012

N22 011 C34-002 C14-C14 C04=C24 C04-C04 012-S06

002 Br01 002-S06 C24-012 Cl01-Si04 C04-C24

N03 N12 C24-S06 C04-N13 C14-C24 C34-F01

001 S11 C04-N03 C14=C14 C24-N03 C34-Si04

N02 C34-S02 C14-N24 C34-N03 C14-C34

S02 C04=002 C24-F01 C04-S02 C24-N13

P04 C14-N13 C14-N23 C04-N23 C24-Cl01

The predicted results for the chosen models were compared according their relative error calculated by the formula:

Predicted — Experimental |Experimental|

Comparison of the results obtained for the external dataset according their relative errors for models 1 and 4 (without correction factors) is given in Figure 4 and for models 2 and 5 (with correction factors) in Figure 5. Negative bins in the histogram shows RE% where the predicted values are smaller than the experimental values and the positive bins are for RE% where the predicted values are larger than the experimental values. For example, the result for bin 6 shows that 39 structures have RE% in the range (2;6] obtained from model 1 and the predicted values given by the model are larger than the experimental values for the structures.

o

LO

-30 -26 -22 -18 -14 -10 -6 -2 2 6 10 14 18 22 26 >30

Relative % Error

Figure 4. Distribution of the relative error (%) for models 1 and 4

According to the RE% obtained from application of model 1 to the external dataset, 76% of the structures have relative error in the range (-10;10], respectively when applying of model 4, 73% of the structures have relative error |RE%| < 10%. The inclusion of correction factors in models 2 and 5 increases the structures with |RE%| < 10% to be respectively 80% and 76% i.e. slight improvement of the models is observed.

0

LO

-30 -26 -22 -18 -14 -10 -6 -2 2 6 10 14 18 22 26 >30

Relative % Error

Figure 5 Distribution of the relative error (%) for models 2 and 5

Comparison of the predicted and experimental result for the atomic (model 2) and bond based (model 5) models with correction factors is given in Figure 6. The atomic model (model 2) has better performance than created bond based model.

Pred,

-2500

▲ model 2

500

Pred.

-2500 -1500

500 A

A model 5

-2500

Exp.

-2500

500

Exp.

Figure 6. Comparison of the predicted and experimental results for models 2 and 5

The results obtained using model 2 are comparable with the results described in (Vatani, Mehrpooya and Gharagheizi, 2007).

Example fragmentation for the molecule of butane-1,2-diol is given in Figure 7. The fragmentation is based on atom coding used in model 2 A<H,HeN>. Four fragments are found and their occurrences and contributions are used for the model calculation as follows:

Hf,(PredjCted) = Ii*N(C<13>) + I2*N(C<22>) + I3*N(C<31>) + I,*N(O<11>) = = 1* (-22.522) + 2* (-28.347) + 1* (-34.768) + 2*(-205.096) = -524.176 kJ/mol

The predicted heat of formation, -524.2 kJ/mol, is very close to the experimental value: -523.6 kJ/mol.

C<31> Fragment Occurance Contribution

H3C-C 22 C<13> 1 -22.522

C<13> C<22> 2 -28.347

OH C<31> 1 -34.768

O<11> O<11> 2 -205.096

C<22>

Figure 7.Fragmentation for butane-1,2-diol using atom coding A<H,HeN> and fragments contribution

Example application of gcm-predict (Ambit-GCM) module for a single structure is given

below:

java -jar gcm-predict-v1.1.jar -s CC(C)OCC(C)O -c model_2.json

GCM value (Hf) for CC(C)OCC(C)O is -528.7163407123614

Heat of formation for the molecule of 1 -isopropoxy-2-propanol (inputted as a SMILES) is predicted by model 2 specified as an input json file (previously created with Ambit-GCM). Json file contains the full model information: target property, model type, local descriptors coding, correction factors, filtration threshold, fragments and their increments. The predicted heat of formation value is -528.7 kJ/mol. Experimental value for this compound is -529.6 kJ/mol.

The gcm-predict (Ambit-GCM) module can also be applied for a set of structures. An example follows with 5 molecules inputted as a *.csv file:

java -jar gcm-predict-v1.1.jar -i Prediction_Examples.csv -c model_2.j son

GCM calculateting property Hf for 5 molecules ...

Mol#,ModelValue(Hf),SMILES,CalcStatus

1,-126.23055043388635,CCCC, OK

2,-353.4883670381994,c1ccc(c(c1)O)O,OK

3,-524.1758445616103,CCC(CO)O, OK

4,-220.91676720730212,CC(Cc1ccccc1)O,OK

5,-728.1309488149211,C(C(Cl)(Cl)F)(F)F,OK

The output lines contain: molecule number, predicted Hf, SMILES, and calculation status

Conclusions

Several QSPR models were built using Ambit-GCM software tool. The best group contribution model for prediction of heat of formation of chemical compounds was chosen and validated. The obtained statistical results showed that the model has a good predictive ability and can be effectively applied for theoretical calculation of heats of formation of organic compounds. The model JSON file, full learning and testing raw data and model characteristics can be downloaded from following address https://doi.org/10.5281/zenodo. 1494281. Software tool gcm-predict (Ambit-GCM) is available at https://zenodo. org/record/1470792.

Acknowledgements

We would like to thank Plovdiv University Scientific Fund (project Myi7-X®-027) for supporting this scientific work.

References:

Elioff, M. S., Hoy, J. and Bumpus, J. A. (2016) 'Calculating Heat of Formation Values of Energetic Compounds: A Comparative Study', Advances in Physical Chemistry. Hindawi, 2016, pp. 1-11. doi: 10.1155/2016/5082084.

Gharagheizi, F., Mirkhani, S. A. and Tofangchi Mahyari, A.-R. (2011) 'Prediction of Standard Enthalpy of Combustion of Pure Compounds Using a Very Accurate Group-Contribution-Based Method', Energy & Fuels, 25(6), pp. 2651-2654. doi: 10.1021/ef200081a.

Jeliazkova, N. and Jeliazkov, V. (2011) 'AMBIT RESTful web services: an implementation of the OpenTox application programming interface.', Journal of cheminformatics, 3(1), p. 18. doi: 10.1186/1758-2946-3-18.

Kochev, N. et al. (2018) 'Ambit-GCM: a Software Tool for Group Contribution Modelling'. doi: 10.5281/ZEN0D0.1470793.

'Kode srl, Dragon (software for molecular descriptor calculation)' (no date).

SMARTS, Daylight tutorial (no date). Available at: http://www.daylight.com/dayhtml_tutorials/languages/smarts/index.html (Accessed: 22 November 2018).

Vatani, A., Mehrpooya, M. and Gharagheizi, F. (2007) 'Prediction of Standard Enthalpy of Formation by a QSPR Model', pp. 407-432. doi: 10.3390/i8050407.

Willighagen, E. L. et al. (2017) 'The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching', Journal of Cheminformatics. Nature Publishing Group, 9(1), p. 33. doi: 10.1186/s 13321-017-0220-4.

Witten, I. H. et al. (2016) Data mining: practical machine learning tools and techniques. Fourth Edi. Morgan Kaufmann.

i Надоели баннеры? Вы всегда можете отключить рекламу.