Machine learning techniques in the analysis of Raman data
O.A. Mayorova1, M.S. Saveleva1, Yu.I. Svenskaya1, D.N. Bratashov2, E.S. Prikhozhdenko2*
1- Science Medical Center, Saratov State University, Saratov, Russia 2- Department of Innovations, Physics Institute, Saratov State University, Saratov, Russia
* prikhozhdenkoes@gmail.com
Raman spectroscopy is a versatile and powerful technique for determining the chemical composition of samples [1]. However, challenges arise when analyzing macromolecules of similar nature (e.g., proteins and fatty acids [2]) or when one component of a mixture is present at low concentration [3]. Machine learning algorithms can enhance the analysis and provide more accurate results in such cases. Dimensionality reduction techniques, such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE), can be used to identify patterns in the data [4] and to improve the signal-to-noise ratio without significantly reducing the Raman intensity [3]. More complex tasks may require training regression or classification models. Ensemble models, such as those based on gradient boosting, not only provide high accuracy in distinguishing between samples with small differences, but also identify the Raman wavenumbers most significant for the analysis. To illustrate this, gradient boosting classifiers were trained on datasets containing Raman spectra of whey protein isolate (WPI) and of WPI with different amounts of hyaluronic acid (HA): 0.1%, 0.25%, and 0.5% (Fig. 1). Model accuracy was evaluated as a function of the number of spectra in each class. Although the model trained on a dataset of 200 spectra (50 spectra per class) has an overall accuracy of only 0.7 (Fig. 1A), it still differentiates between the spectra of WPI and WPI+0.1% HA with an 83.3% success rate.
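The PCA-based noise reduction mentioned above can be sketched as follows. This is a minimal illustration on synthetic data, not the processing pipeline used in [3]: the band positions, noise level, number of retained components, and the helper names `band` and `pca_denoise` are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
wavenumbers = np.linspace(400, 1800, 300)  # cm^-1, hypothetical spectral axis


def band(center, width):
    """Gaussian band used as a stand-in for a Raman peak (assumption)."""
    return np.exp(-0.5 * ((wavenumbers - center) / width) ** 2)


def pca_denoise(spectra, n_components):
    """Project spectra onto the leading principal components and reconstruct.

    Components beyond n_components, which mostly carry noise, are discarded.
    """
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(spectra)
    return pca.inverse_transform(scores)


# Synthetic dataset: 100 spectra built from two bands with varying amplitudes.
amplitudes = rng.uniform(0.5, 1.5, size=(100, 2))
clean = amplitudes[:, :1] * band(1003, 8) + amplitudes[:, 1:] * band(1450, 15)
noisy = clean + rng.normal(0.0, 0.1, size=clean.shape)

denoised = pca_denoise(noisy, n_components=2)

# Compare the error against the known noise-free spectra.
rmse_noisy = np.sqrt(np.mean((noisy - clean) ** 2))
rmse_denoised = np.sqrt(np.mean((denoised - clean) ** 2))
print(f"RMSE before: {rmse_noisy:.3f}, after: {rmse_denoised:.3f}")
```

With measured spectra the number of components is not known in advance; it would typically be chosen from the explained variance of the fitted PCA rather than fixed as here.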
Figure 1. Efficiency of classification models based on gradient boosting depending on the number of spectra per class. (A-C) Feature importances calculated for models trained on 50 (A), 100 (B), and 150 (C) spectra per class. (A-C, insets) Confusion matrices of the models calculated on the test portion (20%) of each dataset. (D) Average Raman spectra of each class. Light gray vertical lines indicate wavenumbers with importance > 1%.
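The workflow summarized in Figure 1 (a gradient boosting classifier, a 20% test split, a confusion matrix, and feature importances above 1%) might be sketched as below. The spectra are synthetic: the band positions, the noise level, and the way the HA contribution scales with concentration are placeholders, and the classifier uses scikit-learn defaults, which are not necessarily the settings used by the authors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
wavenumbers = np.linspace(400, 1800, 200)  # cm^-1, hypothetical spectral axis
labels = ["WPI", "WPI+0.1% HA", "WPI+0.25% HA", "WPI+0.5% HA"]
ha_content = [0.0, 0.1, 0.25, 0.5]
n_per_class = 50  # smallest dataset size considered in the text


def band(center, width):
    """Gaussian band used as a stand-in for a Raman peak (assumption)."""
    return np.exp(-0.5 * ((wavenumbers - center) / width) ** 2)


# Synthetic spectra: a shared "protein" signal plus one band whose height
# scales with HA content. Real spectra would be loaded from measurements.
X_parts, y_parts = [], []
for class_id, ha in enumerate(ha_content):
    mean_spectrum = band(1003, 10) + 0.8 * band(1660, 25) + ha * band(1100, 15)
    noise = rng.normal(0.0, 0.05, size=(n_per_class, wavenumbers.size))
    X_parts.append(mean_spectrum + noise)
    y_parts.append(np.full(n_per_class, class_id))
X = np.vstack(X_parts)
y = np.concatenate(y_parts)

# Hold out 20% of the spectra for testing, keeping classes balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted

# Wavenumbers whose feature importance exceeds 1%.
important = wavenumbers[model.feature_importances_ > 0.01]

print(f"accuracy = {accuracy:.2f}")
print(cm)
print("important wavenumbers (cm^-1):", np.round(important, 1))
```

Repeating the same procedure with `n_per_class` set to 100 and 150 gives the kind of comparison across dataset sizes that the figure describes.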
Evaluating how model performance depends on the number of spectra analyzed is therefore important. Although more data is generally assumed to be better, a relatively small number of collected spectra can still yield valuable results.
This research was funded by the Russian Science Foundation, grant number 22-79-10270 (https://rscf.ru/en/project/22-79-10270/).
[1] A. Orlando, F. Franceschini, C. Muscas, S. Pidkova, M. Bartoli, M. Rovere, A. Tagliaferro, A comprehensive review on Raman spectroscopy applications, Chemosensors, 9(9), 262, (2021).
[2] I.Yu. Yanina, Yu.I. Svenskaya, E.S. Prikhozhdenko, D.N. Bratashov, M.V. Lomova, D.A. Gorin, G.B. Sukhorukov, V.V. Tuchin, Optical monitoring of adipose tissue destruction under encapsulated lipase action, Journal of Biophotonics, 11(11), e201800058, (2018).
[3] O.A. Mayorova, M.S. Saveleva, D.N. Bratashov, E.S. Prikhozhdenko, Combination of Machine Learning and Raman Spectroscopy for Determination of the Complex of Whey Protein Isolate with Hyaluronic Acid, Polymers, 16(5), 666, (2024).
[4] Y.J. Liu, M. Kyne, C. Wang, X.Y. Yu, Data mining in Raman imaging in a cellular biological system, Computational and Structural Biotechnology Journal, 18, 2920-2930, (2020).