CC BY
UDC 004.421:543.51

EDN: ODEQRN

DETECTION OF ISOTOPIC PEAK SERIES IN LOW-RESOLUTION MASS SPECTRA USING CLUSTERING ALGORITHM AND CHI-SQUARE TEST

V.V. Lebedev glory.leb@gmail.com

I.S. Pytskii ivanpic4586@gmail.com

A.K. Buryak akburyak@mail.ru

Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, Moscow, Russian Federation

Abstract

This paper presents an algorithm for determining whether a peak detected in a mass spectrum during signal processing belongs to an isotopic peak series. The algorithm first groups the detected peaks into preliminary clusters, then checks whether the distribution of peak intensities in each cluster matches the selected pattern, and finally performs a grouping that takes the positions of peaks along the m/z axis into account. The features that make the proposed algorithm more resistant to negative phenomena, which hinder the detection of isotopic peak series in low-resolution mass spectra by existing methods, are described in detail. We present the results obtained with experimental mass spectra of silver(I) chloride and silver(I) bromide as input. The tested mass spectra were characterized by various negative phenomena that hinder the detection of isotopic peak series. The proposed algorithm is shown to group peaks with a quality similar to that of existing linear models while avoiding empirical rules that are valid only for certain classes of chemical compounds. Since the algorithm requires a pattern to be selected to model the distribution of intensities in a possible isotopic peak series, we suggest that its practical application is viable when multiple similar compounds with a known pattern of peak intensity distribution are examined using a low-resolution mass spectrometer.

Keywords

Mass spectrometry, signal processing, isotopic peak series, DBSCAN clustering algorithm, chi-square test

Received 07.06.2023 Accepted 26.09.2023 © Author(s), 2024

This work was supported by a grant from the Russian Science Foundation (grant no. 22-13-00266) for the Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences

Introduction. Signal processing is an integral part of working with an experimental mass spectrum. It comprises several consecutive operations, of which peak detection is one of the most crucial [1]. Peak detection singles out, from the entire array of registered signals, the signals that originate from ions formed during fragmentation of the studied compound, i.e., peaks. Numerous algorithms for detecting peaks in mass spectrometric data have been described in the literature [2].

Mass spectrometry inherently allows determining the isotopic composition of the studied compound and its fragments. Because of that, the majority of ions formed during mass spectrometric analysis are registered as a series of several peaks. Such series are known as "isotope clusters" in the literature. To avoid confusion with machine learning clustering, which is also referenced in this research, we hereinafter use the term "isotopic peak series", or simply "isotopic series", to refer to isotope clusters. Peaks that constitute an isotopic peak series correspond to various combinations of isotopes of the chemical elements present in the formed ion and are separated from each other along the m/z axis by distances approximately equal to an integer. The distribution of intensities in an isotopic peak series and the positions of its peaks along the m/z axis are two features that provide vitally important information for identifying the formed ion. Accordingly, for subsequent analysis the researcher needs to understand which detected peaks together form an isotopic series and originate from the same ion.

Linear models are used in the majority of existing studies devoted to grouping peaks into isotopic series [3-6]. In such models, the decision to merge peaks into a single series is primarily based on the distance between peaks along the m/z axis. Except for paper [3], all the mentioned studies explicitly assume that two peaks can belong to one isotopic series if the distance between them equals an integer within a margin of error. The maximum tolerable deviation of the distance from the expected integer is commonly a user-defined parameter. In papers [5, 6] this deviation was constant over the whole range of m/z values, whereas in [4] the tolerable deviation increased in direct proportion to m/z. In addition to checking the distance between peaks, all the approaches described in [3-6] require the putative isotopic peak series to comply with special empirical rules. For example, the approach presented in [6] required each succeeding peak in the isotopic series to have a lower intensity than the preceding peak. The rules applied in [3, 4] rely on coefficients calculated from a number of reference mass spectra from various databases. Besides linear models, a more recent way of grouping peaks into isotopic series is known. It makes use of machine learning techniques, namely a classifier based on a tree ensemble [7]. Several parameters used in [7] to decide whether a peak belongs to an isotopic series are also calculated using data from external sources.

It should be noted that the existing approaches to grouping peaks into isotopic series have certain limitations. First, the majority of such approaches assume high-resolution mass spectra as input data. We failed to find recent studies dedicated to grouping peaks from low-resolution mass spectra into isotopic series. The positions of peaks along the m/z axis are assumed to be determined very precisely in high-resolution mass spectra. This allows the search area for a subsequent peak that possibly belongs to the isotopic series to be narrowed significantly (to 10⁻⁴ m/z and below). Such an assumption, however, does not hold for low-resolution mass spectra. For a number of reasons (e.g., equipment calibration errors or detector overload), the observed distance between two peaks belonging to a single isotopic series can deviate from the theoretical integer value by tenths of m/z [8] in low-resolution mass spectra. Second, the empirical rules applied in the known approaches make these approaches effective only for grouping peaks in mass spectra of compounds belonging to specific classes. For example, the rule of monotonic intensity decrease from approach [4] prevents this methodology from being used for grouping peaks that originate from chlorine- or bromine-containing ions. Furthermore, empirical rules often cannot be easily adapted to compound classes other than the one they were formulated for. Third, a similar problem is common for models that use external data, including machine learning-based models. Such models may yield a low quality of grouping when used to process mass spectra of compound classes that were not present in the training set used to calculate the decision parameters [3, 7]. In addition, the required external data may be unavailable for the selected experimental conditions (e.g., ionization type).

The present research was aimed at creating an algorithm for grouping peaks into isotopic series that could be applied to low-resolution mass spectra and would address the limitations of existing approaches. In particular, the requirements for the algorithm comprised insensitivity to moderate errors in the determined positions of isotopic peaks along the m/z axis, a more formal approach to deciding whether to merge peaks into an isotopic series, and independence from external data. The results are presented in the subsequent sections of this paper.

Materials and methods of research. Peak grouping algorithm. The proposed algorithm for grouping peaks into isotopic series comprises five steps, with the first two aimed at obtaining the list of mass spectrum peaks and the remaining three implementing the grouping itself. The full list of steps and their detailed description are presented below:

1) reduction of raw mass spectrum's dimensionality;

2) peak detection;

3) preliminary grouping of peaks into clusters that contain putative isotopic peak series by means of DBSCAN clustering algorithm;

4) application of chi-square test for each cluster containing at least 4 peaks to check whether the distribution of peak intensities in cluster matches the chosen pattern;

5) evaluation of all preliminary clusters that contain fewer than 4 peaks and all clusters that failed the chi-square test using the traditional "naive" method (i.e., by checking the distances between preliminary cluster peaks along the m/z axis), followed by final grouping.

Smoothing and baseline correction were not carried out as part of this research; however, it is assumed that these operations can be performed before the algorithm is executed.

The first step, i.e., dimensionality reduction, implies transformation of the raw pseudo-continuous mass spectrum into a "centroid"-type mass spectrum, where one peak is represented by one "m/z-intensity" pair of values [9]. In this research, dimensionality reduction was conducted using the detectPeaks() method from the MALDIquant package for the R programming language [10]. This method searches for local intensity maxima in a sliding window of fixed width. Only signals whose intensity exceeds the median absolute deviation of the intensities of all mass spectrum signals are included in the resulting array. The window width was set to the default value proposed by the developers of the MALDIquant package, which equals 20 neighboring data points.
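The local-maximum search described above can be illustrated with a minimal Python sketch. It is not the MALDIquant implementation: the function names `mad` and `centroid`, the symmetric window of `half_window` points on each side, and the handling of ties (every tied point is kept) are assumptions made only for illustration.

```python
from statistics import median


def mad(values):
    """Unscaled median absolute deviation of a list of numbers."""
    med = median(values)
    return median([abs(v - med) for v in values])


def centroid(mz, intensity, half_window=20):
    """Keep points that are the maximum of their sliding window and whose
    intensity exceeds the MAD of all intensities in the spectrum."""
    noise = mad(intensity)
    kept = []
    n = len(intensity)
    for i in range(n):
        lo = max(0, i - half_window)          # window is clipped at the edges
        hi = min(n, i + half_window + 1)
        if intensity[i] == max(intensity[lo:hi]) and intensity[i] > noise:
            kept.append((mz[i], intensity[i]))
    return kept


# Synthetic spectrum: a repeating low-level pattern plus two strong signals.
mz_axis = [100.0 + 0.1 * i for i in range(50)]
signal = [float((i * 7) % 5) for i in range(50)]
signal[12] = 50.0
signal[32] = 80.0
reduced = centroid(mz_axis, signal, half_window=5)
```

The reduced list still contains some low-intensity local maxima of the background, which is why a separate peak detection step follows.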

The reduced mass spectrum is used as input for the peak detection operation. The proposed approach allows peaks to be detected by any desired method that accepts a two-dimensional array of signals as input and produces a two-dimensional array of peaks as output. For testing purposes, peak detection was performed by calculating the signal-to-noise ratio (SNR) over non-overlapping windows. The window size was set to 20 m/z, and the local noise level was calculated as the median absolute deviation of the signals within the window [11]. The intensity of a signal had to exceed the local noise level by at least a factor of 5 (i.e., an SNR threshold of 5) for the signal to be included in the peak list.
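A minimal sketch of SNR-based peak detection over non-overlapping windows is given below. The function name `detect_peaks`, the way windows are anchored at the first m/z value, and the decision to keep nothing from a window whose noise estimate is zero are assumptions of this sketch and are not taken from the paper.

```python
from statistics import median


def detect_peaks(signals, window=20.0, snr=5.0):
    """signals: list of (mz, intensity) pairs. A signal is kept when its
    intensity is at least `snr` times the MAD of its own window."""
    if not signals:
        return []
    start = min(mz for mz, _ in signals)
    bins = {}
    for mz, inten in signals:
        # Assign each signal to a non-overlapping window of fixed m/z width.
        bins.setdefault(int((mz - start) // window), []).append((mz, inten))
    peaks = []
    for key in sorted(bins):
        members = bins[key]
        values = [inten for _, inten in members]
        med = median(values)
        noise = median([abs(v - med) for v in values])  # local noise level
        if noise <= 0:
            continue  # assumption: no decision without a noise estimate
        peaks.extend((mz, inten) for mz, inten in members
                     if inten >= snr * noise)
    return peaks


# Twenty signals inside one 20 m/z window; only one stands out of the noise.
demo = [(100.0 + i, float([1, 2, 3, 4][i % 4])) for i in range(20)]
demo[10] = (110.0, 100.0)
found = detect_peaks(demo)
```

Because the noise level is estimated per window, a strong signal raises the threshold only for its own window, not for the whole spectrum.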

Once the peak list is formed, the detection of isotopic peak series itself is carried out. First, the peaks are preliminarily merged into clusters that can possibly contain isotopic peak series. The merger is based on the positions of peaks on the m/z axis and is conducted by means of the DBSCAN clustering algorithm [12]. This machine learning algorithm splits the objects of the input array into non-overlapping clusters (i.e., groups, or classes) based on the density of the objects' distribution in one-dimensional or multi-dimensional space. The algorithm was previously used to determine which signals constitute a peak on the curve in high-resolution mass spectra of original dimensionality [13]. The classical version of DBSCAN requires two parameters to run, namely the maximum distance between neighboring objects that belong to the same class, ε, and the minimum number of neighboring objects to be regarded as one cluster, minPts. Within the scope of the task under consideration, the preliminary grouping of peaks into clusters is based on a single feature, i.e., their position along the m/z axis. The minimum number of peaks in a single isotopic series is 2; accordingly, minPts also equals 2. The value of ε can be set to 2, 3 or 4 depending on which chemical elements are expected to be present in the studied compound (see explanation below).
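For a single feature and minPts = 2, DBSCAN clusters coincide with runs of sorted positions whose consecutive gaps do not exceed the distance parameter, provided that a point counts toward its own neighborhood (an assumption that matches the common formulation). The sketch below uses this shortcut instead of a general DBSCAN implementation; the function name `precluster_1d`, the label -1 for isolated peaks, and the nominal m/z values in the example are illustrative only.

```python
def precluster_1d(mz_values, epsilon=3.0):
    """Label peaks by preliminary cluster; isolated peaks get the label -1.
    Equivalent to 1-D DBSCAN only for minPts = 2 (see the lead-in)."""
    order = sorted(range(len(mz_values)), key=lambda i: mz_values[i])
    groups, current = [], []
    for i in order:
        # A gap larger than epsilon closes the current run of peaks.
        if current and mz_values[i] - mz_values[current[-1]] > epsilon:
            groups.append(current)
            current = []
        current.append(i)
    if current:
        groups.append(current)

    labels = [-1] * len(mz_values)
    next_label = 0
    for group in groups:
        if len(group) >= 2:               # minPts = 2
            for i in group:
                labels[i] = next_label
            next_label += 1
    return labels


# Illustrative peak positions: two dense runs and one isolated peak.
example_labels = precluster_1d([106.9, 108.9, 141.9, 143.9, 145.9, 200.0])
```

With a general-purpose DBSCAN implementation the same result would be expected for these settings, but the shortcut makes the role of the distance parameter explicit: it is simply the largest gap that does not split a preliminary cluster.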

Subsequently, the distribution of intensities in each preliminary cluster that contains at least 4 peaks is compared with the preselected pattern. The comparison is conducted using the chi-square test [14]. Intensities of peaks from the tested cluster are used as observed frequencies. The theoretical probability distribution is calculated by fitting the peak intensities to the chosen pattern and normalizing the fitted values. The constraint on the minimum number of peaks (4) in the tested cluster is set by the built-in fitting tools of the R programming language. The fitting itself is done with the built-in nls() function, which implements the non-linear least squares technique. The Gaussian function was implemented in custom-written code and is passed to nls() as the formula to which the experimental data are fitted. The control parameters of nls(), i.e., the maximum number of iterations and the convergence criterion, are set to the defaults provided in the R programming language.
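The test can be sketched in Python under several loudly stated assumptions. The Gaussian parameters are estimated here by intensity-weighted moments rather than by non-linear least squares, the critical value comes from the Wilson-Hilferty approximation rather than an exact quantile function, and the number of degrees of freedom defaults to the number of peaks minus one because the paper does not state it. All function names are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist


def moment_fit(mz, intensity):
    """Intensity-weighted mean and standard deviation (stand-in for nls)."""
    total = sum(intensity)
    mu = sum(x * w for x, w in zip(mz, intensity)) / total
    var = sum(w * (x - mu) ** 2 for x, w in zip(mz, intensity)) / total
    return mu, sqrt(var)


def gaussian_probs(mz, mu, sigma):
    """Gaussian pattern evaluated at the peak positions, normalized to 1."""
    weights = [exp(-0.5 * ((x - mu) / sigma) ** 2) for x in mz]
    total = sum(weights)
    return [w / total for w in weights]


def chi_square_stat(observed, probs):
    """Pearson statistic with intensities used as observed frequencies."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))


def chi2_critical(df, cl):
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(cl)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * sqrt(a)) ** 3


def cluster_passes(mz, intensity, cl=0.95, df=None):
    if df is None:
        df = len(mz) - 1  # assumption: not specified in the paper
    mu, sigma = moment_fit(mz, intensity)
    probs = gaussian_probs(mz, mu, sigma)
    return chi_square_stat(intensity, probs) <= chi2_critical(df, cl)


positions = [100.0, 101.0, 102.0, 103.0, 104.0]
bell_shaped = cluster_passes(positions, [5.0, 25.0, 40.0, 25.0, 5.0])
alternating = cluster_passes(positions, [40.0, 5.0, 40.0, 5.0, 40.0])
```

In this sketch a bell-shaped cluster passes while an alternating one fails. Note that the statistic depends on the scale of the intensities, so the outcome of such a test is sensitive to how intensities are expressed.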

The preliminary clusters that passed the statistical test are considered to be correctly grouped isotopic series and are excluded from further processing. The chi-square test replaces the empirical rules that are used in studies [3-6] to decide whether to retain or split a putative isotopic series based on the distribution of peak intensities. In particular, this test prevents the incorrect merger of two overlapping isotopic series into one group. Furthermore, the chi-square test allows the algorithm to retain in an isotopic series peaks that lie outside the expected search range without the need to change the tolerable distance deviation (given that the observed and expected distributions of intensities match).

The final grouping of peaks in clusters that contain fewer than 4 peaks and in those that failed the test is conducted by means of the "naive" approach, i.e., by computing the distances between cluster peaks. This approach was slightly modified to enable grouping of isotopic series with various expected integer distances (1, 2, etc.) between consecutive peaks during a single run. In addition to the traditional eps parameter (the maximum tolerable deviation of the distance between two peaks from the expected integer value), our implementation of "naive" grouping requires the maximum expected integer distance $d_{m/z} \in \mathbb{N}$ to be specified. The implementation searches the cluster for isotopic peaks placed at distances of 1 ± eps, then 2 ± eps, etc., up to $d_{m/z}$ ± eps. The peaks merged into an isotopic series characterized by a certain integer distance are excluded from further iterations. This modification was aimed at enabling the algorithm to process mass spectra where various classes of compounds (e.g., chlorides of mono-isotopic metals, $d_{m/z} = 2$, and organic impurities, $d_{m/z} = 1$ [15]) are observed simultaneously. It should be noted that, like the known linear models, the proposed approach assumes that peaks forming an isotopic series are placed at roughly equal distances from each other. Currently, around 35 chemical elements are characterized by the masses of each heavier isotope and the preceding lighter isotope differing by the same integer value for all consecutive isotope pairs. The aforementioned assumption holds for all ions that contain at least one such element. In practice, this assumption allows the peaks of the majority of registered ions to be correctly grouped into isotopic series. However, it still constrains the algorithm's area of application, which needs to be taken into account.
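The iteration over expected integer distances can be sketched as follows. This is a simplified greedy reading of the description above, not the authors' code: the function name `naive_grouping`, the left-to-right chaining strategy, and the numbering of series in order of discovery are assumptions, and the m/z values in the example are illustrative.

```python
def naive_grouping(mz_values, d_max=2, eps=0.1):
    """Return (mz, series_id) pairs. Series are searched for distances
    1 +/- eps, 2 +/- eps, ..., d_max +/- eps; peaks already merged into a
    series are excluded from later iterations."""
    mz = sorted(mz_values)
    series = [0] * len(mz)                 # 0 means "not assigned yet"
    next_id = 1
    for d in range(1, d_max + 1):
        for start in range(len(mz)):
            if series[start]:
                continue
            chain = [start]
            for j in range(start + 1, len(mz)):
                if series[j]:
                    continue
                gap = mz[j] - mz[chain[-1]]
                if abs(gap - d) <= eps:
                    chain.append(j)        # next member of the series
                elif gap > d + eps:
                    break                  # later peaks are even farther
            if len(chain) > 1:
                for i in chain:
                    series[i] = next_id
                next_id += 1
    for i in range(len(mz)):               # leftover peaks get unique ids
        if not series[i]:
            series[i] = next_id
            next_id += 1
    return list(zip(mz, series))


grouped = naive_grouping([141.9, 143.9, 145.9, 150.0, 151.0, 152.02, 160.0])
series_ids = [sid for _, sid in grouped]
```

In the example, the peaks spaced by about 1 are merged during the first iteration and the peaks spaced by about 2 during the second, while the remaining peak receives its own number.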

The output of the proposed algorithm is an n × 3 matrix, where n is the number of peaks detected at step 2. The matrix columns represent the m/z value, the intensity and the assigned isotopic series number for each peak. Putative monoisotopic peaks are also assigned a unique number.

Input data used. The proposed algorithm for grouping peaks into isotopic series was tested with experimental low-resolution mass spectra acquired on laser desorption mass spectrometer.

Mass spectra of silver(I) chloride and silver(I) bromide were selected as input data. The distribution of peak intensities obtained by convolution of "isotopic distributions" of several metals with one or two abundant isotopes, including silver, with respective distributions of chlorine and bromine could be relatively easily approximated by a Gaussian function. This allows for more detailed control over the testing procedure in general and chi-square test aspects in particular.

Selection of silver compounds from the list of aforementioned metals simplifies the construction of expected grouping results array, which is required to calculate metrics of algorithm's functioning quality. Such calculation requires testing the algorithm on a set of raw mass spectra with list of annotated peaks available for each of them. To the best of our knowledge, the sets of metal halides mass spectra acquired by laser desorption method are not available in open sources. Respectively, the required mass spectra had to be acquired, and observed isotopic peak series needed to be identified. The laws of ion formation during laser desorption ionization are well known for silver halides [16]. This allowed identifying the fragments of studied compounds manually and constructing the required validation data set that contained expected grouping results.

Four experimental mass spectra were selected in order to test the proposed algorithm: (1) AgCl in positive ion registration mode (hereinafter the registration mode is denoted by "+" or "-"); (2) AgBr (-); (3) AgBr (+); (4) AgCl with bromine impurities (-). All mass spectra were acquired using a Bruker Daltonics Ultraflex II mass spectrometer (operating wavelength 337 nm, pulse energy 100 µJ, pulse power 43 kW, repetition rate 20 Hz). No matrix substance was used, since the examined compounds are readily ionized in the absence of a matrix [16]. The input mass spectra were selected so that the algorithm could be tested under various conditions.

Mass spectrum 1 represented nearly ideal conditions for grouping peaks into isotopic series. The distances between all consecutive peaks belonging to a single isotopic series were almost equal, and the position of each consecutive peak on the m/z axis was within the expected search area.

When dimensionality reduction was carried out, the positions of several isotopic peaks in mass spectrum 2 were determined incorrectly (probably due to unsuitable size of window for local maxima search). Because of that, distances between some of the peaks deviated from expected integer value by up to 0.27 m / z. This presented an opportunity to check whether chi-square test could help in grouping peaks, which in fact belong to isotopic series but lie out of expected search range over m / z axis.

The detector overload (i.e., registration of multiple signals with decreasing intensities following the registration of "true" peak corresponding to fragment ion) was observed when mass spectrum 3 was acquired. The consequences of this effect were not fully removed during dimensionality reduction. This allowed testing algorithm's ability to correctly detect isotopic peak series in preliminary clusters that contain multiple high-intensity noise signals.

Mass spectrum 4 was characterized by presence of several isotopic peak series positioned one after another. The distances between the first peak in subsequent series and the last peak in previous series were approximately equal to the distances between two peaks within one isotopic series. This feature of mass spectra was used to test whether algorithm is capable of correctly splitting such peak sequences into isotopic series.

The ions registered in the used mass spectra were identified manually before the testing procedures started. Identification was conducted by comparing the theoretical distribution of peak intensities for a given molecular formula with the intensity distribution observed in the registered peak series. The theoretical distribution was calculated using the IsoSpecR package [17]. The decision on whether the observed and theoretical intensity distributions match was based on a weighted sum score. For each peak from the observed isotopic series, the absolute difference between the relative intensity of the observed peak and the relative intensity of the theoretical peak closest to it in m/z was computed. If this absolute difference did not exceed a user-defined threshold, a value equal to the relative intensity of the theoretical peak was added to the sum score (with a weight of 1). Otherwise, the sum score remained unchanged. The theoretical distribution was considered to match the experimental one if the resulting sum score was at least (1 - threshold). In this study, the threshold was set to 0.05 (i.e., a maximum 5 % difference in the relative intensities of corresponding peaks). A total of 31 isotopic peak series, comprising 121 peaks, were identified across the 4 mass spectra. The results of the identification procedure were used to construct arrays of expected grouping results, which, in turn, allowed the quality of the algorithm's results to be assessed.
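The weighted sum score can be sketched in Python as follows. The sketch assumes that relative intensities are normalized so that they sum to 1 within each distribution, which the text does not state explicitly; the function names and the numerical values in the example are hypothetical and do not correspond to any real isotopic distribution.

```python
def match_score(observed, theoretical, threshold=0.05):
    """observed, theoretical: lists of (mz, relative_intensity) pairs.
    Adds the theoretical relative intensity to the score for every observed
    peak whose relative intensity is within `threshold` of its nearest
    theoretical peak (nearest by m/z)."""
    score = 0.0
    for mz_obs, rel_obs in observed:
        _, rel_th = min(theoretical, key=lambda p: abs(p[0] - mz_obs))
        if abs(rel_obs - rel_th) <= threshold:
            score += rel_th
    return score


def distributions_match(observed, theoretical, threshold=0.05):
    """The distributions match when the score reaches 1 - threshold."""
    return match_score(observed, theoretical, threshold) >= 1.0 - threshold


# Hypothetical theoretical pattern and two observed series.
pattern = [(100.0, 0.40), (102.0, 0.45), (104.0, 0.15)]
close_series = [(100.02, 0.38), (101.97, 0.47), (104.05, 0.15)]
distant_series = [(100.0, 0.60), (102.0, 0.25), (104.0, 0.15)]

close_ok = distributions_match(close_series, pattern)
distant_score = match_score(distant_series, pattern)
```

Weighting by the theoretical relative intensity means that a mismatch on a major peak lowers the score much more than a mismatch on a minor one.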

Testing procedure. The proposed peak grouping algorithm was implemented in R programming language [18]. Functionality of MALDIquant package was used to read raw files and reduce dimensions of mass spectra loaded into memory. The remaining algorithm steps were implemented in custom scripts.

A total of 27, 35, 160 and 70 peaks were detected in mass spectra 1 to 4, respectively. Several signals that were part of manually identified isotopic series according to the theoretical intensity distributions but were not automatically detected as peaks were added to the peak lists. The relative intensity of a signal in the theoretical distribution for the identified series had to exceed 5 % for the signal to be added to the list of peaks. No peaks detected at step 2 were removed from the peak list. The final number of signals classified as peaks amounted to 29, 36, 162 and 72 in mass spectra 1-4, respectively.

The data obtained during manual identification and the formed peak lists were used to obtain the matrices of expected results of grouping peaks into isotopic series for each mass spectrum. The same group number (label) was assigned to all experimental peaks that belonged to the same series. The numbering started from 1. Numbers were assigned consecutively and increased with the lowest peak m/z value in the series. The unidentified and noise peaks that were detected automatically were not included in the matrices of expected grouping results.

Detection of isotopic peak series and evaluation of the proposed algorithm's quality were carried out subsequently. The ε parameter of the DBSCAN algorithm was set to 3, meaning that consecutive peaks separated by a distance of no more than 3 m/z fell into one preliminary cluster. Isotopes of each chemical element present in the studied compounds and their fragments (Ag, Br, Cl) differ in mass by approximately 2 Da. Under such conditions, two consecutive peaks located more than 3 m/z apart are guaranteed not to belong to a single isotopic series, with all possible errors taken into account. The $d_{m/z}$ and eps parameters of the "naive" grouping method equaled 2 and 0.1, respectively. The value of $d_{m/z}$ was selected based on the isotope mass differences mentioned above. The maximum tolerable deviation eps of the distance between two consecutive peaks from the expected integer was set to 0.1, since all m/z values listed in the official description of the used mass spectrometer are specified with a precision of 0.1 m/z*. Critical values of the chi-square test statistic were calculated at confidence levels cl of 0.9, 0.95 and 0.99 in different runs.

* Measurement instrument type description. Mass spectrometers of models: microflex LT/SH, microflex LRF, autoflex speed LIN, autoflex speed LRF, autoflex speed TOF/TOF, ultrafleXtreme, rapifleX MALDI TOF MS, rapifleX MALDI TOF/TOF. Federal Agency on Technical Regulating and Metrology of Russia, 2021.

Several metrics were calculated in order to assess the quality of peak grouping attained by the proposed algorithm. These metrics included the share of peaks assigned to the correct isotopic series, the shares of type I and type II errors, precision and recall. The total number of peaks that fell into preliminary clusters containing the most intense peaks of identified isotopic series was used as the base for calculating the first of these metrics. Within the scope of the task solved, a type I error (false-positive decision) means inclusion of an experimental signal into an isotopic series to which this signal does not actually belong. Accordingly, a type II error (false-negative decision) stands for exclusion of a peak that is part of an isotopic series from that series. Precision and recall were calculated in the manner traditional for machine learning [19]. Several additional special metrics were calculated in order to assess the impact of the chi-square test. The detailed results are presented in the following section of the paper.
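The relationship between these metrics can be illustrated with a short Python sketch that uses the conventional definitions of precision and recall. The function name is hypothetical, and the counts in the example are not reported in the paper: they are back-calculated from the percentages given in Table 1 for the complete algorithm at cl = 0.9 (124 peaks) and serve only as a consistency check.

```python
def grouping_metrics(correct, type_one, type_two):
    """Percent metrics from counts of correctly grouped peaks, type I errors
    (false positives) and type II errors (false negatives)."""
    total = correct + type_one + type_two
    return {
        "share_correct": 100.0 * correct / total,
        "share_type_one": 100.0 * type_one / total,
        "share_type_two": 100.0 * type_two / total,
        "precision": 100.0 * correct / (correct + type_one),
        "recall": 100.0 * correct / (correct + type_two),
    }


# Assumed counts (back-calculated, see the lead-in): 112 + 3 + 9 = 124 peaks.
metrics = grouping_metrics(correct=112, type_one=3, type_two=9)
rounded = {name: round(value, 2) for name, value in metrics.items()}
```

Under these assumed counts the rounded values agree with the corresponding column of Table 1, which suggests that precision and recall were computed over peaks rather than over whole series.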

Results and discussion. Quality of obtained results. The metrics reflecting the quality of grouping peaks into isotopic series by means of proposed algorithm are presented in Table 1. In order to assess the effect from joint usage of DBSCAN clustering, chi-square test and "naive" grouping, the same metrics were also calculated for grouping results obtained by application of DBSCAN algorithm separately and of "naive" grouping separately.

Table 1

Quality metrics for the proposed algorithm and its individual components

| Metrics | DBSCAN | "Naive" grouping | Complete algorithm, cl = 0.9 | Complete algorithm, cl = 0.95 | Complete algorithm, cl = 0.99 |
|---|---|---|---|---|---|
| Number of peaks that fell into clusters containing putative series | 161 | 123 | 124 | 124 | 124 |
| Share of peaks assigned to correct isotopic series, % | 73.29 | 85.37 | 90.32 | 90.32 | 91.13 |
| Share of type I errors, % | 26.71 | 1.63 | 2.42 | 2.42 | 2.42 |
| Share of type II errors, % | 0.00 | 13.01 | 7.26 | 7.26 | 6.45 |
| Precision, % | 73.29 | 98.13 | 97.39 | 97.39 | 97.41 |
| Recall, % | 100.00 | 86.78 | 92.56 | 92.56 | 93.39 |
| Absolute difference of "precision-recall", % | 26.71 | 11.35 | 4.83 | 4.83 | 4.03 |
| Number of isotopic series identified manually | 31 | 31 | 31 | 31 | 31 |
| Share of series detected fully correctly, % | 58.06 | 51.61 | 67.74 | 67.74 | 70.97 |

The algorithm correctly assigned circa 90 % of all peaks to the same isotopic series as expected. Around 70 % of manually identified isotopic peak series were detected fully correctly. The selection of confidence level for calculation of chi-square critical value did not impact the results significantly.

The share of peaks assigned to correct isotopic series and share of isotopic peaks series detected fully correctly were lower for results obtained by applying only DBSCAN algorithm or only "naive" grouping method and amounted to around 73 / 58 % and 85 / 52 %, respectively. Thus, joint application of DBSCAN clustering algorithm, chi-square statistic test and "naive" grouping improves the results of grouping peaks into isotopic series when compared with usage of a single linear model.

We shall note that, among all the used methods of peak grouping, only application of the proposed algorithm in its complete form allowed reaching both relatively high (97 and 93 %, respectively) and most balanced precision and recall values. This indicates that shares of false-positive and false-negative decisions made by algorithm are low and approximately equal. In comparison, grouping of peaks into isotopic series by means of only DBSCAN algorithm allows obtaining a recall of 100 % (i.e., no type II errors). However, this clustering algorithm is not capable of detecting isotopic peak series in noisy data or handling overlapping series by its nature. All the noise and overlapping signals would be incorrectly grouped into one cluster, which will cause multiple type I errors. In practice this results in very low (73 %) precision of peak grouping when DBSCAN algorithm is used separately. Similarly, application of only "naive" grouping results in slightly greater precision when compared with the complete proposed algorithm. However, the model based purely on calculation of distance is prone to false-negative decisions in cases when subsequent isotopic peak lies out of its expected search range along m/z axis. As the result, the recall value attained by "naive" method is rather low and amounts only to 87 %.

Application of the proposed algorithm for grouping peaks into isotopic series achieves a compromise between the shares of type I and type II errors due to the use of the chi-square test and the processing of each preliminary cluster depending on the result of this test. Table 2 shows that from 29 to 33 % of identified isotopic peak series, which were already grouped fully correctly at clustering stage, would have been incorrectly ungrouped by the "naive" method. However, the distribution of peak intensities in such series matches the expected one, which allowed those series to pass the statistical test and prevented incorrect processing. Depending on the selected confidence level, 43 to 47 % of all conducted chi-square tests had a positive impact on the result of peak grouping, and in a further 33 % of cases the test did not harm the quality. It should be separately noted that preliminary clusters containing multiple noise signals (common for mass spectrum 3) did not pass the test and were correctly passed to the "naive" grouping step for a subsequent search for isotopic peak series.

Precision and recall metrics allow the quality of peak grouping achieved by the proposed algorithm to be compared with that of other existing approaches. The presented algorithm yields a precision similar to that of the linear model described in [6] (97.4 % against 97.7 %, respectively) and, formally, a greater recall (92.6 % against 70.4 %, respectively). However, we note that the recall reported in [6] is suspiciously low (this might have been caused by some peculiarities of the methodology used to count false-negative decisions) and requires clarification. When compared with the machine learning-based approach described in [7], the proposed algorithm, as expected, demonstrated a lower quality of results (precision of 99.5 % for model [7] against 97.4 % for the proposed algorithm; recall of 99.9 % and 92.6 %, respectively). In exchange, the proposed algorithm does not require generating a training set that depends on external sources in order to tune model parameters.

Table 2

Detailed results of chi-square test usage within the proposed algorithm

| Metrics | Confidence level 0.9 | Confidence level 0.95 | Confidence level 0.99 |
|---|---|---|---|
| Number of preliminary clusters containing isotopic peak series with expected distribution of intensities | 21 | 21 | 21 |
| Share of preliminary clusters that passed chi-square test, %, including: | 57.14 | 61.90 | 66.67 |
| (I) were correctly grouped during preliminary clustering and would not have been ungrouped by "naive" method | 28.57 | 33.33 | 33.33 |
| (II) would have been ungrouped correctly | 4.76 | 4.76 | 4.76 |
| (III) would have been ungrouped incorrectly | 23.81 | 23.81 | 28.57 |
| Share of preliminary clusters that failed chi-square test, %, including: | 42.86 | 38.10 | 33.33 |
| (IV) were correctly grouped during preliminary clustering but were incorrectly ungrouped | 4.76 | 0 | 0 |
| (V) were correctly ungrouped | 19.05 | 19.05 | 19.05 |
| (VI) were incorrectly ungrouped | 19.05 | 19.05 | 14.29 |
| Share of "useful" chi-square test applications (III + V), % | 42.86 | 42.86 | 47.62 |
| Share of "harmless" chi-square test applications (I + IV), % | 33.33 | 33.33 | 33.33 |
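The aggregate rows of Table 2 can be recomputed from the sub-rows (I) to (VI). The short check below is only an arithmetic illustration: the shares are copied from the table, while the dictionary layout and the helper name are assumptions made for this sketch.

```python
# Shares (%) copied from Table 2; one value per confidence level (0.9, 0.95, 0.99).
TABLE_2 = {
    "I": (28.57, 33.33, 33.33),
    "II": (4.76, 4.76, 4.76),
    "III": (23.81, 23.81, 28.57),
    "IV": (4.76, 0.0, 0.0),
    "V": (19.05, 19.05, 19.05),
    "VI": (19.05, 19.05, 14.29),
}


def column_sum(rows, level_index):
    """Sum the shares of the given sub-rows for one confidence level."""
    return sum(TABLE_2[row][level_index] for row in rows)


for i, level in enumerate((0.9, 0.95, 0.99)):
    passed = column_sum(("I", "II", "III"), i)
    failed = column_sum(("IV", "V", "VI"), i)
    useful = column_sum(("III", "V"), i)
    harmless = column_sum(("I", "IV"), i)
    print(f"level {level}: passed={passed:.2f} failed={failed:.2f} "
          f"useful={useful:.2f} harmless={harmless:.2f}")
```

Up to rounding, the passed and failed shares add up to 100 % at every confidence level, and each individual share is a multiple of 1/21 (about 4.76 %), consistent with the 21 preliminary clusters listed in the table.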

Generalization of the algorithm for various compound classes. This section briefly describes how the parameter values should be adjusted so that the algorithm can be used with mass spectra of various compound classes.

The exact values of the dimensionality reduction and peak detection parameters do not impose technical limitations on the ability to run the algorithm. However, if "soft" peak detection settings cause all consecutive peaks in the resulting peak list to be separated by a distance smaller than the ε parameter of the DBSCAN algorithm, the grouping will be conducted only by the "naive" method, since all detected peaks will fall into one preliminary cluster, which is unlikely to pass the chi-square test. Thus, "soft" dimensionality reduction and peak detection parameter values should be avoided.
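The collapse into a single preliminary cluster can be illustrated with a simplified one-dimensional stand-in for DBSCAN. The sketch below is not the implementation used in this work: it splits a sorted peak list wherever the gap between neighbours exceeds the neighbourhood radius, which corresponds to DBSCAN with a minimum cluster size of one point on 1D data. The function name and the m/z values are hypothetical.

```python
def gap_clusters(mz_values, eps):
    """Split 1D m/z values into clusters wherever a neighbour gap exceeds eps."""
    ordered = sorted(mz_values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev <= eps:
            clusters[-1].append(cur)   # close enough: same preliminary cluster
        else:
            clusters.append([cur])     # gap too large: start a new cluster
    return clusters


# Strict peak detection: two well-separated groups of peaks (hypothetical values).
strict_peaks = [142, 144, 146, 186, 188, 190]
# "Soft" peak detection: spurious peaks fill the region between the groups.
soft_peaks = list(range(142, 191, 2))

print(len(gap_clusters(strict_peaks, eps=3)))  # two preliminary clusters
print(len(gap_clusters(soft_peaks, eps=3)))    # everything merges into one
```

In the second case no gap exceeds the radius, so every peak ends up in the same cluster and the clustering step carries no information.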

The value of the ε parameter of the DBSCAN algorithm should be selected by computing the maximum mass difference between two consecutive naturally occurring isotopes among all chemical elements that are expected to be found in the examined compound. Elements for which this difference amounts to approximately 1, 2 and 3 Da are currently known. To account for all possible measurement errors, we suggest setting ε to 2, 3 or 4, depending on the expected composition of the studied compound. For example, ε = 2 is suitable for relatively simple organic compounds, ε = 3 is suitable for chlorides and bromides of monoisotopic metals, etc.
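The selection rule above can be sketched as a small lookup. The nominal mass numbers of stable isotopes in the dictionary are well-known values; the function names and the margin of 1 Da added on top of the largest gap are assumptions of this sketch, chosen so that the result agrees with the values suggested in the text.

```python
# Nominal mass numbers of naturally occurring stable isotopes for a few elements.
ISOTOPES = {
    "H": (1, 2),
    "C": (12, 13),
    "N": (14, 15),
    "O": (16, 17, 18),
    "Cl": (35, 37),
    "Br": (79, 81),
    "Ag": (107, 109),
}


def max_isotope_gap(elements):
    """Largest mass difference between consecutive isotopes of the given elements."""
    gaps = []
    for element in elements:
        masses = sorted(ISOTOPES[element])
        gaps.extend(b - a for a, b in zip(masses, masses[1:]))
    return max(gaps)


def suggest_eps(elements, margin=1):
    """Neighbourhood radius: largest isotope gap plus a margin (assumed 1 Da)."""
    return max_isotope_gap(elements) + margin


print(suggest_eps(["C", "H", "N", "O"]))  # simple organic compound
print(suggest_eps(["Ag", "Cl"]))          # silver(I) chloride
```

For elements with only 1 Da gaps the sketch returns 2, and for chlorine- or bromine-containing ions it returns 3, in line with the examples given above.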

The most important step in adjusting the algorithm's parameters is the selection of the intensity distribution pattern, which is used to compare the distribution of intensities observed in a preliminary cluster with the distribution that a valid isotopic series is assumed to have. Selecting such a pattern requires expert judgment. For example, the probability function of a geometric distribution might be used to approximate the distribution of mass peak intensities of organic compounds that contain only C, H, N, O, S and P. A mixture of distributions could also be used for approximation. For instance, the distribution of mass peak intensities of lead chlorides could be modeled by a mixture of two normal distributions centered at the m/z values of the peaks with the largest contributions from the 206Pb and 208Pb isotopes. In practice, the distributions of mass peak intensities observed for registered ions usually match one of the patterns from a finite list [20]. Thus, once the approximation law is formulated, it can be reused.
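As an illustration of comparing observed intensities with a selected pattern, the sketch below computes Pearson's chi-square statistic against a truncated geometric pattern. It rests on several assumptions that are not taken from this paper: the success probability p = 0.5, the intensity values, the rescaling of the pattern to the total observed intensity, and the use of n - 1 degrees of freedom. The critical values are the standard upper 5 % points of the chi-square distribution. Note that the statistic depends on the scale of the intensities, so the scaling used in practice matters.

```python
def geometric_pattern(p, n_peaks):
    """Geometric probabilities p*(1-p)**k for k = 0..n_peaks-1, renormalised to 1."""
    raw = [p * (1 - p) ** k for k in range(n_peaks)]
    total = sum(raw)
    return [x / total for x in raw]


def chi_square_statistic(observed, pattern):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    total = sum(observed)
    expected = [total * q for q in pattern]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Upper critical values of the chi-square distribution at the 0.95 level, by df.
CRITICAL_095 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def matches_pattern(observed, pattern):
    """True if the observed intensities do not contradict the pattern at 0.95."""
    df = len(observed) - 1  # assumes no pattern parameters were fitted to the data
    return chi_square_statistic(observed, pattern) <= CRITICAL_095[df]


pattern = geometric_pattern(p=0.5, n_peaks=3)
print(matches_pattern([50, 30, 20], pattern))  # decaying intensities
print(matches_pattern([10, 10, 80], pattern))  # intensities growing with m/z
```

A mixture pattern, such as the two-component model mentioned for lead chlorides, could be plugged in by replacing `geometric_pattern` with any function that returns probabilities summing to one.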

Finally, as discussed previously, the value of the d(m/z) parameter required by the "naive" grouping step should be set to 1, 2 or 3, depending on the expected composition of the analyzed compound. The maximum tolerable distance deviation eps does not depend on the nature of the analyte and should be selected on the basis of the specifications listed in the technical documentation of the mass spectrometer used.
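The role of the two parameters of the "naive" step can be sketched as follows. This is one plausible reading rather than the exact rule used in this work: a peak is attached to the current series if its distance from the previous peak is within eps of an integer between 1 and the distance parameter. The function name, the m/z values and the tolerance are hypothetical.

```python
def naive_group(mz_values, d_mz, eps):
    """Group sorted peaks into series by the distance between consecutive peaks.

    d_mz: largest expected spacing between isotopic peaks (1, 2 or 3)
    eps:  maximum tolerable deviation of the measured spacing
    """
    ordered = sorted(mz_values)
    if not ordered:
        return []
    series = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        # Accept the peak if the gap is close to any allowed integer spacing.
        if any(abs(gap - k) <= eps for k in range(1, d_mz + 1)):
            series[-1].append(cur)
        else:
            series.append([cur])
    return series


peaks = [141.9, 143.9, 145.9, 150.2, 151.2]  # hypothetical peak positions
print(naive_group(peaks, d_mz=2, eps=0.2))
```

A peak whose spacing deviates by more than eps from every allowed value starts a new series, which is how a shifted isotopic peak leads to the false-negative decisions of the distance-only model.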

Conclusion. An algorithm for grouping peaks detected in low-resolution mass spectra into isotopic series was presented in this paper. The algorithm is based on the joint application of DBSCAN clustering, the chi-square test and a linear model that calculates distances between consecutive peaks. When used to process experimental mass spectra of silver(I) chloride and silver(I) bromide, the proposed algorithm showed relative insensitivity to several phenomena that commonly occur in low-resolution mass spectra and hinder automatic detection of isotopic peak series by known approaches. The algorithm demonstrated precision and recall values similar to those of existing linear models used to process high-resolution mass spectra. At the same time, the proposed algorithm does not use empirical rules valid only for a certain compound class when deciding whether a peak belongs to a putative isotopic series. In comparison with peak grouping models based on machine learning, the proposed algorithm demonstrates moderately lower quality of results in exchange for dropping the dependence on external data and pre-training. The most significant limitation of the proposed algorithm is its inability to correctly group peaks originating from ions that are composed only of certain elements. The algorithm is also characterized by one feature which, in our opinion, is neither an advantage nor a limitation. For the algorithm to be used effectively with mass spectra of compounds belonging to a class not studied previously, the law approximating the distribution of mass peak intensities needs to be formulated. On the one hand, this feature implies additional time costs when mass spectra of different compound classes need to be processed. On the other hand, it allows the proposed algorithm to be adapted easily to mass spectra of various compound classes without changing its general logic. Existing linear models used for processing high-resolution mass spectra are highly selective to input data and do not provide such scaling capabilities. Thus, application of the proposed algorithm seems appropriate in cases when multiple compounds of the same class, for which a law approximating the mass peak intensities is known, are analyzed using a low-resolution mass spectrometer.


REFERENCES

[1] Bauer C., Cramer R., Schuchhardt J. Evaluation of peak-picking algorithms for protein mass spectrometry. In: Hamacher M., Eisenacher M., Stephan C. (eds). Data Mining in Proteomics. Methods in Molecular Biology, vol. 696. Humana Press, 2011, pp. 341-352. DOI: https://doi.org/10.1007/978-1-60761-987-1_22

[2] Yang C., He Z., Yu W. Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis. BMC Bioinformatics, 2009, vol. 10, art. no. 4. DOI: https://doi.org/10.1186/1471-2105-10-4

[3] Jurasek P., Slimak M., Kosik M. Determination of isotope cluster patterns in mass spectra of GC-MS Analyses by a chemometric detector. Microchim. Acta, 1993, vol. 110, pp. 133-142. DOI: https://doi.org/10.1007/BF01245097

[4] Treutler H., Neumann S. Prediction, detection, and validation of isotope clusters in mass spectrometry data. Metabolites, 2016, vol. 6, iss. 4. DOI: https://doi.org/10.3390/metabo6040037

[5] Teo G.C., Polasky D.A., Yu F., et al. Fast deisotoping algorithm and its implementation in the MSFragger search engine. J. Proteome Res., 2021, vol. 20, iss. 1, pp. 498-505. DOI: https://doi.org/10.1021/acs.jproteome.0c00544

[6] Tay A.P., Liang A., Hamey J.J., et al. MS2-Deisotoper: a tool for deisotoping high-resolution MS/MS spectra in normal and heavy isotope-labelled samples. Proteomics, 2019, vol. 19, iss. 17, art. 1800444. DOI: https://doi.org/10.1002/pmic.201800444

[7] Boiko D.A., Kozlov K.S., Burykina J.V., et al. Fully automated unconstrained analysis of high-resolution mass spectrometry data with machine learning. J. Am. Chem. Soc., 2022, vol. 144, iss. 32, pp. 14590-14606. DOI: https://doi.org/10.1021/jacs.2c03631

[8] Brenton A.G., Godfrey A.R. Accurate mass measurement: terminology and treatment of data. J. Am. Soc. Mass Spectr., 2010, vol. 21, iss. 11, pp. 1821-1835. DOI: https://doi.org/10.1016/j.jasms.2010.06.006

[9] Urban J., Afseth N.K., Stys D. Fundamental definitions and confusions in mass spectrometry about mass assignment, centroiding and resolution. TrAC, 2014, vol. 53, pp. 126-136. DOI: https://doi.org/10.1016/j.trac.2013.07.010

[10] Gibb S., Strimmer K. MALDIquant: a versatile R package for the analysis of mass spectrometry data. Bioinformatics, 2012, vol. 28, iss. 17, pp. 2270-2271. DOI: https://doi.org/10.1093/bioinformatics/bts447

[11] Li X., Gentleman R., Shi Q., et al. SELDI-TOF mass spectrometry protein data. In: Gentleman R., Carey V.J., Huber W., Irizarry R.A., Dudoit S. (eds). Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Statistics for Biology and Health. New York, Springer New York, 2005, pp. 91-109. DOI: https://doi.org/10.1007/0-387-29362-0_6

[12] Ester M., Kriegel H.-P., Sander J., et al. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD-96 Proc. AAAI Press, 1996, pp. 226-231.

[13] Wei X., Shi X., Kim S., et al. Data dependent peak model based spectrum deconvolution for analysis of high resolution LC-MS data. Anal. Chem., 2014, vol. 86, iss. 4, pp. 2156-2165. DOI: https://doi.org/10.1021/ac403803a

[14] Pearson K. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1900, vol. 50, iss. 302, pp. 157-175. DOI: https://doi.org/10.1080/14786440009463897

[15] Lebedev V.V., Buryak A.K. Usage of Kohonen clustering algorithm for rough peak detection during mass spectrum preprocessing. Mass-spektrometria [Mass-Spectrometry], 2022, vol. 19, no. 3, pp. 137-148 (in Russ.). EDN: NVGYFO. DOI: https://doi.org/10.25703/MS.2022.19.15

[16] Pytskii I.S., Buryak A.K. MALDI/SELDI mass-spectrometric surface investigation of AMg-6 and Ad-0 materials. Prot. Met. Phys. Chem. Surf., 2011, vol. 47, iss. 6, pp. 756-761. DOI: https://doi.org/10.1134/S2070205111060165

[17] Lacki M.K., Startek M., Valkenborg D., et al. IsoSpec: hyperfast fine structure calculator. Anal. Chem., 2017, vol. 89, iss. 6, pp. 3272-3277. DOI: https://doi.org/10.1021/acs.analchem.6b01459

[18] R: A language and environment for statistical computing. R Foundation for Statistical Computing. Available at: https://www.R-project.org (accessed: 23.05.2023).

[19] Olson D.L., Delen D. Performance evaluation for predictive modeling. In: Advanced Data Mining Techniques. Berlin, Heidelberg, Springer, 2008, pp. 137-147. DOI: https://doi.org/10.1007/978-3-540-76917-0_9

[20] Goldfarb D., Lafferty M.J., Herring L.E., et al. Approximating isotope distributions of biomolecule fragments. ACS Omega, 2018, vol. 3, iss. 9, pp. 11383-11391. DOI: https://doi.org/10.1021/acsomega.8b01649

Lebedev V.V. — Junior Researcher, Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences (Leninskiy prospekt 31, korp. 4, Moscow, 119071 Russian Federation).

Pytskii I.S. — Cand. Sc. (Chem.), Leading Researcher, Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences (Leninskiy prospekt 31, korp. 4, Moscow, 119071 Russian Federation).

Buryak A.K. — Corresponding Member of the Russian Academy of Sciences, Dr. Sc. (Chem.), Director, Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences (Leninskiy prospekt 31, korp. 4, Moscow, 119071 Russian Federation).

Please cite this article as:

Lebedev V.V., Pytskii I.S., Buryak A.K. Detection of isotopic peak series in low-resolution mass spectra using clustering algorithm and chi-square test. Herald of the Bauman Moscow State Technical University, Series Natural Sciences, 2024, no. 2 (113), pp. 149-164. EDN: ODEQRN
