The correlation matrix (Figs 2, S1) demonstrates that spectra of different fragments of the same tissue sample correlate better with each other than with the spectra of fragments of other tissue samples. There are also white lines on the figure, which are associated with electrospray instability (these lines in the correlogram correspond to anomalous local maxima or minima in the total ion current).
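A matrix of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: the scans are taken to be rows of a NumPy array on a common m/z grid, the function name `cosine_matrix` and the toy profiles are hypothetical, and the cosine measure is used for the pairwise similarity.

```python
import numpy as np

def cosine_matrix(scans: np.ndarray) -> np.ndarray:
    """Pairwise cosine measure between the rows of `scans` (n_scans x n_bins)."""
    norms = np.linalg.norm(scans, axis=1, keepdims=True)
    # Replace zero norms by 1 so that empty scans give similarity 0, not NaN.
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit = scans / safe_norms
    return unit @ unit.T

# Toy data (hypothetical): two "samples", three noisy scans each.
rng = np.random.default_rng(0)
profile_a = np.array([10.0, 0.0, 5.0, 0.0, 1.0, 0.0])
profile_b = np.array([0.0, 8.0, 0.0, 6.0, 0.0, 2.0])
scans = np.vstack([
    profile_a + rng.uniform(0, 0.5, size=(3, 6)),
    profile_b + rng.uniform(0, 0.5, size=(3, 6)),
])

sim = cosine_matrix(scans)
within = sim[:3, :3].mean()   # similarity among scans of the first sample
between = sim[:3, 3:].mean()  # similarity between the two samples
print(f"within: {within:.3f}, between: {between:.3f}")
```

In such a matrix, scans of the same sample form bright diagonal blocks, while a scan recorded during a spray dropout appears as a row and column of low values.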
Pearson’s r coefficient and cosine measure
Pearson’s r coefficient is strictly equal to the cosine measure when the means of both vectors are equal to 0, as is evident from formulas (1) and (2). Therefore, since the average value of the peak intensities in a spectrum is close to zero, the cosine measure and Pearson’s r coefficient between two spectrum vectors are very similar (Fig. 3). In general, Pearson’s r coefficient may be negative, although this is unlikely in our case, as each scan is a vector of nonnegative values with a mean close to zero. Furthermore, in most scans the peaks share their positions, so Pearson’s r coefficient, even if it is negative, has a low absolute value that can be neglected. At the same time, the cosine measure between vectors with nonnegative coordinates is never negative. Since the cosine measure between the spectrum vectors varies between 0 and 1, as in practice do the Pearson’s r coefficients, it can be regarded as a metric of scan similarity. Calculating the cosine measure requires slightly fewer computations, because it is not necessary to calculate the mean value of each vector and subtract it from every coordinate before evaluating the numerator and denominator; therefore, the cosine measure is preferable when calculating similarities between vectors of high dimensionality.
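The relation between the two measures can be checked numerically. The sketch below assumes the standard definitions of both quantities (formulas (1) and (2) are not reproduced here) and uses hypothetical toy spectra: sparse nonnegative vectors with 20 peaks in 2000 bins, so that the mean intensity is close to zero relative to the peak heights.

```python
import numpy as np

def cosine_measure(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson's r, written as the cosine measure of the mean-centered vectors."""
    return cosine_measure(x - x.mean(), y - y.mean())

# Toy spectra (hypothetical): the same 20 peak positions in both scans,
# with peak intensities differing by up to 20 %.
rng = np.random.default_rng(1)
n_bins = 2000
peaks = rng.choice(n_bins, size=20, replace=False)
x = np.zeros(n_bins)
y = np.zeros(n_bins)
x[peaks] = rng.uniform(1, 100, size=20)
y[peaks] = x[peaks] * rng.uniform(0.8, 1.2, size=20)

cos_xy = cosine_measure(x, y)
r_xy = pearson_r(x, y)
print(f"cosine: {cos_xy:.4f}, Pearson r: {r_xy:.4f}")
```

The only extra work in `pearson_r` is the computation and subtraction of the two means, which is the saving referred to above for the cosine measure.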
The visualization of the correlation matrix clearly demonstrates that brain tissues from different patients with non-tumor pathology differ markedly from each other (due to natural biological variability), while also differing markedly from tumor tissues. Meanwhile, the tumor samples are quite similar among the patients but visibly differ across diagnoses (Fig. 3). Furthermore, stable changes in molecular profiles occurring during each measurement can be easily found in the correlation matrix. For example, the part of the matrix corresponding to the second meningioma sample demonstrates that the molecular profile changes at the end of the experiment, which could be explained by partial exhaustion of the sample during the investigation. However, the last scans of this sample correlate well with the initial scans of the first meningioma sample, meaning that these tumor samples are not homogeneous at the molecular level, although the histological examination describes them as such. Additionally, this demonstrates that NESI is sensitive to sample orientation, as exhaustion of the tissue differs between parts originating from one sample. Formally, exhaustion is not observed during the 5 minutes of spectra measurement. Thus, detection of such cases is vital for effective tissue classification, since features of distinct tissue parts may not be detected in proper proportions at each moment of the measurement; either splitting of spectra into scans or multilabel classification algorithms should be applied.
The fact that the spectrum is normalized to the total ion current, rather than to the maximum intensity of the spectrum, affects neither Pearson’s r coefficient nor the cosine measure, as follows from formulas (1) and (2): the cosine measure is independent of the vector length, and Pearson’s r coefficient does not depend on the normalization coefficient. Similarly, a proportional change in spectrum magnitude produces no change in the calculated Pearson’s r coefficients or cosine measures. Thus, this technique automatically treats as identical those spectra whose intensities differ only because of variations in the total ion current.
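This invariance can be written out explicitly. The short derivation below assumes the standard definitions of the two measures and uses notation introduced here for illustration only: x and y are spectrum vectors, and a, b > 0 are arbitrary scale factors (for example, a equal to the inverse of the total ion current or to the inverse of the maximum intensity).

```latex
\begin{align}
\cos(a\mathbf{x},\, b\mathbf{y})
  &= \frac{(a\mathbf{x})\cdot(b\mathbf{y})}
          {\lVert a\mathbf{x}\rVert \,\lVert b\mathbf{y}\rVert}
   = \frac{ab\,(\mathbf{x}\cdot\mathbf{y})}
          {ab\,\lVert \mathbf{x}\rVert \,\lVert \mathbf{y}\rVert}
   = \cos(\mathbf{x},\, \mathbf{y}), \\
r(a\mathbf{x},\, b\mathbf{y})
  &= \frac{\sum_i a\,(x_i-\bar{x})\; b\,(y_i-\bar{y})}
          {\sqrt{\sum_i a^2 (x_i-\bar{x})^2}\;\sqrt{\sum_i b^2 (y_i-\bar{y})^2}}
   = r(\mathbf{x},\, \mathbf{y}).
\end{align}
```

In both cases the factor ab cancels between the numerator and the denominator, so any positive rescaling of either spectrum leaves the similarity value unchanged.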
Spectra smoothing for measurement artifacts filtration
Mass spectrometric profiling of biological samples is usually based on the principle of assigning one specific profile (a mass spectrum) to a sample. In most protocols, this spectrum is simply the sum of the scans accumulated during one measurement. However, it can be improved by discarding the outlier scans that show various anomalies in the measurement process. For this purpose, the scans should be filtered and averaged only afterwards.
For long-term measurements, when multiple mass spectra (scans) are consecutively measured and analyzed, it is necessary to understand whether or not each particular scan should be taken into account. Most statistical and machine-learning algorithms are sensitive to outliers, which is why the detection and filtering of outliers is a necessary step in training classifiers. To detect outliers, the cosine measure between each scan and the smoothed aggregate of all scans in the spectrum has to be calculated; one can then evaluate whether the analyzed scan is a measurement artifact by applying a significance threshold to the calculated value.
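The filtering step described above can be sketched as follows. This is a minimal illustration rather than the exact procedure used in this study: the per-bin median over all scans stands in for the "smoothed aggregate", the threshold of 0.8 is an arbitrary hypothetical value, and the function name `flag_outlier_scans` is introduced here only for the example.

```python
import numpy as np

def flag_outlier_scans(scans: np.ndarray, threshold: float = 0.8):
    """Return (similarity, is_outlier) for each row of `scans`.

    The reference spectrum is the per-bin median over all scans (an assumed
    choice of aggregate); `threshold` is a hypothetical cut-off.
    """
    reference = np.median(scans, axis=0)
    denom = np.linalg.norm(scans, axis=1) * np.linalg.norm(reference)
    # Scans with zero norm get similarity 0 instead of a division error.
    similarity = np.divide(scans @ reference, denom,
                           out=np.zeros(len(scans)), where=denom > 0)
    return similarity, similarity < threshold

# Toy measurement (hypothetical): ten scans of one profile, one spray dropout.
rng = np.random.default_rng(2)
profile = np.array([20.0, 0.0, 10.0, 0.0, 5.0, 0.0, 0.0, 1.0])
scans = profile + rng.uniform(0, 0.5, size=(10, 8))
scans[4] = np.array([0.0, 0.3, 0.0, 0.2, 0.0, 0.4, 0.1, 0.0])  # noise only

similarity, is_outlier = flag_outlier_scans(scans)
clean_mean = scans[~is_outlier].mean(axis=0)  # average only the retained scans
print(np.flatnonzero(is_outlier))
```

The threshold should be chosen for the data at hand, for example from the distribution of similarity values in measurements known to be stable.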
The method of median smoothing involves, in a certain sense, adjusting the sensitivity of the obtained spectrum to the signal instability caused, in particular, by the instability of ESI-based sources. This approach makes the stability and reproducibility of this analytical method less sensitive to occasional “bad” scans. In a very unstable spray, a wide smoothing window should be selected for median smoothing to make spectra suitable for automated data processing. In the case when there are no outliers and the stability of the spray and spectra is high enough, the filtering step could be omitted to preserve data for detailed analysis of the ionization process or sample heterogeneity.
Four different windows for the moving median of 5, 7, 21 and 51 scans were considered in order to select the optimal size of the smoothing window (Figs 4, S2). Odd numbers were required to simplify the calculations of median values. The values of 5 and 7 were tested to estimate the number of single spectra with instabilities and to observe their influence on the final result.
As can be seen from Fig. 4B,C, the windows of 5 and 7 scans do not differ fundamentally from each other; however, neither filters the outliers (bands with low values) well enough. This is due to the fact that the ICR scans faster if there is no signal (no ions in the Penning trap). Thus, wider smoothing windows were chosen: 51 scans and an intermediate value of 21 scans. These values correspond to approximately one minute and 30 seconds of averaging, respectively. There is a big difference between the results obtained with windows of 7 and 21 scans, while there is no significant difference between those obtained with 21 and 51 scans. As expected, the spectra become more similar to each other as the window size increases. The 21-scan smoothing window removes all single outliers, as does the window with 51 scans, but the latter requires fewer calculations and makes little difference in the final correlation matrix.
We also analyzed the influence of the smoothing window step size on the final results: the first spectrum was obtained by smoothing over the first 51 scans, and each subsequent spectrum was smoothed over the 51 scans beginning from scan number (1 + step size), counted from the start of the previous window. This approach makes it possible to reduce the number of calculations, as well as to eliminate anomalous scans and spectra, without losing any information that characterizes the studied sample. Three step values were chosen: 1, 25 and 51 (Fig. 5).
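A moving median with a configurable window and step can be sketched as below. The function name `moving_median` and the toy data are hypothetical, and the handling of the end of the measurement (trailing scans that do not fill a complete window are dropped) is an assumption, since the treatment of boundaries is not specified here.

```python
import numpy as np

def moving_median(scans: np.ndarray, window: int = 51, step: int = 1) -> np.ndarray:
    """Per-bin median over `window` consecutive scans, advancing by `step` scans.

    Only complete windows are used (an assumed boundary rule).
    """
    n_scans = scans.shape[0]
    if window > n_scans:
        raise ValueError("window is longer than the number of scans")
    starts = range(0, n_scans - window + 1, step)
    return np.stack([np.median(scans[s:s + window], axis=0) for s in starts])

# Toy measurement (hypothetical): 153 scans, 4 bins, two empty "dropout" scans.
rng = np.random.default_rng(3)
profile = np.array([10.0, 5.0, 0.0, 1.0])
scans = profile + rng.uniform(0, 0.1, size=(153, 4))
scans[10] = 0.0
scans[77] = 0.0

smoothed = {step: moving_median(scans, window=51, step=step) for step in (1, 25, 51)}
for step, spectra in smoothed.items():
    print(step, spectra.shape)
```

With a step of 1 every scan position yields a smoothed spectrum, whereas a step equal to the window gives non-overlapping blocks and therefore the fewest spectra to compare afterwards.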
As follows from Fig. 5, a step of 1 is optimal from the point of view of the smoothness of the cosine measure matrix and the absence of abrupt changes from one smoothed scan to the next. However, it does not reduce the number of calculations. Visually, the difference between the selected steps is slight. A large smoothing step is optimal for solving problems related to the global grouping of different samples, while a small smoothing step allows the researcher to obtain more detailed information on spectrum changes during the analysis, which may be important when working with highly heterogeneous samples, or in methodological work to determine the optimal parameters and conditions of ambient ionization.
Thus, depending on the task and experimental conditions, it is possible to select the optimal window and smoothing step values. For our data and tasks regarding lipid tissue profiling, the use of a moving median with a smoothing window of 51 scans is optimal and corresponds to approximately one minute of actual sample measurement. However, this parameter should be re-optimized whenever the experimental conditions change.
Metrics application in direct tissue profiling experiments
The developed analysis method was used to estimate the selectivity, reproducibility and stability of lipid profiles measured using the NESI method. For these purposes, the data obtained from two samples with different diagnoses, each of which had been previously divided into three fragments, were analyzed (Fig. 6). It follows from the constructed matrix that the lipid profiles obtained from different fragments are highly reproducible within a single sample and that there are large differences between the samples; however, statistical analysis and the application of classifiers are still required for accurate tissue classification. For this reason, calculating the metrics with suitably selected averaging parameters makes it possible to select only reliable scans for further automated analysis.
A method for analyzing the stability and reproducibility of mass spectrometric data obtained as a result of the consecutive accumulation of a large number of mass spectra has been developed in this study. This method makes it possible to filter data for the purpose of further clustering and classification. A method of multidimensional data analysis has been proposed based on calculating the cosine measure between multidimensional vectors and Pearson’s r coefficient. The proposed analysis method uses median smoothing and facilitates the identification of groups of scans in one measurement that can be considered stable and characteristic of the sample. It also eliminates measurement artifacts from consideration, which is especially important for projects dealing with the automatic classification of samples based on mass spectrometric profiling methods. The visualization of cosine measure matrices can be used as a simple method for selecting optimal ionization and measurement conditions and for evaluating the stability and reproducibility of signals during experimental work using ambient ionization methods.