B-factor correlations: Sequence, structure, and experimental
We considered a high-resolution protein (2.25 Å) that is involved in amino acid catabolism, acyl-CoA dehydrogenase (1JQI), as an example case to examine the B-factor profiles and predicted contact maps using Seq-GNM. Coevolution analysis using direct coupling analysis (DCA) has been shown to recapitulate accurate structural contact maps for a wide range of proteins [21,23,24,27,31,45]. As expected, the contact maps of Seq-GNM and structural GNM are similar (Fig 1). In a comparison of their B-factor profiles, both Seq-GNM and structural GNM exhibit good agreement with observed B-factors, capturing flexible and rigid positions. Using evolutionary coupling (EC) values obtained from RaptorX, the correlation between the Seq-GNM and observed B-factors is 0.77, whereas the correlation between the structural GNM and observed B-factors is 0.57 (Fig 1A). Similarly, using EC values obtained by EVcouplings produced a correlation of 0.60 between the Seq-GNM and observed B-factors (Fig 1B). The scores obtained from EVcouplings are still reasonable, yet relatively lower correlations compared to those obtained by the RaptorX. This is likely due to the relatively noisy contact map predictions by EVcouplings compared to the more reliable contact maps produced by RaptorX (we think this is due to their inclusion of multiple orthology protein families) .
Fig 1. B-factors plots.
A plot of theoretical B-factors as calculated by our Seq-GNM (blue), the original GNM obtained from structure (orange), and observed experimental B-factors (black) for acyl-CoA dehydrogenase (PDB id: 1JQI) along with predicted contact maps by Seq-GNM (using a threshold score, shown as blue) and the contact map of the structure (using 10Å cut-off distance) (a) The Seq-GNM with values obtained from RaptorX produced a correlation of 0.56 with experiment, and 0.77 with the GNM obtained from structure. Moreover, the contact maps reveal the predicted contacts between the Seq-GNM and structural GNM approaches are remarkably similar. (b) The Seq-GNM that uses values obtained from EVcouplings produced a correlation of 0.60 with experiment, and 0.68 with the GNM obtained from structure. The B-factor obtained by applying GNM to the experimental structure yields a correlation of 0.57. The contact map captures the dominant contacts with noise coming from poorly predicted EVcouplings scores.
The Seq-GNM produces a correlation with crystallographic B-factors of 0.60, which is within the same range as those produced by the GNM from structure of 0.57. Moreover, theoretical B-factor profiles obtained from both methods were able to identify the catalytic sites on all of the proteins.
As a further test of the efficacy of the Seq-GNM, we superimposed the predicted B-factors onto the structures of three diverse proteins– 5′(3′)-deoxyribonucleotidase (2JAO), acyl protein thioesterase (1FJ2), and NADH-cytochrome b(5) reductase (1UMK)–to visually contrast the predicted B-factors with that of experiment. Fig 2 shows each protein color-coded according to their B-factor profile on a spectrum of blue–white–red, where blue represents the lowest B-factors (less mobility) and red represents the highest B-factors (more mobility). The left panel shows the experimental B-factors for each protein, while the right panel shows the theoretical values predicted by the Seq-GNM. We investigated whether secondary structure was a factor in how the B-factors were distributed across the protein, and if certain secondary structure domains would exhibit less agreement with experiment. In this context, the proteins were selected so that they had a variety of secondary structure components–2JAO contains primarily alpha helices, 1UMK is mainly composed of beta-sheets, and 1F2J is a combination of alpha helices and beta-sheets. For 2JAO, the exterior helices that are flexible (red) in the observed structure are all reproduced in the predicted structure. The one highly rigid (blue) helix in the observed structure was more flexible in the predicted structure but was still in overall agreement. There is a surprising amount of similarity between the observed and predicted structure of 1F2J, considering that it contains both alpha-helix and beta-sheet elements. Similarly, 1UMK showed good agreement, except for some miniscule differences. This gives further evidence that the magnitudes of residue fluctuations predicted by the Seq-GNM model is representative of the crystallographic B-factor profiles for many proteins.
Fig 2. Color-coded ribbon diagrams using experimental and theoretical B-factors obtained by Seq-GNM.
The observed crystallographic B-factors (left) and the predicted B-factors from the Seq-GNM superimposed on the structure. The three proteins selected–2JAO, 1F2J, and 1UMK–were high-resolution structures are better than 2.0 Å. The B-factors are color-coded according to their B-factor profile on a spectrum of blue–white–red where blue represents the lowest B-factors (less mobility) and red represents the highest B-factors (more mobility). The B-factor scores were converted to a percentile rank so that they could be compared across different proteins. Each protein was also rotated 180° so that both sides could be visualized and compared. Moreover, the proteins were selected so that they had a variety of secondary structure components–2JAO contains primarily alpha helices, 1UMK is mainly composed of beta-sheets, and 1F2J is a combination of alpha helices and beta-sheets.
In order to compare predicted B-factors with crystallographic B-factors, we extracted a subset of 39 structures that had a resolution better than 2.0Å to obtain more realistic crystallographic B-factors (unreliable B-factors are common for many PDB structures) [18,46]. The same cutoff of 2.0Å was used in an earlier study to compare GNM predicted B-factors with those determined by crystallography . For all 39 structures, the Seq-GNM (using EC values from RaptorX) and structure GNM were used to estimate their B-factors, which were then compared with the observed B-factors by calculating the correlation for each protein. The mean correlation coefficient for the Seq-GNM was 0.53 while the mean correlation coefficient for the structure GNM was 0.58. The correlation of 0.58 for structural GNM of our smaller data set is consistent with the findings of Kundu et al. where 113 high-resolution structures (resolution <2.0 Å) were used and, the mean correlation coefficient with observed B-factors was 0.59 .
As shown in Fig 3A, boxplot distributions reveal that correlations are not significantly different between the sequence and structure GNM (p = 0.055 in a student t-test). The structure GNM appears to perform only slightly better than the Seq-GNM. Fig 3B shows the same distribution separated into 10 individual bins of size 0.1. The overall shapes of the two distributions are similar, except for the exaggerated relative lower second peak of the Seq-GNM at 0.4. It should also be noted that for these cases where Seq-GNM had low correlations, the EC threshold could be tuned to yield much higher correlations. If this were done on a case-by-case basis, the overall correlation distributions would be even more similar. Thus, the EC threshold may be used as a tuning parameter to enhance the correlation coefficient for purposes of model optimization.
Fig 3. Comparison of B-factors obtained by GNM and Seq-GNM with experimental B-factors.
(a) Boxplot showing the correlation of predicted B-factors by the Seq-GNM with experimentally observed B-factors (blue) in comparison to that of the GNM obtained from structure (orange) for a subset of 39 structures with resolution better than 2.0 Å. (b) A distribution plot of the same correlations binned into 10 bins with sizes of 0.1. A student t-test revealed no significant difference between the two distributions (p = 0.055) indicating that the Seq-GNM is producing competitive results compared to the original GNM from structure. The mean correlation of the Seq-GNM is 0.53 while that of the GNM from structure is 0.58.
Interestingly, for the cases where predicted B-factors by Seq-GNM yielded significantly better correlations with the experimental B-factors than those obtained by GNM from structures, we observed that biological units of these proteins are assigned as oligomeric forms. While predicted B-factors obtained using Seq-GNM does not retain this information, it successfully predicts the experimentally low B-factor values of interface positions as shown for protein 5′(3′)-deoxyribonucleotidase (2JAO) and protein aldehyde Dehydrogenase 7A1 (2J6L) in Fig 4. It is indeed shown in earlier work of direct contact analysis that co-evolution can identify positions of protein interfaces and protein-protein interaction partners and successfully reconstruct protein complexes and interaction network [23,30,48]. Thus, it is not surprising to see that it yields good correlations with the experimental B-factors. Conversely, predicted B-factors from structure can only improve when the oligomeric structure is used for the GNM analysis.
Fig 4. Comparison of B-factors obtained from experiments, Seq-GNM, GNM from monomeric structure, and GNM from oligomeric structure.
B-factors are shown on the respective structures for (a) 5′(3′)-deoxyribonucleotidase (2JAO) and (b) Aldehyde Dehydrogenase 7A1 (2J6L). (a) The correlation of Seq-GNM to experimental B-factors is 0.83 while correlation of GNM B-factors obtained from monomer to experimental B-factors is 0.63. When dimer is used for GNM analysis the correlation of GNM B-factors obtained from monomer to experimental B-factors increase to 0.72. (b) The correlation of Seq-GNM to experimental B-factors is 0.61 while correlation of GNM B-factors obtained from monomer to experimental B-factors is 0.37. When tetramer is used for GNM analysis the correlation of GNM B-factors obtained from monomer to experimental B-factors increase to 0.76. The change in correlation for GNM between monomer and oligomer clearly shows the drawback for dependence on the crystal structure of biounits. However, Seq-GNM captures the interface B-factors correctly.
Even when using high-resolution X-ray structures, there is still some uncertainty about the realistic nature of crystallographic B-factors. For this reason, we thought a more plausible way to determine the efficacy of the Seq-GNM was to compare it directly with the structure GNM. The structure GNM is a robust method to describe thermal fluctuations in a protein, and in many cases, it performs as good or better than the ANM or MD [47,49]. We systematically evaluated the performance of the Seq-GNM and structure GNM for the entire set of 139 structures and obtained the correlation coefficients for each protein (Fig 5).
Fig 5. Distribution of correlation coefficients.
The distribution of correlation coefficients between B-factors from Seq-GNM and GNM from structure. (a) The average correlation coefficient is 0.63 with RaptorX EC values. (b) The average correlation coefficient is 0.43 by using EVcouplings EC values.
The average correlation of B-factors between the Seq-GNM and structure GNM model is 0.63 when using EC contacts from RaptorX and 0.43 when using contacts from EVcouplings. As seen in Fig 5A, the distribution of correlation coefficients increases until 0.8, and then subsequently decreases. Interestingly, there are still an appreciable number of sequences yielding high correlations from 0.8 to 1.0. A distinguishing feature of the distribution is the pronounced peak in the bin from 0.7 to 0.8, indicating that significant fraction of our data set yields high correlations between 0.7 and 0.8. This is evidence that the Seq-GNM is efficiently capturing protein dynamics and supports the theory that ECs can be used as a substitute to 3D structure contacts in the GNM and still produce reliable dynamics profiles. The results of Seq-GNM based on contacts predicted by RaptorX usually yields B-factors that are closer to experimental B-factors as it uses structural information in its neural networks leading to better EC values and correlations with structure .
Assessing nSNV phenotypes using the Seq-GNM
Crystallographic B-factors have previously been used to assess the impact of nSNVs on protein function [18,50–54]. A study  found that mutations on lysozyme that impaired function exhibited lower than average temperature factors, suggesting that rigid sites on the protein are more susceptible to destabilizing nSNVs than flexible sites . Another study revealed a relationship between crystallographic B-factors and the impact of nSNVs on protein function . A commonly used tool to diagnose neutral and disease associated nSNVs, PolyPhen-2, uses evolutionary information, structural information, and crystallographic B-factors in its prediction model . These studies indicate that crystallographic B-factors can be used to predict the tolerance of a given residue to an nSNV (i.e., whether or not the occurrence of an nSNV would impact function).
We investigated whether B-factors predicted by the Seq-GNM were indicative of biological phenotype for nSNVs in the human population. A total of 738 nSNVs were mapped to the 139 enzymes, where 436 are disease-associated and 302 are neutral. S1 Table shows the number of disease and neutral nSNVs that occur on each protein. The Seq-GNM (using EC contacts from RaptorX and EVcouplings) was computed systematically for all 139 enzymes to obtain their dynamics profiles. The theoretical B-factors scores were converted into a percentile rank so that the values could be compared across different proteins.
We initially looked at two human enzymes, human lysozyme (PDB: 1C7P) and human cytochrome reductase (PDB: 1UMK). They were chosen because they were short proteins that each contain a disease and neutral nSNV. Human lysozyme is a glycoside hydrolase that functions in the immune system by causing damage to cell walls of bacteria. Human cytochrome b5 reductase is involved in many oxidation/reduction reactions including converting methemoglobin to hemoglobin .
Each structure is color-coded according to its theoretical B-factor profile on a spectrum of blue–white–red. Sites that exhibit high mobility (flexible) are red, and sites that have low mobility (rigid) are blue. Regions that are characterized by low mobility are usually important for maintaining stability and function, thus a mutation could act to destabilize the protein and impair its function. Fig 6A show the disease mutation I56T occurring on a rigid site with a B-factor of 0.0075. The neutral mutation T70N has a B-factor of 0.96 indicating that it is a highly mobile site. Both I56T and T70N occur on loop regions. Although loops are generally more flexible, three alpha-helical domains encompass the loop containing I56T, which implies that it may be involved in interactions that contribute to stabilizing the functional conformation. Thus, the I56T mutation may disrupt these critical interactions and impair the enzymatic function. In the case of cytochrome reductase (Fig 6B), the disease mutation R57Q is also on a rigid site with a B-factor of 0.14. Instead of being located near the core, R57Q is highly exposed protruding outwardly from a beta-barrel. However, since beta-barrels often harbor functional residues, the R57Q mutation may disrupt certain interactions critical for modulating function. The neutral mutation T116S is located on a loop and has a B-factor of 0.96, indicating that is it has a high mobility. In our earlier proteome wide study of over 100 human protein structures, we have shown that sites that are highly flexible (e.g., loop regions, or superficial sites) are typically more robust to mutations. Conversely, rigid sites are more susceptible to mutations that may disrupt function [18,19]. For these two cases, the B-factors produced by Seq-GNM successfully distinguished between the disease and neutral nSNVs, without using the 3D structures.
Fig 6. Comparison of theoretical B-factors on disease versus neutral mutant sites.
A ribbon diagram for two human enzymes, human lysozyme (a) and cytochrome reductase (b) colored according to their predicted B-factors by the Seq-GNM. Red indicates high mobility sites, and blue indicates low mobility sites. Each protein contains two known nSNVs. I56T and R57Q are disease-associated, and they occur on low mobility (rigid) sites. Conversely, the neutral nSNVs T116S and T70N occur on high mobility sites.
These findings prompted us to analyze the proteome-wide set of 139 enzymes to determine if the B-factors were indicative of phenotype for all 436 disease and 302 neutral nSNVs. The raw B-factor values were converted into a percentile rank (%B-factor) and then binned into 5 bins of size 0.2. We computed the observed-to-expected ratio of B-factors, where the expected values were based on the B-factor distribution of all 51,618 sites across all 139 proteins, and the observed values were based on the B-factors of the 436 disease sites. The same process was done for the 302 neutral nSNVs. Under the null hypothesis that predicted B-factor of the disease associated nSNVs yields similar distribution of all the positions gathered from 139 enzyme sequences, the ratio of expected and observed sites harboring disease mutations for each %B-factor bin should be close to 1, which would imply that B-factor does not distinguish sites that are prone to disease. This is the null hypothesis that disease sites are distributed uniformly between sites with low and high mobility. However, the null hypothesis was rejected for the 436 disease nSNVs (p <0.001). Fig 7 shows the observed-to-expected ratio plot of disease and neutral nSNVs, which indicates that disease nSNVs are overabundant at low %B-factor sites (<0.4) and under abundant at high %B-factor sites. Conversely, neutral nSNVs are overabundant at high %B-factor sites (>0.6) and under abundant at low %B-factor sites. This evidence suggests that the occurrence of an nSNV on a site with a low B-factor is likely damaging based on the position irrelative of the substitution. This is in agreement with our previous proteome-wide study showing that substitutions at rigid sites are more often associated with diseases . Conversely, an nSNV on a high B-factor site is usually benign. Low B-factors usually signify a residue that is crucial for modulating functional motions (e.g., a hinge). Thus, mutations on these sites can severely impact function. High B-factor sites are more flexible (e.g., loops) and more robust to mutations. Fig 7 suggest that it is possible to use the predicted B-factors to discriminate between disease and neutral nSNVs using co-evolution obtained from only multiple sequence alignment. Moreover, it can be used as an additional feature for in silico predictions .
Fig 7. Observed to expected ratio plots for disease and neutral nSNVs.
The relationship of observed-to-expected numbers between 436 disease nSNVs (red) and 302 neutral nSNVs (blue) from 139 human enzymes. The %B-factor scores derived from the Seq-GNM are binned into 5 bins of size 0.2.
Predictive models were created using logistic regression as the classification algorithm, 80% of the data was used for training and 20% for testing for 10 randomized sets. Models were evaluated based on ROC curves and their respective area under curve (AUC), the best performance is labeled as AUC_max and average performance as AUC. Theoretical B-factors obtained by Seq-GNM, experimental B-factors, and evolutionary parameters were used as predictive variables for training and testing (Fig 8). Seq-GNM and experimental B-factors have similar performance (maximum AUC of best 0.76 and 0.75, respectively), with Seq-GNM overshadowing experimental B-factors on average (AUC of 0.69 and 0.60, respectively). The ~0.70 AUC of B-factors obtained from Seq-GNM is impressive, as it has been shown that majority of state-of-art methods also yields similar AUC in independent tests [5,13]. Moreover, incorporation of Seq-GNM as an additional feature with evolutionary parameters resulted in higher prediction performance. While the AUC scores obtained using the evolutionary features for classification gives 0.76, this is increased to 0.81 after including the B-factors of Seq-GNM (Fig 8C and 8D). This result also demonstrates the efficacy of Seq-GNM in disease prediction as a complementary metric to other metrics used as features in classifiers.
Fig 8. ROC curves for disease prediction performance comparing Seq-GNM, experimental B-factors and evolutionary parameters.
ROC curves are plotted using 10 randomly selected training and testing data sets using 80%, and 20% of the data, respectively. (a) ROC curve of Seq-GNM. (b) ROC curve of experimental B-factors. (c) ROC curve of evolutionary parameters, where primate, mammal, and vertebrate fitch rates using Fitch Algorithm ; and Entropy2 are used as features for training. (d) ROC curve of evolutionary parameters used in (c) with the addition of Seq-GNM.
We also compared the performance of Seq-GNM with common in silico prediction tools like Polymorphism Phenotyping v2 (PolyPhen-2), and Sorting Intolerant from Tolerant (SIFT) [6,58]. The accuracy, sensitivity, and selectivity of disease predictions for nSNVs with experimental B-factors, B-factors from SIFT, PolyPhen-2, evolutionary parameters, and Seq-GNM are tabulated in Table 1. The accuracy of Seq-GNM using both EC values from EVcouplings and RaptorX is ~0.70. This accuracy is similar to using experimental B-factors for prediction (0.69) and also very close to prediction with evolutionary parameters (0.75), suggesting that Seq-GNM allows us to incorporate protein dynamics in nSNV predictions when the 3D experimental structures are not available. Moreover, accuracy of Seq-GNM approach is greater than SIFT (0.65) and PolyPhen-2 (0.64). Interestingly, Seq-GNM obtained by EVcouplings and RaptorX yields similar accuracies indicating that evolutionary couplings without the inclusion of structure could be utilized to predict B-factors to include as a feature to in silico prediction tools. Seq-GNM sensitivity (~0.90) surpasses other methods (0.80 for SIFT, 0.63 for PolyPhen-2, and 0.85 for evolutionary parameters), but it has a shortcoming in selectivity (~0.36) as other methods reach higher (~0.59). Conversely, training Seq-GNM combined with evolutionary parameters enhances the selectivity (0.66) to its highest value compared to others. Seq-GNM with evolutionary parameters predicted disease related nSNVs with accuracy 0.78 and sensitivity of 0.84, reaching beyond predictions of other metrics solely. These results suggest the incorporation of Seq-GNM with other prediction metrics can augment accuracy, sensitivity, and selectivity of prediction.
Prediction accuracy of Seq-GNM is further tested using 323 nSNVs (187 disease-associated, 136 neutral) of 22 proteins where their 3D experimental structures are not available (S2 Table). We used the trained classifier model of Seq-GNM B-factors for this test. While the B-factors obtained solely from Seq-GNM are used, it reached an accuracy, sensitivity, and selectivity of 0.82, 0.82, 0.83, respectively. This result further suggests that Seq-GNM allows us to incorporate protein dynamics as additional feature in in silico prediction tools without a known 3D structure.