CoExpresso: assess the quantitative behavior of protein complexes in human cells

Bioinformatics

Tight regulation of protein complexes by translational and post-translational control mechanisms may result in the degradation of surplus proteins that are not incorporated into the complex. In that case, the proteins of known complexes, such as those collected in the CORUM database, will show similar abundance profiles when compared across different cell types. However, proteins often have multiple functions, a complex might occur in different compositions, or a complex might not change in abundance at all. We therefore did not assume that all complexes show highly similar abundance profiles of their proteins, but merely investigated how much co-regulation can be observed.

We applied different scoring systems to evaluate whether proteins in human complexes exhibit similar regulatory behavior when compared over multiple cell types. Despite the large set of available protein abundances, coverage of the proteins over the 140 cell types was often sparse (Additional file 1: Figure S1), requiring scoring methods that account for missingness. In such a scenario, simply calculating the similarity between protein abundance profiles, e.g. by Pearson’s correlation, will not provide statistically valid measures of their co-regulatory behavior. For instance, low coverage over cell types automatically leads to higher correlations between protein abundance profiles than higher coverage does (Additional file 1: Figure S2). Hence, confidence estimates of protein co-regulation require scoring methods that take the effects of data coverage into account. This can be achieved by calculating empirical p-values, comparing the score of a protein to the scores obtained from appropriately randomized data. We therefore investigated different ways of randomizing the ProteomicsDB data to identify the best performing combination of scoring scheme and randomization procedure.
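The empirical p-value described above can be sketched as follows. This is a minimal illustration rather than the CoExpresso implementation: the function name `empirical_p_value`, the convention that a higher score means stronger co-regulation, and the +1 pseudocount are all assumptions made for the example.

```python
import random

def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score.

    Assumes higher scores mean stronger co-regulation. The +1 in the
    numerator and denominator keeps the estimate above zero, a common
    convention for randomization-based tests.
    """
    n_extreme = sum(1 for s in null_scores if s >= observed)
    return (n_extreme + 1) / (len(null_scores) + 1)

# Toy null distribution standing in for scores of randomized protein groups.
rng = random.Random(0)
null = [rng.gauss(0.0, 1.0) for _ in range(999)]

p_high = empirical_p_value(5.0, null)   # score far above the null scores
p_low = empirical_p_value(-5.0, null)   # score far below the null scores
```

With 999 randomizations the smallest attainable p-value is 1/1000, so the number of randomizations sets the resolution of the confidence estimate.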

Table 1 summarizes the methods and randomizations used. In short, MCOM compares each protein profile to the averaged profile of the protein group, which allows assessing how closely a protein follows this common trend. PCOM is based on pairwise comparisons and summarizes them by their sum. This method was implemented to take into account internal structures, such as protein subgroups with high correlations. The FAM model is based on factor analysis and calculates a weight for each protein, giving a measure of how much each protein contributes to the profile of the entire protein group.
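As a rough sketch of the two correlation-based models, the following code computes MCOM-like and PCOM-like scores for profiles with missing values (encoded as `None`). Several details are assumptions not stated in the text: Pearson correlation is restricted to cell types shared by both profiles, at least three shared values are required, and the averaged profile includes the protein being scored. The factor analysis model (FAM) is not sketched here.

```python
import math

def pearson(x, y):
    """Pearson correlation over positions where both profiles have values."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 3:
        return None  # too few shared cell types (assumed cutoff)
    mx = sum(a for a, _ in pairs) / len(pairs)
    my = sum(b for _, b in pairs) / len(pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # constant profile, correlation undefined
    return sxy / math.sqrt(sxx * syy)

def mean_profile(profiles):
    """Per-cell-type mean over the proteins measured in that cell type."""
    out = []
    for column in zip(*profiles):
        vals = [v for v in column if v is not None]
        out.append(sum(vals) / len(vals) if vals else None)
    return out

def mcom_scores(profiles):
    """MCOM-like: correlation of each protein with the averaged group profile."""
    avg = mean_profile(profiles)
    return [pearson(p, avg) for p in profiles]

def pcom_scores(profiles):
    """PCOM-like: per protein, the sum of its correlations with all others."""
    scores = []
    for i, p in enumerate(profiles):
        rs = [pearson(p, q) for j, q in enumerate(profiles) if j != i]
        scores.append(sum(r for r in rs if r is not None))
    return scores

# Toy group: rows are proteins, columns are cell types.
group = [
    [1.0, 2.0, 3.0, 4.0, None],
    [2.0, 4.1, 6.2, 7.9, 10.0],
    [4.0, 3.0, None, 1.0, 0.5],  # subunit running against the common trend
]
mcom = mcom_scores(group)
pcom = pcom_scores(group)
```

In this toy group the third protein receives a negative MCOM-like score and the lowest PCOM-like score, which is the kind of outlier behavior both models are meant to expose.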

Table 1

Summary of scoring models and randomization methods

| Model | Abbrv. | Output |
|---|---|---|
| Mean correlation | MCOM | Similarity to averaged abundance profile |
| Pairwise correlation | PCOM | Sum of pairwise similarities |
| Factor analysis | FAM | Weights for protein contribution to full set |

| Randomization | Abbrv. | Basis |
|---|---|---|
| Independent sampling | IS | Mix all values |
| Protein-centered sampling | PCS | Keep protein profiles |
| Protein- and cell type-centered sampling | PTCS | Keep protein and cell type profiles |
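The randomization schemes in Table 1 are only described by their basis, so the following is one possible reading, not the published procedure. The IS-like function mixes all measured values, and the PCS-like function draws whole profiles of randomly chosen proteins so that each protein's values and missingness pattern stay intact. PTCS is left out because the text does not give enough detail to sketch it. The database layout and the protein names `P1` to `P4` are hypothetical.

```python
import random

def independent_sampling(db, group_size, n_cell_types, rng):
    """IS-like: build profiles from values drawn from the pool of all measurements.

    Note that this simple version fills every cell type, so it does not
    reproduce the missingness of the original data.
    """
    pool = [v for profile in db.values() for v in profile if v is not None]
    return [[rng.choice(pool) for _ in range(n_cell_types)]
            for _ in range(group_size)]

def protein_centered_sampling(db, group_size, rng):
    """PCS-like: draw whole profiles of randomly chosen, distinct proteins."""
    names = rng.sample(sorted(db), group_size)
    return [list(db[name]) for name in names]

# Hypothetical miniature database: protein -> abundance per cell type.
db = {
    "P1": [1.0, None, 3.0],
    "P2": [2.0, 2.5, None],
    "P3": [None, 0.5, 0.7],
    "P4": [4.0, 4.2, 4.4],
}
rng = random.Random(1)
is_group = independent_sampling(db, group_size=2, n_cell_types=3, rng=rng)
pcs_group = protein_centered_sampling(db, group_size=2, rng=rng)
```

Scoring many such random groups with the same model as the real complex yields the null scores against which the observed score is compared.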

For the following analysis, each protein complex reported in CORUM was assessed for coverage in ProteomicsDB and further evaluated by the different models when all protein subunits were available in at least 5 cell types. We tested a total of 1414 protein groups out of 2157 annotated in CORUM.
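The coverage criterion can be illustrated with a short filter. The sketch assumes that "available in at least 5 cell types" means cell types in which every subunit has a measured value; the text does not spell this out, and the function names are illustrative.

```python
def shared_cell_types(profiles):
    """Indices of cell types in which every subunit has a measured value."""
    return [i for i, column in enumerate(zip(*profiles))
            if all(v is not None for v in column)]

def is_testable(profiles, min_cell_types=5):
    """True if the complex has enough commonly covered cell types."""
    return len(shared_cell_types(profiles)) >= min_cell_types

# Toy complex with two subunits measured over six cell types.
complex_profiles = [
    [1.0, 2.0, None, 4.0, 5.0, 6.0],
    [1.5, 2.5, 3.5, 4.5, 5.5, 6.5],
]
common = shared_cell_types(complex_profiles)
```

Here the third cell type is dropped because one subunit is missing, which leaves five shared cell types and makes the toy complex just pass the filter.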

Scoring models

Empirical confidence estimation of co-regulatory behavior was carried out by representing the null distribution (i.e. cases of no co-regulation) by scores obtained from randomizations. By comparing the scores of the different models to scores from randomly sampled data, we obtained the probability that the observed abundance profiles result from randomly chosen proteins. Thus, the false discovery rates (FDRs), represented by p-values corrected for multiple testing, provide a measure of the significance of a given complex on the basis of the co-regulation of its subunits within human cell types. The different randomization techniques were applied to resemble the intrinsic data structure on different scales.
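The text does not name the multiple-testing correction that was used. As an illustration only, the sketch below applies the widely used Benjamini-Hochberg procedure to a list of p-values; the choice of this procedure is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values for four complexes.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

Each adjusted value can then be compared against an FDR threshold to decide which complexes are called significantly co-regulated.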

Figure 2a compares the p-values calculated for each model and randomization. More “realistic” randomization (IS PCS1: Figure S3). The FAM approach, however, performed differently, reaching a higher number of significant complexes for the protein-centered randomization.

Fig. 2

Comparison of models for significant co-regulations. Number of complexes (a) and proteins (b) with significant abundance profiles according to the different scoring models and randomizations calculated for different thresholds for their false discovery rate (FDR)

At the protein level (Fig. 2b), lower numbers of proteins with significant abundance profiles were expected, and were indeed observed, when using randomization methods that maintain protein and cell type properties. Here, PCOM yields a higher number of proteins than FAM and MCOM at low false discovery rates.

Robustness

Recovering proteins and complexes with significant abundance profiles does not, however, ensure that the methods are robust to noise. For example, a protein complex may contain subunits that do not follow the general trend of the abundance profiles. This could be due to incorrect assignment of a protein to a complex, or to a subunit behaving differently because it is heavily regulated, e.g. by post-translational modifications or by forming transient interactions that regulate complex function.

Method robustness in handling differentially abundant proteins can be simulated by adding randomly chosen proteins to the CORUM complexes. In all complexes, we increased the number of proteins by 50%, 75% and 100%. Figure 3 shows ROC curves for these simulated complexes, where we compared the significance by counting true positives (actual complex subunits) and false positives (added proteins). Here, the different methods and randomization approaches showed consistent differences in their robustness. Randomization of the entire ProteomicsDB data led to lower robustness for all methods. On the other hand, protein-centered (PCS) and protein- and cell type-centered (PTCS) randomization gave nearly identical performance. Hence, the following analysis will focus on PCS randomization, although it is the computationally most expensive one, as it yields higher counts of significant proteins. In addition, the MCOM and FAM models had lower false positive rates, at least in the lower range.

Fig. 3

Performance of scoring models measured by robustness to 50%, 75% and 100% artificially added random proteins. Proteins were categorized into complex subunits and random proteins. True positive and false positive rates (TPR and FPR) were given by the fraction of true positives and false positives at a given FDR threshold. The MCOM and FAM models led to better performance. Only a slight difference between PCS and PTCS randomizations can be observed.
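The TPR and FPR definitions used for Fig. 3 can be written down in a few lines. The sketch assumes that each protein in a simulated complex already has an FDR value from one of the scoring models and a label saying whether it is an annotated subunit or an added random protein; the function name and the toy numbers are illustrative.

```python
def tpr_fpr(fdr_values, is_subunit, threshold):
    """TPR and FPR at one FDR threshold.

    fdr_values: per-protein FDR from any scoring model
    is_subunit: True for annotated subunits, False for added random proteins
    """
    tp = sum(1 for f, s in zip(fdr_values, is_subunit) if s and f <= threshold)
    fp = sum(1 for f, s in zip(fdr_values, is_subunit) if not s and f <= threshold)
    n_pos = sum(is_subunit)
    n_neg = len(is_subunit) - n_pos
    return tp / n_pos, fp / n_neg

# Toy simulated complex: four subunits plus two added random proteins (+50%).
fdr = [0.001, 0.02, 0.3, 0.004, 0.6, 0.04]
labels = [True, True, True, True, False, False]

# One (TPR, FPR) point per threshold traces out a ROC-like curve.
roc = [tpr_fpr(fdr, labels, t) for t in (0.01, 0.05, 0.5)]
```

Sweeping the threshold over a fine grid and plotting FPR against TPR gives curves of the kind compared across models and randomizations.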

Use cases

The following use cases provide detailed results of the scoring models and general complex behavior for three selected complexes that are representative of the investigated complexes. We obtained 60 CORUM complexes with the lowest FDR values (<0.0003) for all three scoring models.

Use case A: Condensin I (Fig. 4a) was the first of the complexes with the lowest FDR values in all models (PCS randomization). All five proteins were commonly expressed in 75 cell types. Very high correlations between all proteins confirmed the high interaction evidence from STRINGdb [17]. However, Condensin subunit 2 (NCAPH) showed slightly lower correlation and lower scores. Indeed, NCAPH is known as a regulatory subunit of Condensin I with different nucleolar localization during interphase [18]. We observed different abundance levels of NCAPH in several cell types, leading to a lower weight in the factor analysis (Additional file 1: Figures S4 and S5A). Tissues with 2-fold lower abundance levels (compared to the mean of all proteins of the complex) were blood platelet and lung, while 2-fold higher abundance levels were measured for lymph nodes and several cancer cell lines.

Fig. 4

Examples for complexes with highly co-regulated proteins. Upper panels: hierarchical clustering of pairwise correlations between protein abundance profiles. The sidebars show the significance of MCOM and FAM models (PCS randomization). Middle panels: Network visualization of profile similarities. Edge widths correspond to pairwise correlations. Grey tones of the proteins depict FDR significance calculated by PCOM (PCS randomization). Lower panels: STRINGdb (version 10) networks of proteins. Edge width is given by interaction confidence

Use case B: The 28S mitochondrial ribosomal subunit (Fig. 4b), which is essential for ATP production, represents the largest complex among those with the lowest FDR in all models. The 30 proteins were commonly available in 23 cell types. Both our visualization and the STRINGdb interactions suggest a more open structure or composition of the complex, with a core component of heavily co-regulated proteins. The correlation map (upper panel) roughly distinguishes two slightly overlapping large subgroups (proteins MRPS22-MRPS2 and MRPS15-MRPS12) with higher correlations amongst their proteins. We found strikingly different behavior of these groups in lung tissue (Additional file 1: Figure S5B). This suggests that the 28S ribosomal complex plays a different role in lung, where it might break up into two functional units.

We looked into more examples of subgroups with highly co-regulated abundances. Higher coverage over cell types for these subgroups allows gathering further insight into their co-regulation. MRPS17, MRPS36 and MRPS12 show very low correlation (Additional file 1: Figure S6A) and this is confirmed when estimating their significance over 48 cell types (individual proteins p>0.05 and all overall scores p>0.1). MRPS12 and MRPS12 are known to be altered in many and different cancer types [19], which could explain their particularly different behavior.

A group of five proteins, MRPS21, MRPS24, MRPS26, MRPS6 and MRPS33, exhibited the highest correlations and reasonably high significance. We investigated their co-regulation as a protein group of their own, for which abundance profiles were available in a higher number of cell types (34) (Additional file 1: Figure S6B), and confirmed highly significant co-regulation. A literature search did not identify any known functional role for this protein subgroup.

Moreover, the correlation map of the 28S mitochondrial ribosomal subunit exhibits a large subgroup of proteins (DAP3, MRPS2, MRPS5, MRPS7, MRPS10, MRPS14, MRPS15, MRPS18B, MRPS22, MRPS23, MRPS27, MRPS28, MRPS34 and MRPS35) with high correlations (Additional file 1: Figure S6C). When investigating these proteins as a subgroup, all proteins but MRPS15 showed high significance for co-regulation. All of them were consistently less abundant in lung tissue than the other proteins of the complex. This supports the idea that this subgroup might play a particular role in lung tissue.

Use case C: The NUMAC complex (nucleosomal methylation activator complex, Fig. 4c) denotes a case with slightly lower significance. All scoring models suggest high significance with an FDR below 0.5%. The 10 proteins were found in 33 cell types, with ACTB showing distinct behavior and drastically higher abundance than the other proteins. Strong evidence in STRINGdb for interactions of all components but SCYL1 suggests that ACTB plays a crucial role in complex composition but might still have other functions in the cell. We assume that this protein is not actively degraded when it does not form the complex. All three models agreed in assigning high FDR values to ACTB and SMARCD1 (FDR >0.1), suggesting that the latter plays a particular role in this complex.

A data source for tightly co-regulated proteins

Given the strong co-regulation in annotated protein complexes, we asked whether our randomly sampled protein groups with highly significant co-regulation could reveal novel, not yet well-characterized complex compositions in human cells. Random protein groups with the highest scores did not, however, provide evidence that these proteins are arranged as complexes, but they did show an increase in protein interactions. We calculated network enrichment scores in STRINGdb for the 100 top-scoring protein groups and found the majority to be consistently higher than for randomly chosen protein groups (Additional file 1: Figures S7-S8). This means that highly significant protein groups potentially share particular biological functions, such as co-regulation at the transcriptional level or membership in a known or unknown pathway. We implemented CoExpresso, which interrogates groups of human proteins to assess the strength of their co-regulation. Our CoExpresso web service can therefore be highly useful for researchers who want to test their hypotheses on the basis of human cell types in general. Figures and statistical measures can be obtained for any list of (mixed) human protein accession numbers and gene names, given that there is sufficient data coverage in ProteomicsDB.
