Antimicrobial peptide similarity and classification through rough set theory using physicochemical boundaries

Bioinformatics
Protein and peptide sequence-based classification methods have been developed extensively to improve the understanding of polypeptide function [65, 66]. Using rough set theory, our method builds rules that distinguish active from inactive antibacterial peptides. We benchmarked the method against EFC [52], a recently published method based on motif recognition, and against a larger set of methods from publicly available prediction servers. The first benchmark is a ten-fold cross-validation on a dataset used in previous studies [52, 64], in which the positive sequences from the APD2 (Antimicrobial Peptide Database 2) [10] were clustered into 115 clusters and the negative sequences from the PDB [67] into 116 clusters. Each cluster is represented by one sequence. The results were compared with EFC-based methods and with support vector machines (SVMs) given subsequences of 5 to 8 amino acids. Table 3 shows that our method has higher selectivity and accuracy than the SVM methods, and selectivity and accuracy comparable to the EFC method. Table 3 also shows that the Matthews correlation coefficient (MCC; 0 for random guessing and 1 for perfect performance) of the SVM methods decreases as the subsequence length increases. The subsequences in CLN-MLEM2 are 3 amino acids long; using a single short subsequence length, rather than combining four different lengths as in the EFC method, may have contributed to our performance.
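The Matthews correlation coefficient used in this comparison can be computed directly from the four confusion-matrix counts. The following is a minimal sketch in Python; the function name and the example counts are illustrative assumptions and are not taken from the benchmark data.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 when the denominator is zero."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# Hypothetical counts for a balanced set of 50 active and 50 inactive peptides.
print(matthews_corrcoef(tp=50, tn=50, fp=0, fn=0))    # every peptide classified correctly
print(matthews_corrcoef(tp=25, tn=25, fp=25, fn=25))  # no better than random guessing
```

The first call gives an MCC of 1 and the second an MCC of 0, matching the two reference points quoted in the text. Unlike accuracy, the coefficient uses all four counts, so it stays informative when the positive and negative classes differ in size.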

Table 3

Performance of rough set theory rule induction compared to motif-search in 10-fold cross validation

Method                            MCC
5-kmer SVM      75.7     75.0     0.54
6-kmer SVM      74.8     74.1     0.46
7-kmer SVM      73.0     72.4     0.40
8-kmer SVM      73.0     72.4     0.36
EFC-FCBF        87.1     87.2     0.76
CLN-MLEM2       86.9     86.3     0.75

We further tested our modified MLEM2 method against a larger variety of classification methods. The second benchmark uses the iAMP-2L dataset [50]. Like the dataset used for the first benchmark, this dataset is derived from the APD2 database. However, instead of choosing a single sequence from each cluster, the sequences were narrowed by removing those with greater than 40% similarity as measured by CD-HIT [68], applied only to clusters of more than 250 sequences. This resulted in a positive testing dataset of 848 unique sequences. The negative sequences came from a UniProt search of cytoplasmic proteins, also with less than 40% similarity; the negative testing dataset included 2405 unique sequences. The positive training dataset was the S1 set (“Antibacterial”) from iAMP-2L, which has 1274 unique sequences. The negative training dataset was the non-AMP dataset from iAMP-2L, which has 1440 unique sequences.

While our method has selectivity comparable to current state-of-the-art methods, it is among the best in specificity (Table 4). The combination of the evolutionary algorithm with chemical properties (EFC + 307-FCBF: EFC combined with FCBF (Fast Correlation Based Features) using 307 features) is the only other state-of-the-art method with specificity comparable to ours. We achieve similar specificity using 74 AAindex1 features instead of 307 AAindex1 features. Removing the length-independent representation from the EFC method (EFC-FCBF: EFC without FCBF) results in almost no loss of sensitivity (92.4% to 92.0%), but a loss of about 6 percentage points in specificity (96.1% to 90.0%). Removing the order-sensitive representation for EFC in Table 2 results in lower sensitivity and selectivity (MCC = 0.54). Although the datasets behind Table 3 and Table 4 are different, comparing the individual components of the EFC algorithm with the combined algorithm shows a dramatic improvement when order-sensitive and length-independent sequence representations are integrated. Our CLN-MLEM2 method integrates these two types of representation at its most basic level of output, the rule.

Table 4

Performance comparison among prediction servers for antimicrobial peptides, a motif-based classification method and rough set theory approach

Method                                    Sensitivity (%)   Specificity (%)   MCC
CAMP SVM                                  95.8              39.8              0.43
CAMP RF                                   97.1              33.5              0.40
CAMP ANN                                  89.1              70.9              0.61
CAMP DA                                   94.1              49.5              0.49
iAMP-2L                                   97.7              92.0              0.90
EFC-FCBF                                  92.0              90.0              0.73
EFC + 307-FCBF (307 AAindex1 features)    92.4              96.1              0.86
CLN-MLEM2 (74 AAindex1 features)          88.0              95.4              0.85

Our method has high specificity, and its accuracy for antibacterial classification is similar to that of other current methods. When a classification method is used for the discovery of antimicrobial peptides, its specificity is more important than its selectivity [69]. Our method prioritizes specificity and a low false discovery rate (FDR) by classifying any sequence that does not match a rule in the applied rule set as inactive (Fig. 3). Only one method, EFC + 307-FCBF, provides a lower FDR than ours. However, our method reaches similar specificity starting from fewer physicochemical properties. The robustness of the method may be improved with ensemble learning and voting scheme approaches. If our method provides unique descriptions of activity, it could reduce the overall false discovery rate of such ensemble and voting approaches.
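The default-to-inactive behaviour described above can be illustrated with a small rule matcher. This is a sketch under stated assumptions, not the CLN-MLEM2 implementation: the `Rule` class, the property names, and the interval bounds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """Hypothetical rule: every listed property must fall inside its interval."""
    name: str
    bounds: dict  # property name -> (low, high), both inclusive

    def matches(self, props):
        return all(
            prop in props and low <= props[prop] <= high
            for prop, (low, high) in self.bounds.items()
        )

def classify(props, active_rules):
    """Return ('active', matched rule names), or ('inactive', []) when no rule matches."""
    hits = [rule.name for rule in active_rules if rule.matches(props)]
    return ("active", hits) if hits else ("inactive", [])

# Illustrative rules and property values only; none are taken from the paper.
rules = [
    Rule("R1", {"hydrophobicity": (0.2, 0.8), "net_charge": (2.0, 9.0)}),
    Rule("R2", {"helix_propensity": (1.0, 1.5)}),
]

print(classify({"hydrophobicity": 0.5, "net_charge": 4.0, "helix_propensity": 0.7}, rules))
print(classify({"hydrophobicity": 0.1, "net_charge": 0.0, "helix_propensity": 0.7}, rules))
```

The first peptide satisfies rule R1 and is labelled active; the second satisfies no rule and falls through to inactive. Because a positive call requires an explicit match, unfamiliar sequences cannot add false positives, which is the mechanism that favours specificity over selectivity.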

Fig. 3

False discovery rates of comparative antimicrobial peptide classification methods. CLN-MLEM2 achieves a low false discovery rate among currently available antimicrobial peptide classification methods

CLN-MLEM2 has been shown to be useful for the learning task of predicting antibacterial activity from a peptide sequence. This learning task is related to multi-instance learning. A classic example of a multi-instance learning problem in the literature is drug activity prediction [70]. Active molecules have at least one conformation that interacts with a drug target, while inactive molecules have none. The challenge is to identify which conformations interact with the drug target. Each drug has one molecular formula, but it can have many conformations. Likewise, each peptide has one sequence but many physicochemical property values. The CLN-MLEM2 method has found the physicochemical property features most relevant to the activity of the peptide sequence. The method can also be applied to the multi-instance learning case of describing which conformations of peptides are active.
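The multi-instance assumption in the drug activity example, where a molecule is active if at least one of its conformations is, can be written in a few lines. The sketch below is generic and is not part of CLN-MLEM2; the function names, the scores, and the threshold are hypothetical.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

def bag_is_active(instances: Iterable[T], instance_is_active: Callable[[T], bool]) -> bool:
    """Standard multi-instance assumption: a bag is active if any one instance is."""
    return any(instance_is_active(x) for x in instances)

# Hypothetical example: each number stands in for a per-instance score,
# such as one conformation of a molecule or one property value of a peptide.
THRESHOLD = 0.7  # illustrative cut-off, not taken from the paper

def passes(score: float) -> bool:
    return score >= THRESHOLD

print(bag_is_active([0.1, 0.4, 0.9], passes))  # one instance passes, so the bag is active
print(bag_is_active([0.1, 0.4, 0.5], passes))  # no instance passes, so the bag is inactive
```

Only the bag-level label is observed during training, so the learning problem is to recover which instances are responsible for a positive label.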

Our method also acts as an embedded feature selection tool by limiting the physicochemical properties in the rules to a user-defined number [62]. Because the number of features to select can be capped, this property may make CLN-MLEM2 useful for feature selection for other methods in the field. Our proposed method, CLN-MLEM2, has a low false discovery rate compared with other antimicrobial peptide classification methods, as shown in Fig. 3. The EFC method also has a low false discovery rate when the physicochemical properties are included, but its false discovery rate doubles when the pattern recognition component is used alone.

A decrease in selectivity of the classification will cause longer computer search times, while a decrease in specificity will increase the number of necessary experimental activity assays. Since the cost of experimentally testing peptides is much greater than the computational time of searching for antimicrobial peptides, methods that have high specificity are preferred. In addition to the high specificity of our method, our method creates categories of antimicrobial peptides. Categorization of peptides aids in the selection and in the design of antimicrobial peptides by providing similarity groupings according to physicochemical property boundaries. Peptides that match multiple active categories can combine more physicochemical property values associated with activity.
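The trade-off between search effort and experimental cost can be made concrete with a small calculation. The sketch below estimates how many peptides a classifier would send to the assay stage; the function name, library size, prevalence, and rates are hypothetical values chosen for illustration and do not come from the benchmarks.

```python
def screening_outcome(n_candidates, prevalence, sensitivity, specificity):
    """Expected counts when a classifier screens a candidate peptide library."""
    n_active = n_candidates * prevalence
    n_inactive = n_candidates - n_active
    true_pos = sensitivity * n_active            # actives correctly flagged
    false_pos = (1.0 - specificity) * n_inactive  # inactives wrongly flagged
    flagged = true_pos + false_pos
    return {
        "assays": flagged,                   # every flagged peptide is tested
        "wasted_assays": false_pos,          # assays spent on inactive peptides
        "missed_actives": n_active - true_pos,
        "fdr": false_pos / flagged if flagged else 0.0,
    }

# Hypothetical library: 10,000 candidates, 1% of which are truly active.
high_specificity = screening_outcome(10_000, 0.01, sensitivity=0.90, specificity=0.95)
high_sensitivity = screening_outcome(10_000, 0.01, sensitivity=0.95, specificity=0.90)

for label, result in [("high specificity", high_specificity),
                      ("high sensitivity", high_sensitivity)]:
    print(label, {key: round(value, 3) for key, value in result.items()})
```

Under these assumed numbers, giving up five points of specificity roughly doubles the assays spent on inactive peptides (about 495 to about 990), while the matching gain in sensitivity recovers only about five additional active peptides. When actives are rare, the false positives dominate the assay workload.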
