Proteins that are predicted to be expressed from an open reading frame, but for which there is no experimental evidence of translation, are known as hypothetical proteins (HPs). Across the whole genome, only approximately 2% of the sequence codes for proteins, while the remainder is non-coding or still functionally unknown. These known-unknown regions for which no functional links have been discovered, i.e. those with no known biochemical properties or obvious relatives in protein and nucleic acid databases, are known as orphan genes, and their end products are called HPs. These proteins are of great importance, as many of them might be associated with human diseases and could therefore be assigned to functional families. Despite their lack of functional characterization, they play an important role in understanding biochemical and physiological pathways; for example, in finding new structures and functions, markers and pharmacological targets, and in early detection, with benefits for proteomic and genomic research. In the recent past, many efficient approaches have been developed, and tools to predict the function of HPs are publicly available. One such widely used technique is protein-protein interaction (PPI) analysis, which is considered valuable in interpreting the function of HPs. While many proteins interact with other proteins to carry out their functions, the challenges are not limited to their function but extend to their regulation. Therefore, characterizing the uncharacterized proteins helps to understand the biological architecture of the cell. While high-throughput experimental methods such as the yeast two-hybrid (Y2H) method and mass spectrometry are available to discern the function of proteins, the datasets generated by these methods tend to be incomplete and to contain false positives. Along with PPIs, there are other methods to identify the essentiality of proteins, such as antisense RNA, RNA interference, single-gene deletions and transposon mutagenesis.
However, all these approaches are tedious, expensive and laborious; therefore, computational approaches combined with high-throughput experimental datasets are required to identify the function of proteins [9, 14]. Different computational methods have been designed for estimating protein function based on information generated from sequence similarity, subcellular localization, phylogenetic profiles, mRNA expression profiles, homology modelling, etc. Very recently, Lei et al. predicted essential proteins based on RNA-Seq, subcellular localization and GO annotation datasets [16, 17]. Furthermore, tools such as “LOCALIZER”, which predicts the subcellular localization of both plant and effector proteins in the plant cell, and lncLocator, which predicts the subcellular localization of long non-coding RNAs based on a stacked ensemble classifier, have been useful. On the other hand, combining these methods and integrating heterogeneous biological datasets is considered to be more predictive than using any one of them alone. Genome-wide expression analysis, machine learning, data mining, deep learning and Markov random fields are other widely employed prediction methods [20, 21], whereas Support Vector Machines (SVM), Neural Networks, Bayesian Networks [24, 25], Probabilistic Decision Trees, Rosetta Stone [14, 27], Gene Clustering and Network Neighbourhood analyses have been used to combine different biological data sources to interpret biological relationships. Although these have been shown to be successful in predicting protein function, annotation based on feature selection for inferring the function of HPs is still lacking. Nevertheless, there has been a steady increase in the use of machine learning and information-theoretic features to develop efficient frameworks for predicting interactions between proteins [28–30].
In this paper, we present a machine learning based approach to predict whether or not a given HP is functional. This method is not based on homology comparison to experimentally verified essential genes, but depends on sequence-, topology- and structure-based features that correlate with protein essentiality at the gene level. Features are the observable quantities that are given as input to a machine learning algorithm. The values recorded for each feature are used by the learning algorithm to predict the output variables. Therefore, selecting relevant features that can predict the desired outputs is important. Various features define the essentiality of proteins. In our previous study, we selected six such features (orthology mapping, back-to-back orthology, domain analysis, sorting signals and sub-cellular localization, functional linkages, and protein interactions) that are potentially viable for predicting the function of HPs. Although the prediction performance of the selected features was shown to be acceptable, in the present study we added data on pseudogenes, non-coding RNA and homology modelling to increase the predictability of the functionality of these known-unknowns. These additional features capture the possibility of pseudogenes linked to HPs, the presence of non-coding RNA signatures, and proteins that are essentially structural ‘mers’ of the candidate proteins. We discuss the performance of the newly introduced classification features from a machine learning perspective to validate the function of HPs.
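To make the notion of features as classifier inputs concrete, the following is a minimal sketch of how the nine criteria named above could be encoded as a fixed-length vector for one candidate HP. The identifier names, the `encode` helper and the binary 0/1 encoding are assumptions made purely for illustration; they are not taken from the implementation used in this study.

```python
from typing import Dict, List

# The nine criteria listed in the text, in a fixed order.
# The snake_case names are hypothetical labels chosen for this sketch.
FEATURES = (
    "orthology_mapping",
    "back_to_back_orthology",
    "domain_analysis",
    "sorting_signals_and_localization",
    "functional_linkages",
    "protein_interactions",
    "pseudogene_link",
    "non_coding_rna_signature",
    "homology_modelling",
)


def encode(evidence: Dict[str, bool]) -> List[int]:
    """Turn per-feature evidence into a 0/1 vector ordered as FEATURES.

    Assumption: each feature is scored as present (1) or absent (0);
    features missing from `evidence` are treated as absent.
    """
    unknown = set(evidence) - set(FEATURES)
    if unknown:
        raise KeyError(f"unrecognised feature(s): {sorted(unknown)}")
    return [int(evidence.get(name, False)) for name in FEATURES]


# Hypothetical candidate with support from three of the nine criteria.
candidate = {
    "orthology_mapping": True,
    "domain_analysis": True,
    "protein_interactions": True,
}
vector = encode(candidate)
print(dict(zip(FEATURES, vector)))
```

A list of such vectors, one per HP, together with known labels for a training set, is the kind of input a standard classifier would consume; which classifier is used is independent of this encoding.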