Feature selection of gene expression data for Cancer classification using double RBF-kernels

Bioinformatics

Gene expression data can reflect gene activities and physiological status in a biological system at the transcriptome level. Such data typically comprise few samples but are high-dimensional and noisy [1]. A single gene chip or next-generation sequencing run can measure at least tens of thousands of genes for one sample, yet only a few groups of genes are related to any given disease or biological process [2, 3]. Moreover, including the redundant genes not only enlarges the search space enormously but also reduces the performance of data mining through overfitting. Thus, extracting the disease-mediated genes from the original gene expression data has been a major problem in medicine. In addition, identifying the appropriate disease-related genes will allow the design of relevant therapeutic treatments [4, 5].

So far, several feature selection methods have been suggested to extract disease-mediated genes [6–8]. Zhou et al. [3] proposed a new measure, the LS bound measure, to address numerous redundant genes. Several statistical tests (the χ² test, etc.) and classic classifiers (Support Vector Machine, etc.) have been used in feature selection [9]. In general, these methods can be divided into three categories: filter, wrapper and embedded methods [9, 10]. The filter method is based on the structural information of the dataset itself and is independent of the classifier; it selects a feature subset from the original dataset using an evaluation rule based on statistical methods [11]. The wrapper method [12] evaluates the significance of feature subsets by the performance of the classifier, while the embedded method [13] combines the advantages of filter and wrapper methods, selecting feature genes using a pre-determined classification algorithm [14, 15]. Since filter methods are independent of the classifier, their computational complexity is relatively low, and they are therefore suitable for massive data processing [16]. Wrapper methods, by contrast, can reach a higher accuracy, but they also carry a higher risk of over-fitting.
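To make the idea of a filter method concrete, the following is a minimal sketch that ranks genes by a score computed from the data alone, without training any classifier. The scoring rule (absolute difference of class means divided by the sum of class standard deviations), the function name `filter_rank`, and the toy data are all assumptions chosen for illustration; they are not the measures used by the methods cited above.

```python
from statistics import mean, stdev

def filter_rank(X, y, top_k):
    """Rank features for a two-class problem (labels 0 and 1).

    Illustrative score per feature: |mean_0 - mean_1| / (std_0 + std_1).
    No classifier is involved, which is what makes this a filter method.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        class0 = [row[j] for row, label in zip(X, y) if label == 0]
        class1 = [row[j] for row, label in zip(X, y) if label == 1]
        spread = stdev(class0) + stdev(class1)
        score = abs(mean(class0) - mean(class1)) / spread if spread > 0 else 0.0
        scores.append(score)
    # Indices of features sorted from most to least discriminative.
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order[:top_k], scores

# Hypothetical toy data: 6 samples x 3 genes; only gene 1 separates the classes.
X = [
    [1.00, 0.10, 5.0],
    [1.20, 0.20, 4.0],
    [0.90, 0.15, 6.0],
    [1.10, 2.00, 5.5],
    [1.00, 2.10, 4.5],
    [1.05, 1.90, 5.0],
]
y = [0, 0, 0, 1, 1, 1]

selected, scores = filter_rank(X, y, top_k=1)
print("selected gene indices:", selected)
```

A wrapper method would instead retrain a classifier on each candidate subset, which is why it tends to cost more and to overfit more easily on small samples.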

Kernel methods have been among the central methods in machine learning in recent years and have been widely applied to classification and regression. A kernel method can map the data (non-linearly) to a higher-dimensional space [17]. Hence, by using a kernel method, the dimension of observed data such as gene expression data can be significantly reduced; that is, the irrelevant genes can be filtered out, revealing the hidden inherent law in the biological system [18]. The choice of kernel has a great impact on the learning and predictive results of machine learning methods [5, 19].

Although a great number of kernels exist and it is difficult to explain their distinctive characteristics, the kernels used for feature extraction can be divided into two classes: global kernels, such as the polynomial kernel, and local kernels, such as the radial basis function (RBF) kernel. The influence of different types of kernels on interpolation and extrapolation capabilities has been investigated. With global kernels, data points far away from the test point have a profound effect on kernel values, whereas with local kernels only those close to the test point have a great effect on kernel values. The polynomial kernel shows better extrapolation abilities at lower degrees but requires higher degrees for good interpolation, while the RBF kernel has good interpolation abilities but fails to provide longer-range extrapolation [17, 20].
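The difference between local and global kernels can be seen numerically. The sketch below uses the standard definitions of the RBF kernel, exp(-γ‖x − z‖²), and the polynomial kernel, (x·z + c)^d; the parameter values and the three points are hypothetical and chosen only to make the contrast visible.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Local kernel: decays quickly as the squared distance grows."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Global kernel: the value depends on the inner product, not on closeness."""
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + coef0) ** degree

test_point = [1.0, 1.0]
near_point = [1.1, 0.9]   # close to the test point
far_point = [5.0, 5.0]    # far from the test point

for name, point in [("near", near_point), ("far", far_point)]:
    print(name,
          "rbf =", rbf_kernel(test_point, point),
          "poly =", polynomial_kernel(test_point, point))
```

With these values the RBF kernel is close to 1 for the near point and practically zero for the far point, while the polynomial kernel is larger for the far point than for the near one, so distant samples still influence a global kernel strongly.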

KBCGS [20] is a new filter method based on the RBF kernel that uses weighted gene measures in clustering. This supervised learning algorithm applies a global adaptive distance to avoid falling into local minima. The RBF kernel function has been shown to give satisfactory global classification performance for gene selection, yet exploring this problem in depth needs further research. A typical mixture kernel is constructed as a convex combination of basis kernels. Based on the characteristics of the original kernel functions, a linear fusion of a local kernel function and a global kernel function can constitute a new mixed kernel function. Several mixture kernels have been introduced in [21–23] to overcome the limitations of a single kernel; they can enhance the interpretability of the decision function and improve performance. Phienthrakul et al. proposed multi-scale RBF kernels in Support Vector Machines and demonstrated that multi-scale RBF kernels could perform better than a single RBF kernel on benchmarks [23].
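A convex combination of a local and a global kernel can be sketched as follows. The mixing weight `lam`, the function name `mixed_kernel`, and all parameter values are assumptions for illustration; the mixture kernels in the cited works may be parameterized differently.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + coef0) ** degree

def mixed_kernel(x, z, lam=0.5, gamma=1.0, degree=2, coef0=1.0):
    """Convex combination: lam * local kernel + (1 - lam) * global kernel."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1] for a convex combination")
    local_part = rbf_kernel(x, z, gamma)
    global_part = polynomial_kernel(x, z, degree, coef0)
    return lam * local_part + (1.0 - lam) * global_part

x, z = [1.0, 2.0], [2.0, 0.5]
# lam = 0 recovers the polynomial kernel, lam = 1 recovers the RBF kernel.
values = {lam: mixed_kernel(x, z, lam=lam) for lam in (0.0, 0.5, 1.0)}
print(values)
```

Because a non-negative weighted sum of valid (positive semi-definite) kernels is itself a valid kernel, the mixture can be used wherever either basis kernel could.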

In this paper, we modified KBCGS based on double RBF-kernels and applied the proposed method to feature selection of gene expression data. We introduced the double RBF-kernel to both SVM and KNN and evaluated their performance in gene selection. This mixture describes varying degrees of local and global characteristics of kernels simply by choosing different values of γ1 and γ2. We combined the double RBF-kernel with a weighted method to overcome the limitations of a single, local kernel. As an application, we provided a feature extraction method that uses this kernel and applied it to several benchmark datasets to evaluate its performance: diffuse large B-cell lymphoma (DLBCL) [24], colon [2], lymphoma [1], gastric cancer [25], and mixed tumors [26]. The results demonstrate that this method allows better discrimination in gene selection. In addition, the method is superior in accuracy and efficiency to traditional gene selection methods.
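As a rough illustration of how two RBF widths can be combined and used with a nearest-neighbor classifier, consider the sketch below. The exact form of the double RBF-kernel is not given in this section, so the weighted sum with weight `w`, the values of `gamma1` and `gamma2`, the kernel-induced distance, and the toy data are all assumptions; this is not the DKBCGS algorithm and it omits the gene weighting entirely.

```python
import math

def sq_dist(x, z):
    return sum((a - b) ** 2 for a, b in zip(x, z))

def double_rbf(x, z, gamma1=0.01, gamma2=1.0, w=0.5):
    """Assumed form: weighted sum of a wide (small gamma) and a narrow (large gamma) RBF."""
    d2 = sq_dist(x, z)
    return w * math.exp(-gamma1 * d2) + (1.0 - w) * math.exp(-gamma2 * d2)

def kernel_distance_sq(x, z, kernel):
    """Squared distance in the feature space induced by the kernel."""
    return kernel(x, x) - 2.0 * kernel(x, z) + kernel(z, z)

def knn_predict(train_X, train_y, query, kernel, k=1):
    """Majority vote among the k training samples closest in kernel distance."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: kernel_distance_sq(query, pair[0], kernel),
    )
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

# Hypothetical two-gene expression profiles.
train_X = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.2, 2.9]]
train_y = ["normal", "normal", "tumor", "tumor"]
query = [2.8, 3.1]

print(knn_predict(train_X, train_y, query, double_rbf, k=3))
```

A small γ gives a slowly decaying, more global-looking component and a large γ gives a sharply local one, which is how the choice of γ1 and γ2 shifts the balance between the two behaviors. Since this kernel depends only on the distance between samples, the neighbor ordering here coincides with the Euclidean ordering; the example shows the mechanics only.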

This paper first provides a brief overview of gene selection methods for expression data analysis. It then presents the improved KBCGS method, called DKBCGS (Double-kernel KBCGS), in which the two classification methods were used for the clustering analysis, and compares it with six popular gene selection methods. The last section of the paper provides a comprehensive evaluation of the proposed method using four benchmark gene expression datasets.
