
Optimizing Feature Subset and Parameters for Support Vector Machine Using Multiobjective Genetic Algorithm

  • Jyoti Ahuja and Saroj Ratnoo

Abstract

The well-known classifier support vector machine has many parameters associated with its various kernel functions. The radial basis function kernel, being the most preferred kernel, has two parameters (namely, the regularization parameter C and the kernel parameter γ) to be optimized. The problem of optimizing these parameter values is called model selection in the literature, and its outcome strongly influences the performance of the classifier. Another factor that affects the classification performance of a classifier is the feature subset. Both factors are interdependent and must be dealt with simultaneously. Following the multiobjective definition of feature selection, we have applied a multiobjective genetic algorithm (MOGA), NSGA II, to optimize the feature subset and model parameters simultaneously. Comparison of the proposed approach with the grid algorithm and a GA-based method suggests that the MOGA-based approach performs better than the grid algorithm and is as good as the GA-based approach. Moreover, it provides multiple solutions instead of a single one, so users can prefer one feature subset over another as per their requirements and available resources.

1 Introduction

Support vector machine (SVM) is a promising method for classification owing to its high accuracy, good generalization capability, ability to deal with high-dimensional data, and consistency in modeling diverse data sets [17]. Yet, using it well requires an understanding of the various factors that influence its predictive accuracy. Choosing an optimal feature subset is one of the important tasks that have a direct impact on the classification accuracy of SVM. The other important issue to be considered while preparing a classification model using SVM is to set the best kernel parameters. The two issues are interrelated, i.e., the choice of kernel parameters depends on the feature subset used and vice versa, and thus they must be dealt with simultaneously.

Many feature selection methods with different evaluation criteria exist in the literature. However, optimization of a feature subset with respect to a single criterion is not sufficient [18]. Most often, the accuracy of a classifier is the most important criterion for judging a feature subset; however, the size of a feature subset is also a main concern when dealing with exceptionally high-dimensional data sets. Moreover, a single feature subset is usually not of much interest; rather, several subsets of features are of interest. The choice of the final feature subset may depend on various factors such as feature measurement cost, particularly in fields like medical diagnostics. Of two feature subsets resulting in almost the same classification accuracy, the one with the least cost may be desired. Thus, feature selection is an inherently multiobjective problem. As multiobjective genetic algorithms (MOGAs) produce multiple optimal solutions and provide users a broader choice for feature subset selection, they are well suited to solving feature selection as a multiobjective optimization problem (MOP) with two competing objectives, i.e., maximization of classification accuracy and minimization of the cardinality of the feature subset [8, 16]. In the case of cost-sensitive applications, the final choice of the solution is left to the user.

Another crucial step in building an efficient classification model using SVM is the tuning of the parameters associated with its various kernel functions. The effectiveness of SVM depends on the selection of the kernel, the kernel parameters, and the soft margin parameter C. In general, the Gaussian kernel [also called the radial basis function (RBF) kernel] is a reasonable first choice because of its better accuracy and lower convergence time [1]. Two parameter values have to be chosen carefully while using the RBF kernel: the regularization parameter (usually denoted as C), which sets the trade-off between the training error and the complexity of the model, and the kernel function parameter, gamma (γ). The problem of choosing these parameter values is called model selection, and its outcome strongly influences the performance of the classifier. Model selection is also an optimization problem and requires some meta-heuristic, such as a genetic algorithm (GA), to deal with it.

We can optimize the feature subset and the parameters of SVM simultaneously using GAs. Taking into account the multiobjective definition of the feature selection problem, we have applied the most referenced MOGA, introduced by Deb et al. [3], i.e., NSGA II, to simultaneously discover multiple trade-off feature subsets along with the SVM parameters associated with each feature subset.

The rest of the article is organized as follows: Section 2 reviews the literature related to the problem of feature selection and model selection. Sections 3 and 4 give a brief introduction of MOGAs and the SVM classifier. The proposed method is described in Section 5. Section 6 presents the experimental design and a discussion of results. Section 7 summarizes the proposed method and suggests the possible extension of the current work.

2 Related Work

GAs, less likely to be restricted by interdependencies among features, are most frequently used for resolving feature selection problems. They are popular among researchers owing to their simplicity and their capability to search through exponentially large search spaces. A number of GA-based approaches have been proposed to optimize feature subsets, which provide efficient exploration of the solution space to give the single best solution with maximum classification performance [5, 10, 12, 18]. Subsequently, considering feature selection as a MOP, MOGAs have been successfully applied for feature subset selection. A survey of multiobjective evolutionary algorithms in feature selection has been provided by Mukhopadhyay et al. in Reference [15]. The review concludes that the evaluation of a feature subset with respect to a single criterion does not work equally well for all data sets; there is a need to optimize feature subsets with respect to multiple criteria to improve the robustness of a feature subset. Various optimization criteria have been used in the literature to deal with feature selection problems. Hamdani et al. [8] employed MOGA to minimize the number of features and the classification error rate simultaneously. Emmanouilidis et al. [6] applied MOGA with the same objectives and extended the approach to neural network feature selection by introducing a novel commonality-based crossover operator to induce exploratory strength in the algorithm across a range of non-dominated fronts. Minimization of the error rate and of the complexity of the discovered knowledge was used as the optimization criteria by Pappa et al. [16]; this work claims to have improved the comprehensibility of the discovered classifier without compromising on error rate. Some authors have also worked with filter criteria as objective functions of multiobjective feature selection. Six importance measures (two at a time) based on a filter approach were analyzed in Reference [21]; the authors conclude that MOGA is a useful tool for selecting features owing to the effective optimization power of GAs and the ability of multiobjective optimization to investigate multiple solutions at once. Mukhopadhyay and Maulik proposed an SVM-wrapped algorithm based on multiobjective evolutionary algorithms to identify micro-RNA (miRNA) markers from miRNA expression data sets, which optimizes different performance criteria simultaneously to obtain the final best feature subset [14]. Wang and Huang [22] successfully applied the MOGA-based feature selection method to selecting features from credit approval data.

As with feature selection, a lot of work has been done on SVM parameter optimization. For a long time, SVM model selection was tackled using the grid search algorithm, which explores the parameter search space with a fixed step size through a wide range of values in an exhaustive way; it is time consuming and does not perform very well. Evolutionary algorithms have also been used for SVM model selection. In References [7, 23], SVM parameters were optimized using GAs, which improved the results achieved through the grid search algorithm. As GAs have the capability to optimize the feature subset and SVM parameters at the same time, feature selection and parameter optimization have also been carried out simultaneously in an evolutionary way in References [2, 9, 24], outperforming the grid search method. However, these approaches do not take into account the multiobjective definition of the feature selection problem. Thus, to carry out multiobjective feature selection and parameter optimization simultaneously, we have applied a non-dominated sorting GA (NSGA II).

3 Multiobjective Genetic Algorithms

Most real-life problems involve the simultaneous optimization of various objectives that are incommensurable and often conflicting. In a MOP, it is hard to find a single solution that is best with respect to all objectives [25]. Thus, there is a need to generate a set of solutions, each of which satisfies all the objectives at some adequate level without being dominated by any other solution in the solution space; these are called non-dominated or Pareto-optimal solutions [11].

If all the objective functions are to be maximized, a feasible solution x is said to dominate another feasible solution y (denoted $x \succ y$) if $f_i(x) \ge f_i(y)$ for i = 1, 2, …, k (where k is the number of objectives) and $f_j(x) > f_j(y)$ for at least one objective function j.

A solution is said to be Pareto optimal if it is not dominated by any other solution in the solution space. The set of all such non-dominated solutions is referred to as the Pareto optimal set, and the corresponding objective function values in the objective space are called the Pareto front. Thus, the eventual task of multiobjective optimization is to identify the solutions in the Pareto optimal set.

GAs, being a population-based approach, are well suited to solving MOPs [4]. A single-objective GA can be modified to find a set of multiple non-dominated solutions in a single run. Moreover, a GA can find a diverse set of solutions because of its ability to search different regions of the solution space simultaneously. Therefore, GAs have been the most popular heuristic approach to multiobjective design and optimization problems [19].

4 Support Vector Machine

The SVM is a well-known classifier introduced by Vapnik and coworkers in 1992. It is an eminent supervised learning technique capable of solving linear and non-linear binary classification problems. Given a training set with n instances $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in X \subseteq R^n$ is an input vector and $y_i \in \{-1, +1\}$ its corresponding binary class label, the goal of SVM is to separate the instances/examples by means of a maximal marginal hyperplane (MMH). SVM finds this hyperplane using (i) support vectors, which are a subset of the training examples, and (ii) margins, the sides of the MMH. The algorithm thus strives to maximize the distance between the MMH and the examples closest to it. The margin of separation is related to the Vapnik-Chervonenkis dimension (VCdim), which measures the complexity of the classifier.

SVM provides a specific mechanism that fits the hyperplane surface to the training data using a kernel function in case the data are not linearly separable. In the following sections, we describe the mathematical formulation of SVM in three cases: first, a linear classifier for a linearly separable problem; then, a linear classifier for a linearly non-separable problem; and finally, a non-linear classifier for a linearly non-separable problem.

4.1 When Data Are Linearly Separable

For the linearly separable case, SVM determines the hyperplane by maximizing the sum of its distances to the two margins. The equation of a separating hyperplane can be written as $w \cdot x + b = 0$, where $w = \{w_1, w_2, \ldots, w_n\}$ is a weight vector with one weight per attribute and b is a scalar referred to as the bias. For a linearly separable case, the data points will be correctly classified by the following equations:

(1)  $w^T x_i + b \ge 0$ if $y_i = +1$,
(2)  $w^T x_i + b < 0$ if $y_i = -1$.

Equations (1) and (2) can be combined into one set of inequalities given as follows:

(3)  $y_i (w^T x_i + b) \ge 1$.

Those examples that satisfy eq. (3) with equality are called support vectors. Figure 1 shows an example of a hyperplane for a linearly separable case.

Figure 1: Linear Classifier.

The SVM finds the MMH by solving the following quadratic optimization problem:

(4)  $\min_{w,b} \; \tfrac{1}{2}\|w\|^2$   subject to: $y_i (w^T x_i + b) \ge 1, \; \forall i$.

To solve the above quadratic optimization problem, one needs to find the saddle points of the Lagrange function:

(5)  $L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$,

where α_i ≥ 0 are the Lagrange multipliers, one for the constraint associated with each training instance. By differentiating the above equation with respect to w and b, the following equations are obtained:

(6)  $\dfrac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0$,
(7)  $\dfrac{\partial L(w, b, \alpha)}{\partial b} = \sum_{i=1}^{n} \alpha_i y_i = 0$.

After applying Karush-Kuhn-Tucker conditions to the above equations, the decision function for the SVM classifier finally becomes

(8)  $F(x) = \operatorname{sign}(w^T x + b) = \operatorname{sign}\left( \sum_{i \in SV} y_i \alpha_i (x^T x_i) + b \right)$.

4.2 When Data Are Linearly Inseparable

If the data are not linearly separable, then perfect separation by a linear SVM is not possible. In such cases, either we can extend the linear approach by tolerating a few misclassifications, or we can apply a non-linear SVM that maps the data into some higher-dimensional feature space where they become linearly separable. In the former case, slack variables ξ_i are added to the constraints, and their sum, weighted by a regularization parameter (C), is added to the function to be optimized. This parameter sets a trade-off between a large margin and the misclassification error. Thus, the penalized optimization objective of classification can be rewritten as

(9)  $\min_{w,b,\xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$   subject to: $y_i (w^T x_i + b) \ge 1 - \xi_i, \; \forall i$; $\xi_i \ge 0, \; \forall i$,

where C is a regularization constant (penalty parameter). The value of ξ_i indicates the position of x_i with respect to the margin and the decision boundary:

  • ξi ≥ 1: xi is misclassified.

  • 0 < ξi < 1: xi is correctly classified but lies inside the margin.

  • ξi = 0: xi is correctly classified and lies outside the margin or on the margin boundary.

In the latter method, SVM maps the data into some higher-dimensional space. This mapping is performed by some mapping function, Φ.

Under this mapping, the optimization problem of eq. (9) becomes

(10)  $\min_{w,b,\xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$   subject to: $y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \; \forall i$; $\xi_i \ge 0, \; \forall i$.

Now, the solution obtained in eq. (8) is modified as

(11)  $F(x) = \operatorname{sign}(w^T \phi(x) + b) = \operatorname{sign}\left( \sum_{i \in SV} y_i \alpha_i \, \phi(x)^T \phi(x_i) + b \right)$.

This mapping is performed by a kernel function such that $K(x, y) = \phi(x) \cdot \phi(y)$. Hence, we can rewrite the decision function given in eq. (11) as

(12)  $F(x) = \operatorname{sign}\left( \sum_{i \in SV} y_i \alpha_i K(x, x_i) + b \right)$.

There are four basic kernels [1], as shown in eqs. (13) to (16):

  • Linear:

    (13)  $K(x_i, x_j) = x_i^T x_j$.
  • Polynomial:

    (14)  $K(x_i, x_j) = (g \, x_i^T x_j + r)^d, \; g > 0$.
  • RBF or Gaussian:

    (15)  $K(x_i, x_j) = \exp(-g \, \|x_i - x_j\|^2), \; g > 0$.
  • Sigmoid:

    (16)  $K(x_i, x_j) = \tanh(g \, x_i^T x_j + r)$.

Here, g, r, and d are kernel parameters that should be properly set to improve classification accuracy.
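As an illustration of how these parameters enter the classifier, the following minimal sketch computes the RBF kernel of eq. (15) directly and trains an RBF-kernel SVM with fixed values of C and g. It uses scikit-learn's SVC, which wraps LIBSVM, as a stand-in for the LIBSVM interface used later in the experiments; the toy data and the chosen parameter values are assumptions for demonstration only.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(xi, xj, g):
    """RBF kernel of eq. (15): K(xi, xj) = exp(-g * ||xi - xj||^2)."""
    return np.exp(-g * np.sum((xi - xj) ** 2))

# Toy two-class data for demonstration only.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([-1, -1, 1, 1])

# C (regularization) and gamma (the g of eq. (15)) are exactly the two values
# the proposed MOGA searches over; they are fixed here for illustration.
C, gamma = 8.0, 0.5
print(rbf_kernel(X[0], X[2], gamma))          # kernel value between two points

clf = SVC(kernel="rbf", C=C, gamma=gamma)     # scikit-learn's SVC wraps LIBSVM
clf.fit(X, y)
print(clf.predict([[0.15, 0.15], [0.85, 0.85]]))
```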

5 MOGA Approach to Feature Selection and Parameter Optimization

To deal with the feature selection and model selection problems simultaneously, GAs can be applied directly. However, it is more interesting to combine multiple criteria to evaluate a feature subset. Thus, this work proposes a MOGA-based method dealing with two competing objectives, i.e., maximization of classification accuracy and minimization of the cardinality of the feature subset.

In this work we have used NSGA II, a well-known MOGA, to resolve the multiobjective problem of feature selection. As the task is to optimize the feature subset and SVM parameters together, we have fused the SVM parameters and the feature subset in the same chromosome. The detailed description of the proposed MOGA-based method is as follows.

5.1 Chromosome Representation

The chromosome used in the proposed approach is composed of three parts. The first part represents the feature subset, the second part encapsulates the regularization parameter C, and, as we are confined only to the RBF kernel, the third part of the chromosome is used to represent γ (the RBF kernel parameter). Binary encoding has been the most popular and simplest way to represent feature subsets in GA- and MOGA-based feature selection problems. Thus, we have used binary encoding to represent the chromosome (shown in Figure 2). The first part of the chromosome consists of n_f bits, where n_f is the total number of features in the data set. In this part of the chromosome, the bit value "1" represents the presence of a feature in the subset and "0" indicates the absence of the corresponding feature.

Figure 2: Chromosome Representation.

n_C and n_γ bits have been used to represent C and γ, respectively. The number of bits used to represent C and γ depends on the precision required. The corresponding real values of the parameters C and γ are calculated by the following equation:

$x = \min_x + \dfrac{\max_x - \min_x}{2^{n_x} - 1} \times x'$,

where x is the decoded real value, x′ is the decimal value of the bit string, n_x is the number of bits in the bit string, and max_x and min_x are the maximum and minimum values of parameter x, respectively.
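A minimal decoding sketch is given below. It assumes the bit lengths and search ranges reported in Section 5.5.2 as defaults; the function names (bits_to_real, decode_chromosome) are ours and not from the original implementation.

```python
import numpy as np

def bits_to_real(bits, min_x, max_x):
    """Decode a bit string x' into a real value:
    x = min_x + (max_x - min_x) / (2**n_x - 1) * x'."""
    n_x = len(bits)
    x_prime = int("".join(str(b) for b in bits), 2)
    return min_x + (max_x - min_x) / (2 ** n_x - 1) * x_prime

def decode_chromosome(chrom, n_features, n_c=19, n_gamma=17,
                      c_range=(0.03125, 32768.0),
                      g_range=(0.00003052, 8.0)):
    """Split a binary chromosome into (feature mask, C, gamma); bit lengths
    and search ranges follow Section 5.5.2."""
    mask = np.array(chrom[:n_features], dtype=bool)                 # segment 1
    c_bits = chrom[n_features:n_features + n_c]                     # segment 2
    g_bits = chrom[n_features + n_c:n_features + n_c + n_gamma]     # segment 3
    return mask, bits_to_real(c_bits, *c_range), bits_to_real(g_bits, *g_range)

# Example: a random chromosome for a 24-feature data set (e.g., German credit).
rng = np.random.default_rng(0)
chrom = rng.integers(0, 2, size=24 + 19 + 17).tolist()
mask, C, gamma = decode_chromosome(chrom, n_features=24)
print(mask.sum(), round(C, 4), round(gamma, 6))
```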

5.2 Genetic Operators

One-point crossover, bit flip mutation, and binary tournament selection have been used as genetic operators.
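The following sketch illustrates these three operators on binary chromosomes. The function names and the random seed are ours, and the simple comparator passed to the tournament stands in for the crowded-comparison operator described in Section 5.4.

```python
import random

def one_point_crossover(p1, p2):
    """Swap the tails of two parent bit strings at a random cut point."""
    cut = random.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def bit_flip_mutation(chrom, p_m=0.1):
    """Flip each bit independently with probability p_m."""
    return [1 - b if random.random() < p_m else b for b in chrom]

def binary_tournament(population, better):
    """Pick two individuals at random and keep the better one; in NSGA II,
    'better' is the crowded-comparison operator of Section 5.4."""
    a, b = random.sample(population, 2)
    return a if better(a, b) else b

# Illustration on two short parents (a toy comparator stands in for
# the crowded-comparison operator).
random.seed(1)
c1, c2 = one_point_crossover([1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0])
print(c1, bit_flip_mutation(c2))
print(binary_tournament([c1, c2], better=lambda a, b: sum(a) >= sum(b)))
```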

5.3 Fitness Function

Following the multiobjective definition of feature selection, the fitness of an individual is calculated with respect to two criteria instead of a single objective, i.e., maximization of classification accuracy and minimization of the size of the feature subset. Thus, the fitness function is composed of the fitness values of an individual with respect to each objective, given by eqs. (17) and (18), respectively.

(17)  $f(1) = \text{SVM\_Accuracy}$,
(18)  $f(2) = N - n_f$,

where N is the total number of features in the data set and nf is the number of features selected.
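A possible evaluation routine for one decoded individual is sketched below. It trains an RBF-kernel SVM on the selected features only and returns the two objective values of eqs. (17) and (18). Scikit-learn's SVC is used here in place of the MATLAB LIBSVM interface of the original experiments, the toy data are illustrative, and the handling of an empty feature subset is our own convention.

```python
import numpy as np
from sklearn.svm import SVC

def evaluate_individual(mask, C, gamma, X_tr, y_tr, X_val, y_val):
    """Return (f1, f2) = (validation accuracy, N - n_f) for one decoded
    chromosome, following eqs. (17) and (18)."""
    if not mask.any():                 # empty subset: assign the worst fitness
        return 0.0, 0.0                # (our own convention)
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    clf.fit(X_tr[:, mask], y_tr)       # train on the selected features only
    accuracy = clf.score(X_val[:, mask], y_val)
    return accuracy, float(mask.size - mask.sum())      # N - n_f

# Usage with toy data (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((60, 8))
y = (X[:, 0] + X[:, 2] > 1).astype(int)
mask = np.array([1, 0, 1, 0, 0, 0, 0, 0], dtype=bool)
print(evaluate_individual(mask, C=8.0, gamma=0.5,
                          X_tr=X[:40], y_tr=y[:40],
                          X_val=X[40:], y_val=y[40:]))
```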

5.4 Proposed Procedure

The main steps employed in the proposed approach are described as follows:

  1. Transformation and scaling of data: Each feature of the data set is first transformed to the format of the SVM package. Then, a simple scaling is conducted on the data to bring it into the range [0, 1]. This is done to save computational time. Moreover, it prevents attributes with large numeric ranges from dominating those with smaller numeric ranges.

  2. Partitioning the data for cross-validation: Two-fold cross-validation has been adopted for evaluating the feature subsets. First, the whole data set is divided into two partitions in a ratio of 60:40; the former is called the training set and the latter the test set. The classifier used to evaluate the quality of a feature subset in each iteration cannot be judged on the test set; thus, a part of the training set is reserved for evaluating classification performance inside the loop of the feature selection procedure. The training set is divided into two halves, i.e., training-training and training-test. Training-training is used for training the SVM model, and training-test is used for measuring classification performance during the feature selection process. Finally, when the feature selection process terminates, the best feature subset and model parameters are fed to the classifier and its performance is calculated on the test set.

  3. Extracting feature subset and parameters from the population: The chromosomes encapsulating features and model parameters are decoded in this step.

  4. SVM training and fitness calculation: The parameters C and γ extracted from each chromosome are used to train the SVM on the reduced data set containing only the features indicated by the individual. Then, fitness is measured by running the classifier on the training-test set. The classification accuracy and the cardinality of the feature subset are assigned as the objective function values.

  5. Non-dominated sorting for Pareto fronts: Non-domination ranks are calculated for each individual. For each solution p, two quantities are calculated: (i) the domination count n_p, the number of solutions that dominate p, and (ii) S_p, the set of solutions that p dominates. All solutions with a domination count of zero are placed in the first front. Then, for each solution p in the first front, we visit every member q of its set S_p and reduce n_q by 1. If n_q becomes zero for any solution q, we put it in the second front. The procedure is then repeated for the solutions of the second front to identify the third front, and so on until all the fronts are known. (A code sketch of this sorting procedure, together with the crowding distance of the next step, is given after this list.)

  6. Sorting based on the crowded comparison operator: NSGA II uses a crowded comparison approach for maintaining diversity in the population, based on a density-estimation metric called the crowding distance. Every solution in the population has two attributes: (i) non-domination rank ($i_{rank}$) and (ii) crowding distance ($i_{distance}$). The crowded-comparison operator ($\prec_n$) is defined as: $i \prec_n j$ if $(i_{rank} < j_{rank})$ or $((i_{rank} = j_{rank})$ and $(i_{distance} > j_{distance}))$.

    The population is sorted according to the crowded comparison operator.

  7. Reproduction of population: The binary tournament selection, recombination, and mutation operators are applied to the population P of size N to create the first offspring population Q of size N.

  8. Construction of combined population: A combined population R= P+ Q of size 2N is formed.

  9. Rank and crowding distance calculation: Non-domination ranks and the crowding distances of the combined population are calculated. This new combined population R is sorted according to the crowded comparison operator.

  10. Replacement strategy: The new population is formed by adding the solutions of the first front (i.e., solutions having the first rank), then those of the second front, and so on, until adding the next front would make the new population exceed size N. Thereafter, the solutions of that last accepted front are sorted according to their crowding distance values, and the first points are picked to fill the population up to size N.

  11. Termination condition: When all the individuals fall in the same front, i.e., first front, the process terminates; otherwise, we proceed to the next generation.
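The sketch below gives a compact, self-contained version of steps 5 and 6 (fast non-dominated sorting and the crowding distance) for two maximization objectives, following the standard NSGA II description [3]. It is illustrative code, not the authors' MATLAB implementation, and the example objective pairs are hypothetical.

```python
def dominates(a, b):
    """True when a is no worse than b on every objective and strictly better
    on at least one (both objectives are to be maximized here)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def fast_non_dominated_sort(objs):
    """Step 5: return the list of fronts, each a list of solution indices."""
    n = len(objs)
    S = [[] for _ in range(n)]      # S[p]: solutions dominated by p
    n_dom = [0] * n                 # n_dom[p]: number of solutions dominating p
    fronts = [[]]
    for p in range(n):
        for q in range(n):
            if dominates(objs[p], objs[q]):
                S[p].append(q)
            elif dominates(objs[q], objs[p]):
                n_dom[p] += 1
        if n_dom[p] == 0:
            fronts[0].append(p)
    i = 0
    while fronts[i]:
        nxt = []
        for p in fronts[i]:
            for q in S[p]:
                n_dom[q] -= 1
                if n_dom[q] == 0:
                    nxt.append(q)
        fronts.append(nxt)
        i += 1
    return fronts[:-1]              # drop the trailing empty front

def crowding_distance(objs, front):
    """Step 6: crowding distance of every member of one front."""
    dist = {p: 0.0 for p in front}
    for m in range(len(objs[0])):
        ordered = sorted(front, key=lambda p: objs[p][m])
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")   # boundary points
        span = (objs[ordered[-1]][m] - objs[ordered[0]][m]) or 1.0
        for k in range(1, len(ordered) - 1):
            dist[ordered[k]] += (objs[ordered[k + 1]][m]
                                 - objs[ordered[k - 1]][m]) / span
    return dist

# Hypothetical (accuracy, N - n_f) pairs for five chromosomes.
objs = [(0.80, 5), (0.78, 8), (0.78, 6), (0.76, 10), (0.74, 9)]
fronts = fast_non_dominated_sort(objs)
print(fronts)                            # [[0, 1, 3], [2, 4]]
print(crowding_distance(objs, fronts[0]))
```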

5.5 Experimental Evaluations and Comparisons

5.5.1 Data Sets

To facilitate comparison of the proposed approach with the GA-based feature and model selection of Reference [9], we have chosen the same nine data sets as used by Huang and Wang. A summary of the data sets is given in Table 1. (The data sets are available in the University of California Irvine Machine Learning Repository.)

Table 1

Data Set Description.

Sr. No. | Name | #Instances | #Features | #Classes
1 | German (credit card) | 1000 | 24 | 2
2 | Australian (credit card) | 690 | 14 | 2
3 | Pima Indian Diabetes | 760 | 8 | 2
4 | Heart Disease (Statlog project) | 270 | 13 | 2
5 | Breast Cancer (Wisconsin) | 699 | 10 | 2
6 | Contraceptive Method Choice (CMC) | 1473 | 9 | 3
7 | Ionosphere | 351 | 34 | 2
8 | Iris | 150 | 4 | 3
9 | Sonar | 208 | 60 | 2

5.5.2 Parameter Setting

NSGA II, used in the proposed work, was applied with the following parameters: pop size=20, number of generations=50, crossover probability=0.8, mutation probability=0.1. A tour size of 2 and a pool size half that of the pop size are taken as parameters for tournament selection. The termination criteria are that either the generation count reaches 50 or all the individuals appear in the first front, i.e., all the solutions become non-dominated.

As shown in Figure 2, the segmented chromosome encodes three values, i.e., the feature subset, the regularization parameter C, and the RBF kernel parameter γ. The length of the first segment (n_f) depends on the number of features in the data set, and thus varies from one data set to another. After reviewing the literature, we have set the searching range for parameter C as [0.03125, 32768] and that for γ as [0.00003052, 8]. To span these ranges with appropriate precision, the length of the second segment (n_C) is taken as 19 and that of the third segment (n_γ) as 17. The grid search algorithm taken for comparison purposes considers the same ranges of C and γ.

5.5.3 Performance Measures

As the proposed approach for feature selection and parameter optimization is oriented toward the task of classification, we have used the following measures to evaluate the classification performance of each chromosome: sensitivity, specificity, and accuracy, i.e., overall hit rate (OHR). Sensitivity measures the proportion of actual positives that are correctly identified, while specificity is the proportion of correctly classified negative examples. For a two-class problem, these measures can be defined as

Sensitivity (positive hit rate) = true_pos / pos,
Specificity (negative hit rate) = true_neg / neg,
OHR = (true_pos + true_neg) / (pos + neg),

where true_pos refers to the number of positive tuples correctly predicted by the classifier and true_neg is the number of negative examples correctly classified; pos and neg are the total numbers of positive and negative examples, respectively.

For the multiple-class data sets, the accuracy is determined only by the OHR and is given by

$OHR = \dfrac{\sum_{i=1}^{noc} true_i}{\sum_{i=1}^{noc} all_i}$,

where true_i is the number of class-i tuples that are correctly predicted, all_i is the total number of examples that fall in class i, and noc is the number of classes.

The OHR is also recognized as the accuracy of the classifier.
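A small sketch of these measures is given below; the function names and the example labels are ours.

```python
import numpy as np

def two_class_measures(y_true, y_pred, positive=1):
    """Sensitivity = true_pos/pos, specificity = true_neg/neg,
    OHR = (true_pos + true_neg) / (pos + neg)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos = y_true == positive
    neg = ~pos
    true_pos = np.sum(pos & (y_pred == y_true))
    true_neg = np.sum(neg & (y_pred == y_true))
    return (true_pos / pos.sum(),
            true_neg / neg.sum(),
            (true_pos + true_neg) / len(y_true))

def overall_hit_rate(y_true, y_pred):
    """Multi-class OHR: correctly predicted tuples over all tuples."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

# Example.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
print(two_class_measures(y_true, y_pred))   # (0.667, 0.8, 0.75)
print(overall_hit_rate(y_true, y_pred))     # 0.75
```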

Additionally, AUC [area under the receiver operating characteristic (ROC) curve], a single-number summary, has also been used to assess the classification performance more appropriately.

5.5.4 Computations, Results, and Discussion

The experimentation on NSGA II for feature selection and model selection was carried out in the MATLAB environment on a third-generation Intel Core processor running at 3.30 GHz with 3 GB RAM. For testing the classification performance of feature subsets, we have used the MATLAB-compatible version of LIBSVM. The resulting Pareto solutions in the case of the German data set are summarized in Table 2 and shown in Figure 3. The results show that none of the solutions is superior to any other solution; rather, the solutions reflect a trade-off between the objectives. Figure 3 also reveals the diversity among solutions, which is a desirable quality of a good Pareto front [26]. The corresponding OHR, C, and γ values are summarized in Table 3.

Table 2

Final Unique Non-dominated Solutions in Case of the German Data Set.

Sr. No. | Accuracy | #Features
1 | 76.25 | 19
2 | 77.0833 | 18
3 | 77.0833 | 18
4 | 77.9167 | 17
5 | 78.3333 | 16
6 | 79.1667 | 14
7 | 79.5833 | 13
8 | 80 | 12
Average | 78.17708 | 15.875
Figure 3: Non-dominated Solutions in Case of the German Data Set.

Table 3

Optimized Parameters for the German Data Set Using the MOGA-Based Approach.

Sr. No. | OHR | Optimized C | Optimized γ
1 | 0.7625 | 6.893939 | 0.43979
2 | 0.7708 | 8.304794 | 0.195405
3 | 0.7708 | 8.936868 | 0.69736
4 | 0.7792 | 0.812104 | 0.197358
5 | 0.7792 | 8.93699 | 0.69736
6 | 0.7833 | 40.28385 | 1.307714
7 | 0.7833 | 40.78434 | 1.698584
8 | 0.7917 | 40.34665 | 0.374849
9 | 0.7958 | 40.28434 | 0.374849
10 | 0.7958 | 40.34665 | 0.374849

We have analyzed the frequency of features in all the non-dominated solutions returned by the MOGA-based feature selection method for some data sets. Almost all relevant features (features with high information gains) are selected in most of the solutions, which reflects the high quality of the overall feature subsets. Tables 4 and 5 show the frequency of selected features in the case of the German and Australian data sets for a population size of 20.

Table 4

Frequency of Selected Features in 20 Pareto Solutions Obtained for the German Data Set.

#Feature^a | 1 | 3 | 6 | 4 | 5 | 17 | 19 | 8 | 16 | 11
Frequency | 19 | 20 | 17 | 18 | 17 | 1 | 1 | 3 | 0 | 0

^a Features in decreasing order of information gains.

Table 5

Frequency of Selected Features in 20 Pareto Solutions Obtained for the Australian Data Set.

#Feature | 8 | 10 | 13 | 1 | 11 | 3 | 2
Frequency | 20 | 15 | 15 | 10 | 0 | 0 | 0

5.5.5 Comparison with the GA-Based Approach

The performance of the proposed MOGA-based method has been compared with the GA-based feature selection and parameter optimization implemented in Reference [9]. We have summarized the positive hit rate (PHR), negative hit rate (NHR), and OHR of the proposed approach and the GA-based approach in Table 6, which shows that the OHR corresponding to every solution of the MOGA-based approach is either comparable to or slightly lower than that of the GA-based approach. This illustrates the inherent optimizing ability of GAs within MOGAs.

Table 6

Comparison of the Proposed Approach with the GA-Based Approach.

Name | #Features | MOGA: #Selected Features | MOGA: Avg. PHR | MOGA: Avg. NHR | MOGA: Avg. OHR % ± Std | GA: #Selected Features | GA: Avg. PHR | GA: Avg. NHR | GA: Avg. OHR %
German | 24 | 16.25±2.47 | 0.94 | 0.41 | 79.25±2.82 | 13±1.83 | 0.89 | 0.77 | 85.6±1.96
Australian | 14 | 9.15±0.83 | 0.88 | 0.88 | 88.29±0.51 | 3±2.45 | 0.85 | 0.92 | 88.1±2.25
Diabetes | 8 | 5.35±0.48 | 0.62 | 0.91 | 81.03±0.24 | 3.7±0.95 | 0.78 | 0.87 | 81.5±7.13
Heart Disease | 13 | 9.6±0.50 | 0.91 | 0.89 | 90.15±3.09 | 5.4±1.85 | 0.94 | 0.95 | 94.8±3.32
Breast Cancer | 10 | 6.5±0.51 | 0.97 | 0.95 | 96.64±0.91 | 1±0 | 0.98 | 0.89 | 96.19±1.24
CMC | 9 | 6.88±0.46 | N/A | N/A | 58.76±2.34 | 5.4±0.53 | N/A | N/A | 71.22±4.15
Ionosphere | 34 | 26±0 | 0.99 | 0.94 | 97.74±0.36 | 6±0 | 0.99 | 0.98 | 98.56±2.03
Iris | 4 | 2.9±0.31 | N/A | N/A | 97.5±0.85 | 1±0 | N/A | N/A | 100±0
Sonar | 60 | 46.15±2.2 | 0.89 | 0.99 | 92.5±1.27 | 15±1.1 | 0.98 | 0.98 | 98±3.5

5.5.6 Comparison with Grid Search Algorithm

Table 7 shows the comparison between the proposed approach and the grid search algorithm for model selection. The classification performance of the MOGA-based method is better than that of the grid search method for each data set, which demonstrates the superiority of the proposed method over the grid algorithm.

Table 7

Comparison of the Proposed Approach with the Grid Algorithm.

Name | #Features | MOGA: Avg. PHR | MOGA: Avg. NHR | MOGA: Avg. OHR % | Grid: Avg. PHR | Grid: Avg. NHR | Grid: Avg. OHR %
German | 24 | 0.94 | 0.41 | 79.25±2.82 | 0.89 | 0.46 | 76±4.06
Australian | 14 | 0.88 | 0.88 | 88.29±0.51 | 0.89 | 0.82 | 84.7±4.74
Diabetes | 8 | 0.62 | 0.91 | 81.03±0.24 | 0.59 | 0.88 | 77.3±3.03
Heart Disease | 13 | 0.91 | 0.89 | 90.15±3.09 | 0.75 | 0.90 | 83.7±6.34
Breast Cancer | 10 | 0.97 | 0.95 | 96.64±0.91 | 0.98 | 0.94 | 95.3±2.28
CMC | 9 | N/A | N/A | 58.76±2.34 | N/A | N/A | 53.53±2.43
Ionosphere | 34 | 0.99 | 0.94 | 97.74±0.36 | 0.94 | 0.9 | 89.44±3.58
Iris | 4 | N/A | N/A | 97.5±0.85 | N/A | N/A | 97.37±3.46
Sonar | 60 | 0.89 | 0.99 | 92.5±1.27 | 0.65 | 0.9 | 87±4.22

5.5.7 ROC Analysis

ROC curves for four data sets are plotted in Figure 4. Table 8 depicts the AUC of the MOGA-based method, the GA-based method, and the grid algorithm for all the binary class data sets. The GA-based method performs best among the three in terms of this measure; the AUC of the proposed method is always greater than that of the grid algorithm but slightly inferior to that of the GA-based method.

Table 8

Average AUC for Two-Class Data Sets.

Data Set | MOGA-Based Approach | GA-Based Approach | Grid Algorithm
German | 0.8354 | 0.8424 | 0.7886
Australian | 0.8835 | 0.9019 | 0.7585
Diabetes | 0.8278 | 0.82672 | 0.8258
Heart Disease | 0.8754 | 0.9083 | 0.7495
Breast Cancer | 0.9935 | 0.9990 | 0.9683
Ionosphere | 0.9613 | 1 | 0.9424
Sonar | 0.9554 | 0.9803 | 0.8094
Figure 4: ROC Curves for the (A) Heart Disease, (B) Diabetes, (C) Sonar, and (D) Ionosphere Data Sets.

5.6 Selecting a Solution from a Set of Pareto Optimal Solutions

The MOGA-based approach to feature selection produces multiple non-dominated solutions, and there are various ways to select one of them. We have applied the "no preference articulation" method [13] for generating the Pareto optimal solutions; thus, all the solutions obtained are presented to the user or data miner, who finally selects the best solution. The user can choose the most promising solution according to the measurement cost involved with each feature subset. Alternatively, an aggregation function composed of all objectives can be used to evaluate the final Pareto optimal solutions and help in filtering out a single best solution. We can also pick the most frequently occurring features across all the solutions to make a final best feature subset. Thus, there is a choice to select a feature subset out of many according to the user's requirements.
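One possible realization of such an aggregation function is sketched below: a weighted sum of the normalized accuracy and subset-size objectives, applied to the unique Pareto solutions of Table 2. The weights and the normalization scheme are our assumptions, not part of the proposed method.

```python
def select_by_aggregation(solutions, w_acc=0.7, w_size=0.3):
    """Pick one Pareto solution by a weighted sum of normalized objectives.
    Each solution is (accuracy %, number of selected features); the weights
    are illustrative and would be set by the user."""
    accs = [s[0] for s in solutions]
    sizes = [s[1] for s in solutions]
    a_lo, a_hi = min(accs), max(accs)
    s_lo, s_hi = min(sizes), max(sizes)

    def score(sol):
        acc_norm = (sol[0] - a_lo) / ((a_hi - a_lo) or 1.0)    # higher is better
        size_norm = (s_hi - sol[1]) / ((s_hi - s_lo) or 1.0)   # fewer features is better
        return w_acc * acc_norm + w_size * size_norm

    return max(solutions, key=score)

# Unique Pareto solutions of Table 2 (accuracy %, #features), German data set.
pareto = [(76.25, 19), (77.0833, 18), (77.9167, 17), (78.3333, 16),
          (79.1667, 14), (79.5833, 13), (80.0, 12)]
print(select_by_aggregation(pareto))   # -> (80.0, 12)
```

With these weights the rule picks the (80%, 12 features) solution; a cost-sensitive user could instead replace the subset-size term with the measurement costs discussed next (Tables 9 and 10).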

We have performed a cost analysis of two medical domain data sets: the Pima Indian Diabetes data set and the Heart Disease data set. The costs involved in measuring the features of these data sets are listed in Tables 9 and 10, taken from Reference [20].

Table 9

Measurement Cost of Attributes in Diabetes Data Set.

Attribute | Group Cost
1. Number of times pregnant | $1.00
2. Glucose tolerance test | A: $17.61 if first test in group A; $15.51 otherwise
3. Diastolic blood pressure | $1.00
4. Triceps skin fold thickness | $1.00
5. Serum insulin test | A: $22.78 if first test in group A; $20.68 otherwise
6. Body mass index | $1.00
7. Diabetes pedigree function | $1.00
8. Age | $1.00
Table 10

Measurement Cost of Attributes in Heart Disease Data Set.

Attribute | Group Cost
1. Age | $1.00
2. Sex | $1.00
3. Chest pain type | $1.00
4. Rest blood pressure | $1.00
5. Serum cholesterol | A: $7.27 if first test in group A; $5.17 otherwise
6. Fasting blood sugar | A: $5.20 if first test in group A; $3.10 otherwise
7. Rest electrocardiographic | $15.50
8. Max heart rate | B: $102.90 if first test in group B; $1.00 otherwise
9. Exercise induced | C: $87.30 if first test in group C; $1.00 otherwise
10. Old peak | C: $87.30 if first test in group C; $1.00 otherwise
11. Slope | C: $87.30 if first test in group C; $1.00 otherwise
12. Major vessels | $100.90
13. Thal | B: $102.90 if first test in group B; $1.00 otherwise

Some of the solutions produced by the MOGA-based approach are summarized in Tables 11 and 12 for the two data sets. The first preference of the user is to choose a solution with the largest predictive accuracy and the least measurement cost. If no such solution exists, the user can compromise a little on predictive accuracy to gain benefits in terms of measurement cost. For instance, in the case of the three listed solutions of the Diabetes data set, the user can opt for the first solution if measurement cost is the main concern; otherwise, the second solution is a fairly good choice that gives slightly better accuracy at a moderately higher cost. Similarly, in the case of the Heart Disease data set, the first and third solutions are both superior to the second in terms of predictive accuracy as well as measurement cost.

Table 11

Non-dominated Solutions in Case of the Diabetes Data Set.

Solu. No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Accuracy | Total Measurement Cost
1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 77.17 | 18.61
2 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 80.97 | 21.61
3 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 79.34 | 39.29
“1” indicates the presence of feature and “0” indicates its absence.

Table 12

Non-dominated Solutions in Case of the Heart Disease Data Set.

Solu. No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | Accuracy | Total Measurement Cost
1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 89.23 | 195.4
2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 87.69 | 293.1
3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 87.69 | 194.4

6 Conclusion and Future Dimensions

We have devised a method for simultaneous tuning of the feature subset and the parameters of the SVM classifier. The results obtained from the experiments performed with the proposed MOGA-based method demonstrate its high classification performance. The proposed method is better than the grid search algorithm and almost comparable to the GA-based method when used with the SVM classifier. It results in multiple non-dominated solutions instead of a single solution, providing the user with multiple feature subsets along with tuned SVM parameters for classification. The user can opt for any of the solutions according to his/her requirements and available resources.

This work can be extended to tune the parameters associated with classifiers other than SVM. In this work, we have not considered feature grouping while calculating measurement cost; the proposed method can be further experimented with by incorporating feature grouping into the selection process.


Corresponding author: Jyoti Ahuja, Department of Computer Science and Engineering, Guru Jambeshwar University of Science and Technology, Hisar, Haryana 125001, India, Tel.: +895 0598 477, e-mail:

Bibliography

[1] A. Ben-Hur and J. Weston, A user's guide to support vector machines, in: Data Mining Techniques for the Life Sciences, O. Carugo and F. Eisenhaber (eds.), pp. 223–239, Humana Press, New York, 2010. doi: 10.1007/978-1-60327-241-4_13.

[2] A. Boubezoul and S. Paris, Application of global optimization methods to model and feature selection, Pattern Recognit. 45 (2012), 3676–3686. doi: 10.1016/j.patcog.2012.04.015.

[3] K. Deb, A. Pratap, S. Agarwal and T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2001), 182–197. doi: 10.1109/4235.996017.

[4] A. H. F. Dias and J. A. de Vasconcelos, Multiobjective genetic algorithms applied to solve optimization problems, IEEE Trans. Magn. 38 (2002), 1133–1136. doi: 10.1109/20.996290.

[5] M. E. El Alami, A filter model for feature subset selection based on genetic algorithm, Knowl.-Based Syst. 22 (2009), 356–362. doi: 10.1016/j.knosys.2009.02.006.

[6] C. Emmanouilidis, A. Hunter and J. MacIntyre, A multiobjective evolutionary setting for feature selection and a commonality-based crossover operator, in: Proceedings of the 2000 Congress on Evolutionary Computation, vol. 1, pp. 309–316, 2000.

[7] F. Friedrichs and C. Igel, Evolutionary tuning of multiple SVM parameters, Neurocomputing 64 (2005), 107–117. doi: 10.1016/j.neucom.2004.11.022.

[8] T. M. Hamdani, J.-M. Won, A. M. Alimi and F. Karray, Multi-objective feature selection with NSGA II, in: Adaptive and Natural Computing Algorithms, B. Beliczynski, A. Dzielinski, M. Iwanowski and B. Ribeiro (eds.), pp. 240–247, Springer, Berlin, 2007. doi: 10.1007/978-3-540-71618-1_27.

[9] C.-L. Huang and C.-J. Wang, A GA-based feature selection and parameters optimization for support vector machines, Expert Syst. Appl. 31 (2006), 231–240. doi: 10.1016/j.eswa.2005.09.024.

[10] M. M. Kabir, M. Shahjahan and K. Murase, A new local search based hybrid genetic algorithm for feature selection, Neurocomputing 74 (2011), 2914–2928. doi: 10.1016/j.neucom.2011.03.034.

[11] A. Konak, D. W. Coit and A. E. Smith, Multi-objective optimization using genetic algorithms: a tutorial, Reliab. Eng. Syst. Saf. 91 (2006), 992–1007. doi: 10.1016/j.ress.2005.11.018.

[12] S. Maldonado and R. Weber, A wrapper method for feature selection using support vector machines, Inf. Sci. 179 (2009), 2208–2217. doi: 10.1016/j.ins.2009.02.014.

[13] K. Miettinen, Nonlinear Multiobjective Optimization, vol. 12, Kluwer Academic Publishers, Boston, 1999.

[14] A. Mukhopadhyay and U. Maulik, An SVM-wrapped multiobjective evolutionary feature selection approach for identifying cancer-microRNA markers, IEEE Trans. NanoBiosci. 12 (2013), 275–281. doi: 10.1109/TNB.2013.2279131.

[15] A. Mukhopadhyay, U. Maulik, S. Bandyopadhyay and C. Coello Coello, A survey of multi-objective evolutionary algorithms for data mining: part-I, IEEE Trans. Evol. Comput. 18 (2014), 4–19. doi: 10.1109/TEVC.2013.2290086.

[16] G. L. Pappa, A. A. Freitas and C. A. A. Kaestner, Attribute selection with a multi-objective genetic algorithm, in: Advances in Artificial Intelligence, G. Bittencourt and G. L. Ramalho (eds.), pp. 280–290, Springer, Berlin, 2002. doi: 10.1007/3-540-36127-8_27.

[17] B. Schölkopf, K. Tsuda and J.-P. Vert, Kernel Methods in Computational Biology, MIT Press, Cambridge, 2004. doi: 10.7551/mitpress/4057.001.0001.

[18] F. Tan, X. Fu, Y. Zhang and A. G. Bourgeois, A genetic algorithm-based method for feature subset selection, Soft Comput. 12 (2008), 111–120. doi: 10.1007/s00500-007-0193-8.

[19] K. C. Tan, E. F. Khor and T. H. Lee, Multiobjective Evolutionary Algorithms and Applications (Advanced Information and Knowledge Processing), Springer-Verlag, Secaucus, NJ, USA, 2005.

[20] P. Turney, Cost-sensitive classification: empirical evaluation of a hybrid genetic decision tree induction algorithm, J. Artif. Intell. Res. 2 (1995), 369–409. doi: 10.1613/jair.120.

[21] M. Venkatadri and K. Srinivasa Rao, A multiobjective genetic algorithm for feature selection in data mining, Int. J. Comput. Sci. Inf. Technol. 1 (2010), 443–448.

[22] C.-M. Wang and Y.-F. Huang, Evolutionary-based feature selection approaches with new criteria for data mining: a case study of credit approval data, Expert Syst. Appl. 36 (2009), 5900–5908. doi: 10.1016/j.eswa.2008.07.026.

[23] C.-H. Wu, G.-H. Tzeng, Y.-J. Goo and W.-C. Fang, A real-valued genetic algorithm to optimize the parameters of support vector machine for predicting bankruptcy, Expert Syst. Appl. 32 (2007), 397–408. doi: 10.1016/j.eswa.2005.12.008.

[24] M. Zhao, C. Fu, L. Ji, K. Tang and M. Zhou, Feature selection and parameter optimization for support vector machines: a new approach based on genetic algorithm with feature chromosomes, Expert Syst. Appl. 38 (2011), 5197–5204. doi: 10.1016/j.eswa.2010.10.041.

[25] E. Zitzler and L. Thiele, Multiobjective optimization using evolutionary algorithms – a comparative case study, in: Parallel Problem Solving from Nature – PPSN V, Lecture Notes in Computer Science, vol. 1498, pp. 292–301, Springer, 1998. doi: 10.1007/BFb0056872.

[26] E. Zitzler, K. Deb and L. Thiele, Comparison of multiobjective evolutionary algorithms: empirical results, Evol. Comput. 8 (2000), 173–195. doi: 10.1162/106365600568202.

Received: 2014-7-13
Published Online: 2014-8-27
Published in Print: 2015-6-1

©2015 by De Gruyter

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
