MOFSRank: A Multiobjective Evolutionary Algorithm for Feature Selection in Learning to Rank

Cheng, Fan; Guo, Wei; Zhang, Xingyi

doi:https://doi.org/10.1155/2018/7837696

Complexity


Research Article | Open Access

Volume 2018 | Article ID 7837696 | https://doi.org/10.1155/2018/7837696

Fan Cheng,^1,2 Wei Guo,^2 and Xingyi Zhang^1,2

Academic Editor: Rongqing Zhang

Received 28 May 2018

Revised 23 Oct 2018

Accepted 10 Nov 2018

Published 02 Dec 2018

Abstract

Learning to rank has attracted increasing interest in the past decade, owing to its wide application in areas such as document retrieval and collaborative filtering. Feature selection for learning to rank aims to select a small number of features from the original large feature set while maintaining high ranking accuracy, since in many real ranking applications many features are redundant or even irrelevant. To this end, this paper proposes a multiobjective evolutionary algorithm, termed MOFSRank, for feature selection in learning to rank, which consists of three components. First, an instance selection strategy chooses the informative instances from the ranking training set, which removes redundant data and improves training efficiency. Then, on the selected instance subsets, a multiobjective feature selection algorithm with an adaptive mutation is developed, which obtains good feature subsets by selecting features with high ranking accuracy and low redundancy. Finally, an ensemble strategy uses the obtained feature subsets to produce a better set of features. Experimental results on benchmark data sets confirm the advantage of the proposed method over state-of-the-art methods.

1. Introduction

As a central issue in many applications, such as document retrieval [1], collaborative filtering [2], and expert finding [3], learning to rank has attracted much attention in the machine learning community during the last decade. When applied to document retrieval, learning to rank works as follows [1]. In the learning stage, a ranking model is constructed from training data that consist of queries, their corresponding retrieved documents, and relevance levels given by human annotators. In the ranking stage, given a new query, the documents are sorted by the trained ranking model.

Owing to this wide use, a great number of learning to rank algorithms have been proposed, which produce ranking models with high accuracy [4–11]. However, in several real ranking applications, such as image retrieval [12, 13] and biomarker finding [14], the number of features in the training data is large. This poses great challenges to existing ranking methods, since many features in these applications are redundant or even irrelevant, which reduces the performance of ranking algorithms [15]. To tackle this issue, considerable effort has recently been devoted to designing feature selection algorithms for learning to rank. For example, Geng et al. proposed the first filter based work, termed Greedy Search Algorithm (GAS), for feature selection in learning to rank [15]. In GAS, the feature that maximizes the total importance score and minimizes the total similarity score is iteratively selected to obtain the final feature subset. Experimental results demonstrated the effectiveness of GAS when compared with traditional ranking algorithms. Since then, many other filter based feature selection algorithms for ranking have been developed [16–19]. Another type of feature selection algorithm for learning to rank follows the wrapper approach, where a rank learning algorithm is included in the feature selection procedure to create a good feature subset. BRTree [20], RankWrapper [21], BFS-Wrapper [22], and GreedyRankRLS [23] are representative works of this type. Recently, embedded methods have been proposed to solve feature selection for learning to rank, where feature selection is embedded in the ranker construction by introducing a sparse regularization term. For example, RSRank [24], FenchelRank [25], and FSMRank [26] adopted an L1 regularization term, whereas an embedded feature selection algorithm using a nonconvex regularization was suggested in [27].

The existing feature selection algorithms for learning to rank have shown promising performance in obtaining a small number of features with high ranking accuracy. However, all these algorithms solve the problem with traditional optimization techniques only, such as the greedy method and the gradient descent method. Different from them, in this paper we tackle the issue by using evolutionary computation as the optimization technique. To be specific, a Multi-Objective evolutionary algorithm for Feature Selection in learning to Rank, named MOFSRank, is proposed. The main contributions of this paper can be summarized as follows:

(1) A multiobjective feature selection method with an adaptive mutation is suggested, where the features with high ranking accuracy and low redundancy are selected as the feature subsets. Based on the suggested method, a multiobjective evolutionary algorithm, named MOFSRank, is proposed for feature selection in learning to rank.

(2) In MOFSRank, an instance selection strategy is developed to choose the informative instances from the training data, by which the redundant data is removed and the learning process of feature selection is sped up. In addition, an ensemble strategy is designed in MOFSRank, where the selected feature subsets are further used to produce a set of better features.

(3) The effectiveness of the proposed MOFSRank is evaluated on the benchmark data sets, and the experimental results show that, compared with existing work, the proposed algorithm has superior performance in terms of both ranking accuracy and number of selected features.

The remainder of the paper is organized as follows. In Section 2, the preliminaries and related work are presented. Section 3 gives the details of the proposed algorithm, and Section 4 reports empirical results comparing our algorithm with several state-of-the-art methods on the benchmark data sets. Section 5 concludes the paper and discusses future work.

2. Preliminaries and Related Work

2.1. Learning to Rank

Learning to rank, when applied to document retrieval, can be described as follows. Assume that there is a collection of $n$ queries for training, denoted as $Q = \{q_1, q_2, \ldots, q_n\}$. Each query $q_i$ is associated with a list of documents, $D_i = \{d_{i1}, d_{i2}, \ldots, d_{in_i}\}$, whose relevance to $q_i$ is given by a vector $y_i = (y_{i1}, y_{i2}, \ldots, y_{in_i})$, where $y_{ij} \in \{r_1, r_2, \ldots, r_K\}$ and $K$ is the number of ranks. There exists a total order between the ranks, $r_K \succ r_{K-1} \succ \cdots \succ r_1$, where $\succ$ denotes the order relation. With the training data, learning to rank constructs a ranking model $f$, which for a given new query $q$ can rank the documents associated with $q$ such that more relevant documents are ranked higher than less relevant ones.

To obtain accurate ranking models, different learning to rank algorithms have been proposed, which can be divided into three categories: the Pointwise approach, the Pairwise approach, and the Listwise approach [1]. The Pointwise approach uses each single document as a learning instance and defines the loss function on individual documents [4, 28]. The Pairwise approach regards a pair of documents as a learning instance and transforms the ranking problem into binary classification on document pairs [5–7]. The Listwise approach solves the ranking problem in a straightforward fashion: it takes the entire ranked list of documents as a learning instance and defines a Listwise loss function for learning [8–11]. Among these three approaches, the Pairwise one has attracted much attention, since in real ranking applications, such as search engines and recommendation systems, training data of this kind can be easily obtained from users' click-through data [5]. More algorithms for learning to rank can be found in [1].
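
The Pairwise transformation described above can be sketched in a few lines. This is only a minimal illustration, not the construction used by any particular ranker: the function name `make_pairs` and the use of feature difference vectors as instances are assumptions made for the example.

```python
from itertools import combinations


def make_pairs(features, labels):
    """Turn the documents of one query into pairwise classification instances.

    features: list of feature vectors, one per document
    labels:   list of relevance levels (larger means more relevant)
    Returns a list of (difference_vector, target) tuples, target is +1 or -1.
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue  # equally relevant documents express no preference
        diff = [a - b for a, b in zip(features[i], features[j])]
        target = 1 if labels[i] > labels[j] else -1
        pairs.append((diff, target))
    return pairs


# Three documents of one query with three distinct relevance levels.
docs = [[0.9, 0.1], [0.4, 0.3], [0.2, 0.8]]
rels = [2, 1, 0]
pairs = make_pairs(docs, rels)
```

With $m$ documents of pairwise distinct relevance, a query yields $m(m-1)/2$ such instances, so the number of Pairwise instances grows quadratically with the number of documents.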

2.2. Feature Selection Methods for Learning to Rank

The different types of ranking algorithms have shown promising performance in achieving the models with high accuracy. However, in several real ranking applications, the number of training features is large, which brings great challenges to learning to rank algorithms. To tackle the issue, recently, researchers introduced feature selection to the ranking methods and a variety of feature selection algorithms for learning to rank have been suggested, which mainly fall into three categories: filter approach, wrapper approach, and embedded approach [27, 29].

The filter approach is independent of the ranking method. One representative work is GAS, proposed by Geng et al., which is also the first feature selection algorithm for learning to rank [15]. The basic idea of GAS is to select a subset of features with maximum total importance scores and minimum total similarity scores and to use the selected features to construct a ranking model. Experimental results on the LETOR data sets have shown that GAS can achieve good ranking accuracy with a small number of features. Based on this work, several other filter based feature selection algorithms have been developed [16–19, 30].

Different from filter approach, the wrapper approach includes a rank learning algorithm in the feature subset evaluation step, where the ranking algorithm is used as a black box by a wrapper to evaluate the goodness (i.e. the ranking accuracy) of the selected features. Example algorithms include BRTree, which uses boosted regression trees [20], RankWrapper with Ranking SVM [21], BFS-Wrapper utilizing search [22], GreedyRankRLS with Rank RLS algorithm [23], and LMIR using smoothing language model [31].

Recently, the embedded approach has been suggested to solve feature selection for learning to rank, where feature selection and rank learning are integrated into a single process. (Note that some researchers categorize feature selection algorithms into two groups, where the embedded approach is included in the wrapper approach.) For example, Sun et al. [24] proposed an embedded feature selection ranking algorithm, termed RSRank, where an L1 regularization term was introduced into the ranking optimization. The experimental results on the OHSUMED and TD2003 data sets showed that RSRank outperformed several baseline rankers while selecting only thirty percent of the features. Following RSRank, many other embedded methods have emerged. Lai et al. suggested another L1 based ranking method, named FenchelRank [25], where Fenchel duality was used to solve the sparse ranking optimization. Empirical evaluations indicated that FenchelRank not only was better than the classical ranking algorithms but also performed better than RSRank. Following this work, Lai et al. further developed a new embedded feature selection algorithm for learning to rank, termed FSMRank [26]. The algorithm solved a joint convex optimization problem by simultaneously minimizing ranking error and conducting feature selection. Experiments on the LETOR collections demonstrated that FSMRank can obtain better results than filter approaches such as GAS. Different from the algorithms above, which all used convex L1 regularization, Laporte et al. designed a feature selection algorithm for learning to rank with a nonconvex regularization, which resulted in both good ranking accuracy and a small number of selected features [27]. Other embedded feature selection ranking algorithms can be found in [32, 33].

The algorithms mentioned above have shown the effectiveness of feature selection for learning to rank, and in this paper we continue this line of research by proposing a multiobjective evolutionary algorithm for ranking feature selection. Before giving the details of the proposed algorithm, it should be noted that multiobjective evolutionary algorithms (MOEAs) have recently been applied successfully to different problems in machine learning, such as classification [34–36], clustering [37, 38], and pattern mining [39]. In the following, we propose an MOEA for feature selection in learning to rank.

3. The Proposed Algorithm

The proposed algorithm (MOFSRank) is a multiobjective feature selection algorithm for learning to rank, in which Pairwise documents are used as the learning instances. To select a feature subset from a training set of size $O(n^2)$ ($n$ is the number of training data), we first choose some informative instance subsets from the Pairwise training set, and feature selection is then performed on those selected instance subsets. Lastly, the outputs of feature selection are combined to obtain a better feature subset. The main procedure of MOFSRank is shown in Figure 1; it consists of three phases: the instance selection phase, the feature selection phase, and the ensemble phase. In the first phase, a multiobjective evolutionary algorithm, termed MOIS, is suggested to select the informative instances from the original Pairwise training set, which has two advantages. First, it removes possibly noisy data from the original set and improves the quality of the training set. Second, instance selection reduces the number of training instances and makes feature selection more efficient. In the second phase, the final nondominated solutions of MOIS are used for feature selection. To this end, an MOEA for feature selection (MOFS) is proposed, where the ranking accuracy and the number of selected features are defined as the two optimization objectives. In addition, an adaptive mutation probability is designed in MOFS, by which the proposed method can choose features with high ranking accuracy and low redundancy. In the last phase, a mixed coding based multiobjective ensemble algorithm, namely MOEN, is developed, where the Pareto solutions of the second phase are used to produce a better feature subset as the final output. The framework of the proposed MOFSRank is given in Algorithm 1.

Algorithm 1: The framework of MOFSRank.
Input: $G_{IS}$: maximum generations of multi-objective instance selection, $N_{IS}$: population size of multi-objective instance selection, $pc_{IS}$: crossover probability of multi-objective instance selection, $pm_{IS}$: mutation probability of multi-objective instance selection,
$G_{FS}$: maximum generations of multi-objective feature selection, $N_{FS}$: population size of multi-objective feature selection, $pc_{FS}$: crossover probability of multi-objective feature selection, $pm_{FS}$: mutation probability of multi-objective feature selection,
$G_{EN}$: maximum generations of multi-objective ensemble, $N_{EN}$: population size of multi-objective ensemble, $pc_{EN}$: crossover probability of multi-objective ensemble, $pm_{EN}$: mutation probability of multi-objective ensemble;
Output: The final selected feature subset $F_{final}$;
$D \leftarrow$ Reading original Pairwise training data set;
/* Instance Selection Phase */
$IS \leftarrow$ MOIS($G_{IS}$, $N_{IS}$, $pc_{IS}$, $pm_{IS}$, $D$);
/* Feature Selection Phase */
($FS$, $RS$) $\leftarrow$ MOFS($G_{FS}$, $N_{FS}$, $pc_{FS}$, $pm_{FS}$, $IS$);
/* Ensemble Phase */
$F_{final} \leftarrow$ MOEN($G_{EN}$, $N_{EN}$, $pc_{EN}$, $pm_{EN}$, $FS$, $RS$);
Return $F_{final}$;

3.1. Instance Selection Phase

As mentioned before, in this paper we focus on Pairwise ranking, whose training set is of size $O(n^2)$, where $n$ is the number of training data. Thus, before feature selection, an instance selection operation is carried out on the Pairwise training set. To be specific, an MOEA named MOIS is proposed for instance selection, whose two optimization objectives are the number of selected instances and the value of $1 - RAccuracy$, where $RAccuracy$ denotes the accuracy value measured by ranking metrics such as NDCG or MAP. (In general, a larger value of $RAccuracy$ means better ranking performance; however, since a multiobjective optimization problem is usually described as a minimization problem, we use $1 - RAccuracy$ as the second objective.) Thus, the corresponding multiobjective instance selection problem (MOP1) can be described as

$$\min \mathbf{f}(S) = \big(|S|,\; 1 - RAccuracy(R_S)\big) \tag{1}$$

where $S$ denotes the selected instance subset, $|S|$ is the number of instances in $S$, and $R_S$ is a ranker learned on $S$. In this paper, we adopt linear SVM to create the ranker, which has been widely used in many feature selection algorithms for learning to rank, such as FenchelRank and FSMRank. $RAccuracy(R_S)$ is the ranking accuracy of the learned ranker on the original training set.

For MOP1, we use a binary encoding scheme, which means that the $i$-th individual (instance subset) can be represented as $ind_i = (b_{i1}, b_{i2}, \ldots, b_{im})$, where $b_{ij} \in \{0, 1\}$, $j = 1, 2, \ldots, m$, and $m$ is the total number of instances in the original training set. $b_{ij} = 1$ denotes that the $j$-th instance is selected in the $i$-th individual, and $b_{ij} = 0$ denotes that it is not. With this encoding scheme, the proposed MOIS adopts a framework similar to NSGA-II [40], and Algorithm 2 presents the procedure of MOIS in detail.
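
The binary encoding and the two minimization objectives can be illustrated with a short sketch. It is only an illustration under stated assumptions: `decode`, `objectives`, and the callables `train_ranker` and `accuracy` are hypothetical names, and the toy stand-ins at the bottom replace the linear SVM ranker and the ranking metric.

```python
def decode(individual, instances):
    """Return the instances whose bit is 1 in the binary individual."""
    return [x for bit, x in zip(individual, instances) if bit == 1]


def objectives(individual, instances, train_ranker, accuracy):
    """Evaluate the two minimization objectives of an instance subset.

    train_ranker(subset) and accuracy(ranker, data) are hypothetical
    callables supplied by the caller; accuracy must return a value in
    [0, 1], for example NDCG or MAP on the original training set.
    """
    subset = decode(individual, instances)
    ranker = train_ranker(subset)
    return (len(subset), 1.0 - accuracy(ranker, instances))


# Toy stand-ins (assumptions): the "ranker" is just the subset size and its
# "accuracy" is the fraction of instances that were kept.
toy_train = lambda subset: len(subset)
toy_accuracy = lambda ranker, data: ranker / len(data)

data = ["p1", "p2", "p3", "p4"]
result = objectives([1, 0, 1, 1], data, toy_train, toy_accuracy)
```

Both components are minimized, so an individual is preferred when it keeps fewer instances without lowering the accuracy of the ranker trained on them.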

Algorithm 2: MOIS.
Input: $G_{IS}$: maximum generations of multi-objective instance selection, $N_{IS}$: population size of multi-objective instance selection, $pc_{IS}$: crossover probability of multi-objective instance selection, $pm_{IS}$: mutation probability of multi-objective instance selection, $D$: original training data set;
Output: A set of non-dominated instance subsets $IS$;
Initializing the population $P$;
for $t = 1$ to $G_{IS}$ do
  Evaluating $P$ by two proposed objectives; // formula (1)
  $P' \leftarrow$ Binary Tournament Selection($P$);
  $O \leftarrow$ Variation($P'$, $pc_{IS}$, $pm_{IS}$);
  $P \leftarrow$ Environmental Selection($P \cup O$);
end for
$IS \leftarrow$ selecting the solutions on the Pareto front of $P$;
Return $IS$;
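
The last step of the procedure, selecting the solutions on the Pareto front, can be sketched as follows. This is a minimal illustration of Pareto dominance for minimization, not the NSGA-II nondominated sorting routine itself; the function names and the example objective vectors (number of instances, $1 - RAccuracy$) are assumptions.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical (subset size, 1 - accuracy) values of five individuals.
population = [(3, 0.25), (5, 0.20), (4, 0.30), (6, 0.20), (2, 0.40)]
front = pareto_front(population)
```

In this example, `(4, 0.30)` is dominated by `(3, 0.25)` and `(6, 0.20)` is dominated by `(5, 0.20)`, so only the other three vectors remain on the front.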

3.2. Feature Selection Phase

We take the nondominated solutions of MOIS as the training data sets, and feature selection is carried out on them. To this end, a biobjective evolutionary algorithm with an adaptive mutation for feature selection (MOFS) is suggested, whose two conflicting objectives are the number of features and the value of $1 - RAccuracy$. Thus, the biobjective optimization problem for feature selection (MOP2) is defined as

$$\min \mathbf{f}(F) = \big(|F|,\; 1 - RAccuracy(R_F)\big) \tag{2}$$

where $F$ denotes the selected feature subset and $|F|$ is the number of features in $F$. $R_F$ is the ranker learned with the features in $F$. Since each $F$ is evaluated on all the Pareto instance subsets, we choose the largest $RAccuracy$ value as the value of $RAccuracy(R_F)$.
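
The evaluation of one feature subset over several instance subsets can be sketched as below. This is a hedged illustration only: `evaluate_feature_subset` and the callables `train_ranker` and `accuracy` are hypothetical names, and the dictionary of accuracies is made-up toy data standing in for a real ranker and metric.

```python
def evaluate_feature_subset(individual, instance_subsets, train_ranker, accuracy):
    """Return (number of features, 1 - best accuracy, best ranker).

    individual:       binary list, 1 means the feature is selected
    instance_subsets: the nondominated instance subsets used for training
    train_ranker(subset, individual) and accuracy(ranker) are hypothetical
    callables supplied by the caller.
    """
    best_acc, best_ranker = -1.0, None
    for subset in instance_subsets:
        ranker = train_ranker(subset, individual)
        acc = accuracy(ranker)
        if acc > best_acc:  # keep the ranker with the largest accuracy
            best_acc, best_ranker = acc, ranker
    return sum(individual), 1.0 - best_acc, best_ranker


# Toy stand-ins (assumptions): the "ranker" is the name of the instance
# subset it was trained on, and its accuracy is looked up in a table.
toy_accuracy_table = {"s1": 0.5, "s2": 0.75, "s3": 0.25}
n_features, second_objective, ranker = evaluate_feature_subset(
    [1, 0, 1],
    ["s1", "s2", "s3"],
    lambda subset, individual: subset,
    lambda r: toy_accuracy_table[r],
)
```

Keeping the ranker with the largest accuracy is the same as keeping the one with the smallest value of $1 - RAccuracy$, which is how the choice is expressed in the objective.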

We also use the binary encoding scheme for MOP2. Thus, the $i$-th individual (feature subset) in the population is represented as $ind_i = (b_{i1}, b_{i2}, \ldots, b_{id})$, where $b_{ij} \in \{0, 1\}$, $j = 1, 2, \ldots, d$, and $d$ is the total number of features in the original training data set. $b_{ij} = 1$ denotes that the $j$-th feature is included in the $i$-th individual, and $b_{ij} = 0$ denotes that it is not. With the binary encoding strategy, we solve MOP2 by adopting a framework similar to NSGA-II. To further improve the performance of MOFS, an adaptive mutation strategy is also suggested. Its basic idea comes from the intuition that, during mutation, the important features should have a greater probability of being selected, whereas the redundant features should have a greater probability of being removed. The suggested adaptive mutation probability (formula (3)) is computed for the $k$-th bit of an individual from the current value of that bit, the basic mutation probability used in NSGA-II, and a decaying factor that depends on the number of the current generation, together with the importance degree and the redundancy degree of the $k$-th feature in the individual (formulas (5) and (6)). The importance degree is based on the ranking accuracy obtained by the $k$-th feature alone on the original training set, and the redundancy degree is based on Pearson's correlation coefficient between the $k$-th feature and each other feature in the individual. By using the adaptive mutation strategy, we can select the features with high ranking accuracy and low redundancy. The whole procedure of MOFS is presented in Algorithm 3.

Algorithm 3: MOFS.
Input: $G_{FS}$: maximum generations of multi-objective feature selection, $N_{FS}$: population size of multi-objective feature selection, $pc_{FS}$: crossover probability of multi-objective feature selection, $pm_{FS}$: mutation probability of multi-objective feature selection, $IS$: a set of non-dominated instance subsets;
Output: a set of non-dominated feature subsets $FS$, and their corresponding rankers set $RS$;
Initializing the population $P$;
for $t = 1$ to $G_{FS}$ do
  /* Evaluating $P$ by two proposed objectives with formula (2) */
  for $i = 1$ to $N_{FS}$ do
    $f_1(i) \leftarrow$ calculating the number of non-zero features in $ind_i$; // the first objective value of $i$-th individual
    $R_i \leftarrow$ select the ranker with the smallest value of $1 - RAccuracy$ on the $IS$ as the ranker of individual $ind_i$;
    $f_2(i) \leftarrow 1 - RAccuracy(R_i)$; // the second objective value of $i$-th individual
  end for
  $P' \leftarrow$ Binary Tournament Selection($P$);
  calculating the adaptive mutation probability with formulas (3), (5) and (6);
  $O \leftarrow$ Variation($P'$, $pc_{FS}$, adaptive mutation probability);
  $P \leftarrow$ Environmental Selection($P \cup O$);
end for
$FS \leftarrow$ selecting the solutions on the Pareto front of $P$;
$RS \leftarrow$ the corresponding ranker set of $FS$;
Return $FS$ and $RS$;
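
The intuition behind the adaptive mutation step can be sketched as follows. This is explicitly not the paper's formulas (3), (5), and (6): the way the importance and redundancy degrees enter the probability, the mean absolute Pearson correlation used as redundancy, and all function and parameter names are assumptions chosen only to show important features being switched on more readily and redundant features being switched off more readily.

```python
import math
import random


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def redundancy(k, individual, columns):
    """ASSUMED form: mean |Pearson| between feature k and the other selected features."""
    others = [j for j, bit in enumerate(individual) if bit == 1 and j != k]
    if not others:
        return 0.0
    return sum(abs(pearson(columns[k], columns[j])) for j in others) / len(others)


def adaptive_probability(k, individual, columns, importance, pm, decay):
    """ASSUMED rule: scale the basic probability pm by importance or redundancy."""
    if individual[k] == 0:
        boost = importance[k]  # unselected but important: flip to 1 more often
    else:
        boost = redundancy(k, individual, columns)  # selected but redundant: flip to 0 more often
    return min(1.0, pm * (1.0 + decay * boost))


def mutate(individual, columns, importance, pm, decay, rng=random):
    """Bit-flip mutation with a separate probability for every bit."""
    probs = [adaptive_probability(k, individual, columns, importance, pm, decay)
             for k in range(len(individual))]
    return [1 - bit if rng.random() < p else bit
            for bit, p in zip(individual, probs)]


# Toy data (assumptions): three feature columns, the first two perfectly correlated.
columns = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [4.0, 1.0, 3.0, 2.0]]
importance = [0.2, 0.2, 0.9]  # hypothetical single-feature ranking accuracies
child = mutate([1, 1, 0], columns, importance, pm=0.1, decay=1.0, rng=random.Random(0))
```

All probabilities are computed from the parent before any bit is flipped, so the order in which the bits are visited does not change the result.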

3.3. Ensemble Phase

After the second phase, a set of nondominated solutions (feature subsets) is obtained. To produce a better final feature subset, a biobjective ensemble algorithm, named MOEN, is proposed, whose two optimization objectives are the number of selected features and the value of 1 − RAccuracy with the selected features. The basic idea of MOEN is that a better feature subset can be achieved by combining these nondominated solutions in a weighted manner. To this end, a mixed coding strategy is developed in MOEN, which consists of two parts. The first part uses binary encoding; its length is the number of different features in the nondominated solutions of MOFS, and the $j$-th bit corresponds to the $j$-th feature: a bit of 1 means that the feature is selected, and 0 means that it is not. The second part uses real encoding and contains one sub-part for each Pareto solution of the second phase. Figure 2 provides an example to illustrate the suggested mixed encoding scheme in detail.

In Figure 2, there is an individual ind. The first part of ind has 4 bits, which means that there are 4 different features in the nondominated solutions of MOFS. The second part consists of 3 sub-parts, which indicates that the number of feature subsets is 3; the $i$-th sub-part denotes the ensemble weight for the $i$-th feature subset. During the optimization, we need to calculate the two objectives of the individual ind. The value of the first objective (the number of selected features) can be easily obtained from part1 of ind. To get the value of the second objective (the value of 1 − RAccuracy of the selected features), we first need the ensemble ranker corresponding to ind. To this end, we use the nondominated solutions of the second phase and the weights in part2 of ind. To be specific, each component of the ensemble ranker is obtained by formula (7), which involves an indicator function that returns 1 if the $j$-th bit in part1 is 1 and 0 otherwise, the value of the $j$-th bit in each nondominated feature subset, the corresponding sub-part of part2, and the value of the $j$-th bit in the ranker that corresponds to that feature subset in the output of Algorithm 3.

In the following, we also take the individual in Figure 2 as an example, and show how to obtain the ensemble ranker - of in detail. Firstly, let us assume that , , and are corresponding rankers of to . Then from the part2 of ind, we have , , and . Thus the ensemble ranker -, where

With the ensemble ranker, we can calculate the second objective of each individual in the population and obtain a new set of Pareto solutions by solving the biobjective ensemble problem. From the Pareto feature subsets, we choose the one with the minimal value of $1 - RAccuracy$ as the final output. The whole multiobjective ensemble (MOEN) algorithm is shown in Algorithm 4.

Algorithm 4: MOEN.
Input: $G_{EN}$: maximum generations of multi-objective ensemble, $N_{EN}$: population size of multi-objective ensemble, $pc_{EN}$: crossover probability of multi-objective ensemble, $pm_{EN}$: mutation probability of multi-objective ensemble, $FS$: a set of non-dominated feature subsets, $RS$: the set of corresponding rankers;
Output: the final feature subset $F_{final}$;
Initializing the population $P$;
for $t = 1$ to $G_{EN}$ do
  /* Evaluating $P$ by two proposed objectives with formula (2) */
  for $i = 1$ to $N_{EN}$ do
    $f_1(i) \leftarrow$ calculating the number of non-zero features in $ind_i$; // the first objective value of $i$-th individual
    $R_i \leftarrow$ constructing the ranker with formula (7); // corresponding to the $i$-th individual
    $f_2(i) \leftarrow 1 - RAccuracy(R_i)$; // the second objective value of $i$-th individual
  end for
  $P' \leftarrow$ Binary Tournament Selection($P$);
  $O \leftarrow$ Variation($P'$, $pc_{EN}$, $pm_{EN}$);
  $P \leftarrow$ Environmental Selection($P \cup O$);
end for
$PF \leftarrow$ selecting the solutions on the Pareto front of $P$;
$F_{final} \leftarrow$ selecting the solution from $PF$ with the minimal value of $1 - RAccuracy$;
Return $F_{final}$;
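
The mixed coding can be illustrated with a small sketch of how an individual might be turned into an ensemble ranker. The combination rule used here, a weighted sum of linear ranker weight vectors masked by the binary part, is an assumption made for illustration and is not claimed to reproduce formula (7); the function name and all numbers are hypothetical.

```python
def ensemble_ranker(part1, weights, rankers):
    """Combine linear rankers into one weight vector.

    part1:   binary list, part1[j] == 1 keeps feature j
    weights: one real-valued ensemble weight per ranker (the real-coded part)
    rankers: list of weight vectors, each with len(part1) entries
    ASSUMED combination: weighted sum of the rankers, zeroed where part1 is 0.
    """
    combined = []
    for j, bit in enumerate(part1):
        total = sum(a * r[j] for a, r in zip(weights, rankers))
        combined.append(total if bit == 1 else 0.0)
    return combined


# An individual with 4 feature bits and 3 ensemble weights (illustrative values).
part1 = [1, 0, 1, 1]
weights = [0.5, 0.25, 0.25]
rankers = [
    [2.0, 4.0, 0.0, 0.0],
    [0.0, 4.0, 8.0, 0.0],
    [4.0, 0.0, 0.0, 8.0],
]
ens = ensemble_ranker(part1, weights, rankers)
n_selected = sum(part1)  # first objective: number of selected features
```

The first objective is read directly from the binary part, while the second objective would be obtained by measuring the ranking accuracy of the combined ranker.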

4. Experiments

In this section, we empirically verify the performance of the proposed MOFSRank by comparing it with several state-of-the-art ranking algorithms. To be specific, we first present the experimental setting (including the data sets, comparison algorithms, and evaluation measures) and then report the comparison results between the proposed algorithm and the baselines (including the classical ranking algorithms and the representative feature selection algorithms for learning to rank). Lastly, we discuss the effectiveness of the suggested strategies in MOFSRank.

4.1. Experiment Setting

4.1.1. Data Sets

We conduct our experiments on the publicly available LETOR data collections [41], which are considered the benchmark data sets in learning to rank. We select four data sets (NP2004, HP2004, TD2004, and OHSUMED) from LETOR 3.0 and one data set (MQ2008) from LETOR 4.0. Among them, OHSUMED is a three-level ranking set, while the others are all bilevel data sets. The detailed characteristics of these data sets are given in Table 1.

It should be noted that in the LETOR collections, each data set is divided into five folds, and each fold contains a training set, a validation set, and a test set. In the following experiments, we adopt the same splits as LETOR provides and report the results averaged over the five folds.

4.1.2. Comparison Algorithms

The comparison algorithms used in this paper can be divided into two categories. The first group consists of the classical ranking algorithms provided by LETOR. We select RankSVM-Primal [42], RankSVM-Struct [43], ListNet [8], and AdaRank-NDCG [11] as the comparison algorithms, among which the former two belong to the Pairwise approach, while the latter two optimize Listwise loss functions. The second group consists of the recently suggested feature selection algorithms for learning to rank, which include FenchelRank [25], FSMRank [26], and the nonconvex regularization feature selection method for learning to rank proposed by Laporte et al. [27]. It is worth noting that the authors of [27] presented three algorithms; we choose the one with the best mean performance on the LETOR data sets.

For fair comparison, we adopt for all comparison algorithms the recommended parameter values suggested by the authors in their original papers. For the proposed MOFSRank, since it is composed of three sub-MOEAs (MOIS, MOFS, and MOEN), we need to set parameters for each sub-MOEA. The population sizes, crossover probabilities, and mutation probabilities of the three sub-MOEAs are set to , , , where is the length of the individual in the sub-MOEAs. The maximum numbers of generations for MOIS, MOFS, and MOEN are set to , , and , respectively. For RAccuracy used in the second objective of each sub-MOEA, we adopt NDCG@10, which is a popular criterion to measure the accuracy of a ranking algorithm; we discuss this criterion in detail in the next section.

4.1.3. Evaluation Measures

On the data sets above (NP2004, HP2004, TD2004, MQ2008, and OHSUMED), we compare the proposed MOFSRank with several baselines, and the results of the different algorithms are reported in terms of NDCG [44] and MAP [45], the two most widely used metrics in learning to rank. NDCG (Normalized Discounted Cumulative Gain) is often used in the case of multilevel relevance judgments. For a query, the DCG score at position $k$ is formally defined as

$$DCG@k = \sum_{i=1}^{k} \frac{2^{r_i} - 1}{\log_2(i + 1)} \tag{9}$$

where $r_i$ is the relevance label of the $i$-th document in the sorted list. Then the normalized DCG score at position $k$ in the ranking list of documents can be calculated as follows:

$$NDCG@k = \frac{DCG@k}{Z_k} \tag{10}$$

where $Z_k$ is the normalization constant chosen so that the value of NDCG ranges from 0 to 1. In the rest of this paper, we use N@k as the abbreviation of NDCG@k.
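
As a quick check of how NDCG behaves, the following sketch computes DCG@k and NDCG@k for a ranked list of relevance labels. It assumes the commonly used gain $2^{r} - 1$ with a $\log_2$ position discount and takes the DCG of the ideally sorted list as the normalization constant; the function names are hypothetical, and other DCG variants exist.

```python
import math


def dcg_at_k(relevances, k):
    """DCG of the top-k documents; relevances are given in ranked order."""
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([2, 1, 0], 3)  # labels already in the ideal order
swapped = ndcg_at_k([0, 1], 2)     # the relevant document is ranked second
```

A list that is already sorted by relevance reaches the maximum value of 1, and moving a relevant document down the list lowers the score because its gain is divided by a larger discount.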

Another evaluation metric is MAP (Mean Average Precision), which deals with binary relevance judgments: relevant and irrelevant. First, we introduce the definition of precision at $k$, which denotes the proportion of relevant documents in the top $k$ positions:

$$P@k = \frac{1}{k} \sum_{j=1}^{k} l_j \tag{11}$$

where $l_j$ is an indicator function: if the document at position $j$ is relevant, $l_j = 1$; otherwise $l_j = 0$. Then the average precision of a given query $q$ is defined as follows:

$$AP(q) = \frac{1}{N_{rel}} \sum_{k=1}^{N} P@k \cdot l_k \tag{12}$$

where $N$ and $N_{rel}$ represent the total number of documents and the number of relevant documents associated with query $q$, respectively. Based on (11) and (12), MAP can be formally defined as

$$MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q) \tag{13}$$

where $Q$ is the set of all queries.
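
The three quantities can be computed directly from binary relevance lists, as in the minimal sketch below. The function names are hypothetical, each list holds the 0/1 relevance indicators of one query in ranked order, and a query without relevant documents is given an average precision of 0 by assumption.

```python
def precision_at_k(rel, k):
    """Proportion of relevant documents among the top k positions."""
    return sum(rel[:k]) / k


def average_precision(rel):
    """Mean of P@k over the positions k that hold a relevant document."""
    n_rel = sum(rel)
    if n_rel == 0:
        return 0.0  # assumption: no relevant documents gives AP = 0
    total = sum(precision_at_k(rel, k) * rel[k - 1]
                for k in range(1, len(rel) + 1))
    return total / n_rel


def mean_average_precision(queries):
    """Average of AP over all queries."""
    return sum(average_precision(rel) for rel in queries) / len(queries)


# Two toy queries: relevant documents at ranks 1 and 3, and at rank 2.
ap_first = average_precision([1, 0, 1, 0])
ap_second = average_precision([0, 1])
map_value = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```

For the first query, the relevant documents sit at ranks 1 and 3, so the average precision is $(1/1 + 2/3)/2$; for the second query it is $1/2$.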

4.2. Experimental Results and Analysis

4.2.1. Comparison Results between MOFSRank and Classical Ranking Algorithms

In the first part of the experiments, we compare our method with several classical ranking algorithms, none of which uses feature selection. Specifically, we compare MOFSRank with RankSVM-Primal, RankSVM-Struct, ListNet, and AdaRank-NDCG on the five LETOR data sets. Table 2 presents the performance of the different algorithms, averaged over the five folds.

From Table 2, we can see that on all data sets the proposed algorithm performs significantly better than the existing classical ranking methods. The comparison results show that MOFSRank achieves the best ranking accuracy on 53 of 55 statistical points, which demonstrates the superiority of MOFSRank on the LETOR data sets and indicates the effectiveness of feature selection for learning to rank.

4.2.2. Comparison Results between MOFSRank and Feature Selection Algorithms for Learning to Rank

In the second part of the experiments, we are interested in how MOFSRank performs when compared with other feature selection baselines for learning to rank. To this end, we report the comparison results between the proposed algorithm and FenchelRank, FSMRank, and the nonconvex regularization method of [27], which are all recently suggested ranking feature selection methods with good performance. Tables 3 and 4 give the ranking accuracy and the number of selected features of the different algorithms on the LETOR data sets, averaged over the five folds.

It can be observed from Table 3 that the proposed MOFSRank achieves the highest ranking values on most statistical points and clearly outperforms the existing feature selection baselines for learning to rank. Here, we present a few statistics on the different data sets in terms of N@10. On the NP2004, HP2004, and MQ2008 data sets, MOFSRank obtains NDCG values of 0.8543, 0.8622, and 0.2406. Compared with the second best algorithm (FSMRank), its performance increases by 3.2%, 2.9%, and 3.7%, respectively. On the TD2004 data set, the N@10 value of MOFSRank is 0.3560, an 11.1% improvement over the second best algorithm (the nonconvex regularization method of [27]). Similarly, the improvement of MOFSRank on the OHSUMED data set is 0.1% over the second best algorithm, FenchelRank.

Table 4 presents the number of features selected by the different algorithms on the LETOR data sets, averaged over the five folds. From the table, we can see that on the NP2004, HP2004, TD2004, and OHSUMED data sets, MOFSRank selects far fewer features than the other baselines. On the MQ2008 data set, the proposed algorithm achieves the second best performance, with a number of selected features slightly larger than that of the nonconvex feature selection algorithm of [27]. The statistics in Tables 3 and 4 demonstrate the competitiveness of MOFSRank when compared with other feature selection algorithms for learning to rank.

To further investigate the performance of the different feature selection algorithms on the LETOR data sets, we report in detail the value of N@10 (y-axis) with respect to the number of selected features (x-axis), and the results are plotted in Figure 3. Note that since the three feature selection baselines cannot directly select a given number of features, we adopt the strategy used in [26], which chooses a given number of top-ranked features from the whole feature set. From the figures, we can see that although the NDCG accuracy of the different algorithms varies with the number of selected features, MOFSRank always achieves the best trade-off between accuracy and number of selected features, which indicates the superior performance of the proposed method.

Figure 3: N@10 (y-axis) versus the number of selected features (x-axis) on (a) NP2004, (b) HP2004, (c) TD2004, (d) MQ2008, and (e) OHSUMED.
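The top-feature strategy mentioned above can be sketched as follows, under the assumption (not stated in this section) that each baseline yields one learned weight per feature and that features are ranked by the absolute value of that weight. The function name and the example weights are hypothetical.

```python
def top_k_features(weights, k):
    """Return the indices of the k features with the largest absolute weight.

    Assumption: `weights` holds one learned coefficient per feature, and a
    larger magnitude means a more important feature.
    """
    order = sorted(range(len(weights)),
                   key=lambda i: abs(weights[i]),
                   reverse=True)
    # Report the kept indices in ascending order for readability.
    return sorted(order[:k])


# Hypothetical weight vector for a six-feature ranking model.
example_weights = [0.10, -0.90, 0.00, 0.50, -0.05, 0.30]
selected = top_k_features(example_weights, 3)
```

Sweeping k over a range and retraining on the kept features gives one accuracy value per feature count, which is the kind of curve plotted in Figure 3.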

4.3. Effectiveness of the Suggested Strategies in MOFSRank

As mentioned before, MOFSRank incorporates three strategies (instance selection, adaptive mutation, and Pareto-based ensemble). In the following, we empirically investigate the influence of each of these strategies on the performance of MOFSRank on the LETOR data sets.

4.3.1. Effectiveness of the Instance Selection Strategy

In the first phase of MOFSRank, an instance selection strategy is suggested, which reduces the number of training instances and improves the performance of MOFSRank. To verify this, we compare the proposed algorithm with MOFSRank-NonIS, which is identical to MOFSRank except that it omits the instance selection strategy and uses the original Pairwise instances in the training set. The comparison results on the LETOR data sets are presented from two aspects. First, Table 5 lists the training instances actually used by the two algorithms, where Ins of MOFSRank and Ins of MOFSRank-NonIS denote the numbers of training instances actually used by MOFSRank and MOFSRank-NonIS, respectively. Table 5 shows that, on all the LETOR data sets, the suggested instance selection strategy greatly reduces the number of training instances, especially on the data sets with hundreds of thousands of training instances (TD2004 and OHSUMED), where the ratios of selected instances are only 0.04 and 0.03.
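To illustrate why the original Pairwise training sets grow so large, the sketch below builds preference pairs for one query in the usual RankSVM-style manner: every two documents with different relevance labels yield one instance, so the count grows roughly quadratically with the number of documents per query. This is a generic construction assumed for illustration, not the authors' instance selection procedure; the function names are hypothetical, and the ratio helper simply mirrors the ratios reported for Table 5.

```python
from itertools import combinations


def pairwise_instances(features, labels):
    """Build preference pairs for the documents of a single query.

    Each pair of documents with different labels yields one instance:
    (feature difference, +1 if the first document is more relevant else -1).
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue  # equally relevant documents carry no preference
        diff = [a - b for a, b in zip(features[i], features[j])]
        pairs.append((diff, 1 if labels[i] > labels[j] else -1))
    return pairs


def selection_ratio(n_selected, n_original):
    """Fraction of the original pairwise instances kept after selection."""
    return n_selected / n_original


# Toy query with three documents and two features each.
toy_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_labels = [2, 0, 0]
toy_pairs = pairwise_instances(toy_features, toy_labels)
```

In the toy query only the two pairs involving the relevant document survive, because the two non-relevant documents share a label.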

Second, we take N@10 as the ranking measure and plot the final nondominated solutions obtained by MOFSRank and MOFSRank-NonIS in objective space in Figure 4. Note that, due to space limitations, in the following experiments we only report the results on one LETOR 3.0 data set (NP2004) and one LETOR 4.0 data set (MQ2008); the results on the other LETOR data sets are similar. As can be seen from Figure 4, on both data sets MOFSRank obtains better nondominated solutions than MOFSRank-NonIS, which demonstrates the effectiveness of the suggested instance selection strategy.

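Figure 4: Final nondominated solutions obtained by MOFSRank and MOFSRank-NonIS in objective space on (a) NP2004 and (b) MQ2008.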
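The notion of nondominated solutions used in this comparison can be sketched as follows. The code assumes that every objective is to be minimized (for instance, a ranking error such as 1 - N@10 together with a second cost), which is an assumption about the axes of Figure 4; the coverage helper is one simple way to quantify which of two fronts is better and is not a measure reported in the paper.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (all objectives are assumed to be minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def nondominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def coverage(front_a, front_b):
    """Fraction of `front_b` dominated by at least one point of `front_a`."""
    covered = sum(1 for b in front_b if any(dominates(a, b) for a in front_a))
    return covered / len(front_b)


# Hypothetical objective vectors (both objectives minimized).
candidates = [(1, 5), (2, 3), (3, 4), (4, 1)]
front = nondominated(candidates)
```

Here (3, 4) is removed because (2, 3) is better in both objectives; the remaining three points are mutually incomparable and form the front.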

4.3.2. Effectiveness of the Adaptive Mutation Strategy

In the second phase of MOFSRank, an adaptive mutation strategy is developed to enhance the performance of MOFSRank. To confirm its effect, we compare the proposed method with MOFSRank-NonAM, in which the adaptive mutation strategy is removed from the original MOFSRank. The final nondominated solutions obtained by MOFSRank and MOFSRank-NonAM in objective space on the LETOR data sets are plotted in Figure 5. Compared with MOFSRank-NonAM, MOFSRank achieves better nondominated solutions on the experimental data sets, which indicates the effectiveness of the adaptive mutation strategy.

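Figure 5: Final nondominated solutions obtained by MOFSRank and MOFSRank-NonAM in objective space on (a) NP2004 and (b) MQ2008.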

4.3.3. Effectiveness of Pareto Based Ensemble Strategy

In the third phase of MOFSRank, a Pareto-based ensemble strategy is suggested to obtain a better feature subset by combining the Pareto solutions of the second phase. To verify the effectiveness of this ensemble strategy, we compare MOFSRank with MOFSRank-NonPE; the only difference between them is that MOFSRank-NonPE does not include the Pareto-based ensemble operation. The experimental results of the two algorithms on the LETOR data sets are plotted in Figure 6, which clearly shows that, with the suggested ensemble strategy, the proposed algorithm achieves better nondominated solutions than MOFSRank-NonPE. This demonstrates the effectiveness of the suggested Pareto-based ensemble strategy.

Figure 6: Nondominated solutions obtained by MOFSRank and MOFSRank-NonPE on (a) NP2004 and (b) MQ2008.

5. Conclusion

In this paper, we have proposed a multiobjective evolutionary algorithm, termed MOFSRank, for feature selection in ranking. In MOFSRank, an MOEA for instance selection (MOIS) has been suggested, which chooses the informative instances from the original training set and thereby makes the subsequent feature selection more effective and efficient. Then a multiobjective feature selection (MOFS) algorithm with an adaptive mutation has been performed on the chosen instance subsets to obtain features with high ranking accuracy and low redundancy. Finally, a multiobjective ensemble (MOEN) algorithm has been developed to integrate the Pareto solutions of MOFS, by which the performance of MOFSRank can be further improved. Experimental results on the LETOR data sets have demonstrated the competitiveness of the proposed algorithm.

Some interesting work related to MOFSRank remains to be investigated. The proposed MOFSRank has shown that MOEAs are a promising way to solve feature selection for learning to rank; in this paper, we mainly focus on the Pairwise ranking approach. In the future, we plan to design feature selection algorithms for other types of learning to rank approaches, such as the Listwise approach. In addition, MOFSRank adopts NSGA-II as its framework; it would also be interesting to combine the proposed method with other MOEA frameworks, such as MOEA/D [46], SPEA2 [47], and AR-MOEA [48].

Data Availability

The data used to support the findings of our study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61672033, 61502004, 61502001, and 61502012), the Natural Science Foundation of Anhui Province (1708085MF166), Humanities and Social Sciences Project of Chinese Ministry of Education (Grant no. 18YJC870004), and the Key Program of Natural Science Project of Educational Commission of Anhui Province (KJ2017A013).

References

T.-Y. Liu, “Learning to rank for Information retrieval,” Foundations and Trends in Information Retrieval, vol. 3, no. 3, pp. 225–231, 2009.
Y. Shi, M. Larson, and A. Hanjalic, “Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges,” ACM Computing Surveys, vol. 47, no. 1, 2014.
C. Moreira, P. Calado, and B. Martins, “Learning to rank academic experts in the DBLP dataset,” Expert Systems with Applications, vol. 32, no. 4, pp. 477–493, 2015.
D. Cossock and T. Zhang, “Subset ranking using regression,” in Proceedings of the Conference on Learning Theory, pp. 605–619, 2006.
T. Joachims, “Optimizing search engines using clickthrough data,” in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133–142, July 2002.
Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, “An efficient boosting algorithm for combining preferences,” Journal of Machine Learning Research, vol. 4, no. 6, pp. 933–969, 2003.
C. Burges, T. Shaked, E. Renshaw et al., “Learning to rank using gradient descent,” in Proceedings of the 22nd International Conference on Machine Learning (ICML '05), pp. 89–96, ACM, August 2005.
Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li, “Learning to rank: from pairwise approach to listwise approach,” in Proceedings of the 24th International Conference on Machine Learning (ICML '07), pp. 129–136, ACM, Corvallis, Ore, USA, June 2007.
F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li, “Listwise approach to learning to rank - Theory and algorithm,” in Proceedings of the International Conference on Machine Learning, pp. 1192–1199, 2008.
Y. Yue, T. Finley, F. Radlinski, and T. Joachims, “A support vector method for optimizing average precision,” in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 271–278, 2007.
J. Xu and H. Li, “AdaRank: a boosting algorithm for information retrieval,” in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 391–398, 2007.
J. Weston, S. Bengio, and N. Usunier, “Large scale image annotation: learning to rank with joint word-image embeddings,” Machine Learning, vol. 81, no. 1, pp. 21–35, 2010.
J. Yu, D. Tao, M. Wang, and Y. Rui, “Learning to Rank Using User Clicks and Visual Features for Image Retrieval,” IEEE Transactions on Cybernetics, vol. 45, no. 4, pp. 767–779, 2015.
R. Leaman, R. I. Doğan, and Z. Lu, “DNorm: disease name normalization with pairwise learning to rank,” Bioinformatics, vol. 29, no. 22, pp. 2909–2917, 2013.
X. Geng, T.-Y. Liu, T. Qin, and H. Li, “Feature selection for ranking,” in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 407–414, 2007.
G. Hua, M. Zhang, Y. Liu, S. Ma, and L. Ru, “Hierarchical feature selection for ranking,” in Proceedings of the International Conference on World Wide Web, WWW 2010, pp. 1113-1114, Raleigh, North Carolina, USA, 2010.
K. D. Naini and I. S. Altingovde, “Exploiting Result Diversification Methods for Feature Selection in Learning to Rank,” in Proceedings of the European Conference on Information Retrieval, pp. 455–461, 2014.
M. B. Shirzad and M. R. Keyvanpour, “A feature selection method based on minimum redundancy maximum relevance for learning to rank,” in Proceedings of the Ai & Robotics, pp. 1–5, 2015.
A. Gigli, C. Lucchese, F. M. Nardini, and R. Perego, “Fast feature selection for learning to rank,” in Proceedings of the International Conference on the Theory of Information Retrieval, pp. 167–170, 2016.
F. Pan, T. Converse, D. Ahn, F. Salvetti, and G. Donato, “Feature selection for ranking using boosted trees,” in Proceedings of the ACM Conference on Information and Knowledge Management, pp. 2025–2028, 2009.
H. Yu, J. Oh, and W. Han, “Efficient feature weighting methods for ranking,” in Proceedings of the ACM Conference on Information and Knowledge Management, pp. 1157–1166, 2009.
V. Dang and B. Croft, “Feature selection for document ranking using best first search and coordinate ascent,” in Proceedings of the SIGIR Workshop on Feature Generation and Selection for Information Retrieval, pp. 1–5, 2010.
T. Pahikkala, A. Airola, P. Naula, and T. Salakoski, “Greedy rankrls: a linear time algorithm for learning sparse ranking models,” in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 11–18, 2010.
Z. Sun, T. Qin, Q. Tao, and J. Wang, “Robust sparse rank learning for non-smooth ranking measures,” in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 259–266, 2009.
H. Lai, Y. Pan, C. Liu, L. Lin, and J. Wu, “Sparse learning-to-rank via an efficient primal-dual algorithm,” Institute of Electrical and Electronics Engineers. Transactions on Computers, vol. 62, no. 6, pp. 1221–1233, 2013.
H.-J. Lai, Y. Pan, Y. Tang, and R. Yu, “FSMRank: Feature selection algorithm for learning to rank,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 6, pp. 940–952, 2013.
L. Laporte, R. Flamary, S. Canu, S. Dejean, and J. Mothe, “Nonconvex regularizations for feature selection in ranking with sparse SVM,” IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 6, pp. 1118–1130, 2014.
P. Li, C. J. C. Burges, and Q. Wu, “Mcrank: learning to rank using multiple classification and gradient boosting,” in Proceedings of the International Conference on Neural Information Processing Systems, pp. 897–904, 2007.
M. B. Shirzad and M. R. Keyvanpour, “A Systematic Study of Feature Selection Methods for Learning to Rank Algorithms,” International Journal of Information Retrieval Research, vol. 8, no. 3, pp. 46–67, 2018.
X. Han and S. Lei, “Feature selection and model comparison on microsoft learning-to-rank data sets,” https://arxiv.org/abs/1803.05127, 2018.
Y. Lin, H. Lin, K. Xu, and X. Sun, “Learning to rank using smoothing methods for language modeling,” Journal of the Association for Information Science and Technology, vol. 64, no. 4, pp. 818–828, 2013.
D. X. Sousa, S. D. Canuto, T. C. Rosa, W. S. Martins, and M. A. Gonçalves, “Incorporating Risk-Sensitiveness into Feature Selection for Learning to Rank,” in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 257–266, ACM, 2016.
L. Du, Y. Pan, J. Ding, H. Lai, and C. Huang, “EGRank: an exponentiated gradient algorithm for sparse learning-to-rank,” Information Sciences, vol. 467, pp. 342–356, 2018.
Z. Wang, M. Li, and J. Li, “A multi-objective evolutionary algorithm for feature selection based on mutual information with a new redundancy measure,” Information Sciences, vol. 307, pp. 73–88, 2015.
J. Lee, W. Seo, and D. W. Kim, “Effective Evolutionary Multilabel Feature Selection under a Budget Constraint,” Complexity, vol. 2018, Article ID 3241489, 14 pages, 2018.
G. Acampora, F. Herrera, G. Tortora, and A. Vitiello, “A multi-objective evolutionary approach to training set selection for support vector machine,” Knowledge-Based Systems, vol. 147, pp. 94–108, 2018.
W. Ying, Y. Xie, Y. Wu, B. Wu, S. Chen, and W. He, “Universal partially evolved parallelization of MOEA/D for multi-objective optimization on message-passing clusters,” Soft Computing, vol. 21, no. 18, pp. 5399–5412, 2017.
X. Zhang, Y. Tian, R. Cheng, and Y. Jin, “A Decision Variable Clustering-Based Evolutionary Algorithm for Large-Scale Many-Objective Optimization,” IEEE Transactions on Evolutionary Computation, vol. 22, no. 1, pp. 97–112, 2018.
X. Zhang, F. Duan, L. Zhang, F. Cheng, Y. Jin, and K. Tang, “Pattern Recommendation in Task-oriented Applications: A Multi-Objective Perspective,” IEEE Computational Intelligence Magazine, vol. 12, no. 3, pp. 43–53, 2017.
K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
T. Qin, T.-Y. Liu, J. Xu, and H. Li, “LETOR: A benchmark collection for research on learning to rank for information retrieval,” Information Retrieval, vol. 13, no. 4, pp. 346–374, 2010.
O. Chapelle and S. S. Keerthi, “Efficient algorithms for ranking with SVMs,” Information Retrieval, vol. 13, no. 3, pp. 201–215, 2010.
T. Joachims, “Training linear SVMs in linear time,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217–226, 2006.
K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of IR techniques,” ACM Transactions on Information Systems, vol. 20, no. 4, pp. 422–446, 2002.
R. Baeza-Yates, B. Ribeiro-Neto et al., “Modern information retrieval,” ACM, vol. 43, no. 1, pp. 26–28, 1999.
H. B. Nguyen, B. Xue, H. Ishibuchi, P. Andreae, and M. Zhang, “Multiple reference points MOEA/D for feature selection,” in Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 157-158, Berlin, Germany, 2017.
E. Zitzler, M. Laumanns, and L. Thiele, “SPEA2: Improving the strength pareto evolutionary algorithm for multiobjective optimization,” in Evolutionary Methods for Design, Optimization, and Control, pp. 95–100, 2002.
Y. Tian, R. Cheng, X. Zhang, F. Cheng, and Y. Jin, “An Indicator Based Multi-Objective Evolutionary Algorithm with Reference Point Adaptation for Better Versatility,” IEEE Transactions on Evolutionary Computation, 2017.

Copyright

Copyright © 2018 Fan Cheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
