Abstract
Because of the huge amount of data being generated from different sources, analyzing these data and extracting useful information from them has become a very complex task. The difficulty of big data optimization problems stems from many factors, such as the high number of features and the presence of missing data. Feature selection has therefore become an important step in many data mining and machine learning algorithms: it reduces the dimensionality of the optimization problem and improves the performance of classification or clustering algorithms. In this paper, a set of hybrid, efficient genetic algorithms is proposed to solve the feature selection problem when the data has a large number of features. The proposed algorithms use a new gene-weighted mechanism that adaptively classifies the features into strongly relevant features, weak or redundant features, and unstable features during the evolution of the algorithm. Based on this classification, the proposed algorithms give strong features high priority and weak features low priority when generating new candidate solutions. At the same time, the proposed algorithms concentrate more on unstable features, which sometimes appear in and sometimes disappear from the best solutions of the population. The performance of the proposed algorithms is investigated using different datasets and compared with other feature selection algorithms. The results show that the proposed algorithms can outperform the other feature selection algorithms and effectively enhance classification performance on the tested datasets.
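The gene-weighted idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that a feature's weight is simply the fraction of the current best solutions that select it, and the function names, the thresholds `hi` and `lo`, and the 0.5 sampling probability for unstable features are all hypothetical choices made for illustration.

```python
import random


def gene_weights(best_masks):
    """Fraction of the best solutions in which each feature is selected."""
    n = len(best_masks)
    return [sum(column) / n for column in zip(*best_masks)]


def classify(weights, hi=0.8, lo=0.2):
    """Label each feature; hi and lo are assumed thresholds, not paper values."""
    labels = []
    for w in weights:
        if w >= hi:
            labels.append("strong")
        elif w <= lo:
            labels.append("weak")
        else:
            labels.append("unstable")
    return labels


def new_candidate(weights, labels, rng):
    """Sample a feature mask: strong features are mostly kept, weak ones
    mostly dropped, and unstable ones explored with an even coin flip."""
    mask = []
    for w, label in zip(weights, labels):
        p = 0.5 if label == "unstable" else w
        mask.append(1 if rng.random() < p else 0)
    return mask


# Four best solutions over four features (1 = feature selected).
elite = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
weights = gene_weights(elite)
labels = classify(weights)
candidate = new_candidate(weights, labels, random.Random(0))
print(weights)
print(labels)
print(candidate)
```

In this toy population the first feature appears in every best solution and is labeled strong, the second never appears and is labeled weak, and the last two appear only sometimes and are labeled unstable, so they are the ones the sampler keeps exploring.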
About this article
Cite this article
Mohammed, T.A., Bayat, O., Uçan, O.N. et al. Hybrid Efficient Genetic Algorithm for Big Data Feature Selection Problems. Found Sci 25, 1009–1025 (2020). https://doi.org/10.1007/s10699-019-09588-6
DOI: https://doi.org/10.1007/s10699-019-09588-6