Hybrid Efficient Genetic Algorithm for Big Data Feature Selection Problems

Foundations of Science

Abstract

Because of the huge amount of data being generated from different sources, analyzing these data and extracting useful information from them has become a very complex task. The difficulty of big data optimization problems comes from many factors, such as the high number of features and the presence of missing data. Feature selection has become an important step in many data mining and machine learning algorithms: it reduces the dimensionality of the optimization problem and improves the performance of classification or clustering algorithms. In this paper, a set of hybrid and efficient genetic algorithms is proposed to solve the feature selection problem when the data has a large number of features. The proposed algorithms use a new gene-weighted mechanism that adaptively classifies the features into strongly relevant features, weak or redundant features, and unstable features during the evolution of the algorithm. Based on this classification, the proposed algorithm gives the strong features high priority and the weak features lower priority when generating new candidate solutions. At the same time, the proposed algorithm concentrates more on the unstable features, which sometimes appear in and sometimes disappear from the best solutions of the population. The performance of the proposed algorithms is investigated using different datasets and compared against other feature selection algorithms. The results show that the proposed algorithms can outperform the other feature selection algorithms and effectively enhance classification performance on the tested datasets.
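The gene-weighted idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the weighting formula, thresholds, or mutation rule, so everything below is an assumption made for illustration: the function names `classify_features` and `guided_mutation`, the thresholds `strong_at` and `weak_at`, and the probabilities `p_unstable` and `p_stable` are hypothetical and are not taken from the paper. The sketch labels each feature by how often it appears in the best solution of each generation, then biases mutation so that strong features tend to be switched on, weak features tend to be switched off, and unstable features are explored more often.

```python
import random
from typing import List


def classify_features(history: List[List[int]],
                      strong_at: float = 0.8,
                      weak_at: float = 0.2) -> List[str]:
    """Label each feature from a history of best-solution masks.

    history holds one 0/1 mask per generation. The thresholds are
    illustrative assumptions, not values from the paper.
    """
    n_gens = len(history)
    n_feats = len(history[0])
    labels = []
    for j in range(n_feats):
        # Fraction of generations in which feature j was selected.
        freq = sum(mask[j] for mask in history) / n_gens
        if freq >= strong_at:
            labels.append("strong")
        elif freq <= weak_at:
            labels.append("weak")
        else:
            labels.append("unstable")
    return labels


def guided_mutation(parent: List[int],
                    labels: List[str],
                    rng: random.Random,
                    p_unstable: float = 0.5,
                    p_stable: float = 0.05) -> List[int]:
    """Create a child mask, biased by the feature labels (assumed rule)."""
    child = list(parent)
    for j, label in enumerate(labels):
        if label == "unstable":
            # Explore unstable features more often by flipping the bit.
            if rng.random() < p_unstable:
                child[j] = 1 - child[j]
        elif rng.random() < p_stable:
            # Nudge strong features on and weak features off.
            child[j] = 1 if label == "strong" else 0
    return child


history = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 1],
]
labels = classify_features(history)
child = guided_mutation([0, 1, 0, 1], labels, random.Random(0))
```

In this toy history, the first feature is selected in every generation and the second in none, so they are labeled strong and weak; the last two appear intermittently and are labeled unstable. A real implementation would embed this step in a full genetic algorithm with a classifier-based fitness function, which the sketch leaves out.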

Author information

Corresponding author

Correspondence to Oguz Bayat.

About this article

Cite this article

Mohammed, T.A., Bayat, O., Uçan, O.N. et al. Hybrid Efficient Genetic Algorithm for Big Data Feature Selection Problems. Found Sci 25, 1009–1025 (2020). https://doi.org/10.1007/s10699-019-09588-6

  • DOI: https://doi.org/10.1007/s10699-019-09588-6
