A Similarity Function for Feature Pattern Clustering and High Dimensional Text Document Classification

Kotte, Vinay Kumar; Rajavelu, Srinivasan; Rajsingh, Elijah Blessing

doi:10.1007/s10699-019-09592-w

A Similarity Function for Feature Pattern Clustering and High Dimensional Text Document Classification

Published: 09 March 2019

Volume 25, pages 1077–1094, (2020)
Cite this article

Foundations of Science Aims and scope Submit manuscript

Vinay Kumar Kotte ORCID: orcid.org/0000-0002-3809-4722^1,2,
Srinivasan Rajavelu³ &
Elijah Blessing Rajsingh³

350 Accesses
11 Citations
Explore all metrics

Abstract

Text document classification and clustering is an important learning task which fits to both data mining and machine learning areas. The learning task throws several challenges when it is required to process high dimensional text documents. Word distribution in text documents plays a very key role in learning process. Research related to high dimensional text document classification and clustering is usually limited to application of traditional distance functions and most of the research contributions in the existing literature did not consider the word distribution in documents. In this research, we propose a novel similarity function for feature pattern clustering and high dimensional text classification. The similarity function proposed is used to carry supervised learning based dimensionality reduction. The important feature of this work is that the word distribution before and after dimensionality reduction is the same. Experiment results prove the proposed approach achieves dimensionality reduction, retains the word distribution and obtained better classification accuracies compared to other measures.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Document Classification: An Approach Using Feature Clustering

A Hybrid Dimension Reduction Technique for Document Clustering

Text Dimensionality Reduction for Document Clustering Using Hybrid Memetic Feature Selection

References

Abualigah, L. M., Khader, A. T., Al-Betar, M. A., & Alomari, O. A. (2017). Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering. Expert Systems with Applications, 84, 24–36.
Article Google Scholar
Adeli, E., Shi, F., An, L., Wee, C. Y., Wu, G. R., Wu, T., et al. (2016). Joint feature-sample selection and robust diagnosis of Parkinson’s disease from mri data. Neuroimage, 141, 206–219. https://doi.org/10.1016/j.neuroimage.2016.05.054.
Article Google Scholar
Aggarwal, C. (2007). Data streams models and algorithms. Cham: Springer.
Google Scholar
Aggarwal, C., Han, J., Wang, J., & Yu, P. (2003). A framework for clustering evolving data streams. In VLDB conference.
Aljawarneh, S., Radhakrishna, V., Kumar, P. V., & Janaki, V. (2016). A similarity measure for temporal pattern discovery in time series data generated by IoT. In 2016 international conference on engineering & MIS (ICEMIS), Agadir (pp. 1–4).
Aljawarneh, S. A., Radhakrishna, V., & Cheruvu, A. (2017a). Extending the Gaussian membership function for finding similarity between temporal patterns. In 2017 international conference on engineering & MIS (ICEMIS), Monastir (pp. 1–6).
Aljawarneh, S. A., Radhakrishna, V., Kumar, P. V., & Janaki, V. (2017b). G-SPAMINE: An approach to discover temporal association patterns and trends in internet of things. Future Generation Computer Systems, 74, 430–443. https://doi.org/10.1016/j.future.2017.01.013.
Article Google Scholar
Aljawarneh, S. A., & Vangipuram, R. (2018). GARUDA: Gaussian dissimilarity measure for feature representation and anomaly detection in Internet of things. The Journal of Super Computing. https://doi.org/10.1007/s11227-018-2397-3.
Article Google Scholar
Babcock, B., Babu, S., Datar, M., Motwani, R., & Widom, J. (2002). Models and issues in data stream systems. In Proceedings of PODS.
Bartell, B. T., Cottrell, G. W., & Belew, R. K. (1992). Latent semantic indexing is an optimal special case of multidimensional scaling. In Proceedings of SIGIR, ACM, USA (pp. 161–167).
Berka, T., & Vajteršic, M. (2011). Dimensionality reduction for information retrieval using vector replacement of rare terms. In Proceedings of TM.
Berka, T., & Vajteršic, M. (2013). Parallel rare term vector replacement: Fast and effective dimensionality reduction for text. Journal of Parallel and Distributed Computing, 73(3), 341–351.
Article Google Scholar
Bharti, K. K., & Singh, P. K. (2014). A three-stage unsupervised dimension reduction method for text clustering. Journal of Computational Science, 5(2), 156–169.
Article Google Scholar
Bingham, E., & Mannila, H. (2001). Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’01) (pp. 245–250). New York, NY: ACM. http://dx.doi.org/10.1145/502512.502546D.
Chang, J. H., & Lee, W. S. (2005). estWin: Online data stream mining of recent frequent item sets by sliding window method. Journal of Information Science, 31(2), 7690.
Article Google Scholar
Charikar, M., Callaghan, L., & Panigrahy, R. (2003). Better streaming algorithms for clustering problems. In Proceedings of 35th ACM symposium on theory of computing.
Chen, Y. C., Peng, W. C., & Lee, S. Y. (2015). Mining temporal patterns in time interval-based data. IEEE Transactions on Knowledge and Data Engineering, 27(12), 3318–3331.
Article Google Scholar
Cox, T. F., & Cox, M. A. A. (2001). Multidimensional scaling. Boca Raton: Chapman & Hall.
Google Scholar
Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (2004). Towards an adaptive approach for mining data streams in resource constrained environments. In Proceedings of sixth international conference on data warehousing and knowledge discovery. Lecture Notes in Computer Science (LNCS). Cham: Springer.
Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (2005). Mining data streams—A review. SIGMODC Record, 34(2), 18–26.
Article Google Scholar
Gama, J. (2013). Knowledge discovery from databases. Boca Raton: CRC Press.
Google Scholar
Ganguly, D., Leveling, J., & Jones, G. J. F. (2015). Context-driven dimensionality reduction for clustering text documents. In P. Majumder, M. Mitra, M. Agrawal, & P. Mehta (Eds.), Proceedings of the 7th forum for information retrieval evaluation (FIRE’15) (pp. 1–7). New York, NY: ACM.
Han, J., Kamber, M., & Pei, J. (Eds.). (2012a). Advanced cluster analysis. In The morgan kaufmann series in data management systems, data mining (3rd ed., pp. 497–541). Morgan Kaufmann. https://doi.org/10.1016/B978-0-12-381479-1.00011-3.
Han, J., Kamber, M., & Pei, J. (Eds.). (2012b). Classification: Basic concepts. In The morgan kaufmann series in data management systems, data mining (3rd ed., pp. 327–391). Morgan Kaufmann. https://doi.org/10.1016/B978-0-12-381479-1.00008-3.
Hanneke, S. (2016). The optimal sample complexity OF PAC learning. Journal of Machine Learning Research, 17(38), 1–15.
Google Scholar
He, C., Dong, Z., Li, R., & Zhong, Y. (2008). Dimensionality reduction for text using LLE. In 2008 international conference on natural language processing and knowledge engineering, Beijing (pp. 1–7).
Hyvarinen, A., Karhunen, J., & Oja, E. (2004). Independent component analysis. Hoboken: Wiley.
Google Scholar
Jiang, J. Y., Cheng, W. H., Chiou, Y. S., & Lee, S. J. (2011a). A similarity measure for text processing. In 2011 international conference on machine learning and cybernetics, Guilin (pp. 1460–1465). https://doi.org/10.1109/icmlc.2011.6016998.
Jiang, J. Y., Liou, R. J., & Lee, S. J. (2011b). A fuzzy self-constructing feature clustering algorithm for text classification. IEEE Transactions on Knowledge and Data Engineering, 23(3), 335–349. https://doi.org/10.1109/TKDE.2010.122.
Article Google Scholar
Johnson, R. A., & Wichern, D. W. (2007). Applied multivariate statistical analysis (6th ed.). Upper Saddle River: Prentice Hall.
Google Scholar
Lin, Y. S., Jiang, J. Y., & Lee, S. J. (2014). A similarity measure for text classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 26(7), 1575–1590. https://doi.org/10.1109/TKDE.2013.19.
Article Google Scholar
Lin, Y., et al. (2013). A similarity measure for text classification and clustering. IEEE Transactions of Knowledge and Data Engineering, 26, 1575–1590.
Article Google Scholar
Mallick, K., & Bhattacharyya, S. (2012). Uncorrelated local maximum margin criterion: An efficient dimensionality reduction method for text classification. Procedia Technology, 4, 370–374.
Article Google Scholar
Neagoe, V. E., & Neghina, E. C. (2016). Feature selection with ant colony optimization and its applications for pattern recognition in space imagery. In IEEE ICC.
Paatero, P., & Tapper, U. (1994). Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2), 111–126.
Article Google Scholar
Pang, G., Jin, H., & Jiang, S. (2013). An effective class-centroid-based dimension reduction method for text classification. In Proceedings of the 22nd international conference on World Wide Web (WWW’13 Companion) (pp. 223–224). New York, NY: ACM.
Phridviraj, M. S. B., Srinivas, C., & GuruRao, C. V. (2014). Clustering text data streams. A tree based approach with ternary function and ternary feature vector. Procedia Computer Science, 31, 976–984.
Article Google Scholar
Radhakrishna, V., Aljawarneh, S. A., Janaki, V., & Kumar, P. V. (2017b). Looking into the possibility for designing normal distribution based dissimilarity measure to discover time profiled association patterns. In 2017 international conference on engineering & MIS (ICEMIS), Monastir (pp. 1–5).
Radhakrishna, V., Aljawarneh, S. A., Kumar, P. V, et al. (2017c). ASTRA—A novel interest measure for unearthing latent temporal associations and trends through extending basic Gaussian membership function. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-017-5280-y.
Radhakrishna, V., Aljawarneh, S. A., Kumar, P. V., & Choo, K.-K. R. (2016a). A novel fuzzy Gaussian-based dissimilarity measure for discovering similarity temporal association patterns. Soft Computing, 1, 1. https://doi.org/10.1007/s00500-016-2445-y.
Article Google Scholar
Radhakrishna, V., Aljawarneh, S. A., Kumar, P. V., & Janaki, V. (2017a). A novel fuzzy similarity measure and prevalence estimation approach for similarity profiled temporal association pattern mining. Future Generation Computer Systems, 1, 1. https://doi.org/10.1016/j.future.2017.03.016.
Article Google Scholar
Radhakrishna, V., Kumar, P. V., Aljawarneh, S. A., & Janaki, V. (2017e). Design and analysis of a novel temporal dissimilarity measure using Gaussian membership function. In 2017 international conference on engineering & MIS (ICEMIS), Monastir (pp. 1–5).
Radhakrishna, V., Kumar, P. V., & Janaki, V. (2015). An approach for mining similarity profiled temporal association patterns using Gaussian based dissimilarity measure. In Proceedings of the international conference on engineering & MIS 2015 (ICEMIS’15).
Radhakrishna, V., Kumar, P. V., & Janaki, V. (2016b). A computationally optimal approach for extracting similar temporal patterns. In 2016 international conference on engineering & MIS (ICEMIS), Agadir (pp. 1–6).
Radhakrishna, V., Kumar, P. V., & Janaki, V. (2016c). Looking into the possibility of novel dissimilarity measure to discover similarity profiled temporal association patterns in IoT. In 2016 international conference on engineering & MIS (ICEMIS), Agadir (pp. 1–6).
Radhakrishna, V., Kumar, P. V., & Janaki, V. (2017a). A computationally efficient approach for mining similar temporal patterns. In Matoušek R (Eds.), Recent advances in soft computing. ICSC-MENDEL 2016. Advances in intelligent systems and computing (Vol. 576). Springer, Cham.
Radhakrishna, V., Kumar, P. V., & Janaki, V. (2017e). SRIHASS—A similarity measure for discovery of hidden time profiled temporal associations. Multimedia Tools and Applications.. https://doi.org/10.1007/s11042-017-5185-9.
Article Google Scholar
Radhakrishna, V., Kumar, P. V., & Janaki, V. (2017g). Design and analysis of similarity measure for discovering similarity profiled temporal association patterns. IADIS International Journal on Computer Science and Information Systems, 12(1), 45–60.
Google Scholar
Radhakrishna, V., Kumar, P. V., & Janaki, V. (2017h). Normal distribution based similarity profiled temporal association pattern mining (N-SPAMINE). Database Systems Journal, 7(3), 22–33.
Google Scholar
Radhakrishna, V., Kumar, P. V., & Janaki, V. (2018). Krishna Sudarsana: A Z-space similarity measure. In Proceedings of the fourth international conference on engineering & MIS 2018 (ICEMIS’18). New York, NY: ACM, Article 44, 4 pp.
Radhakrishna, V., Kumar, P. V., Janaki, V., & Aljawarneh, S. (2016d). A similarity measure for outlier detection in timestamped temporal databases. In 2016 international conference on engineering & MIS (ICEMIS), Agadir (pp. 1–5).
Radhakrishna, V., Kumar, P. V., Janaki, V., & Aljawarneh, S. (2016e). A computationally efficient approach for temporal pattern mining in IoT. In 2016 international conference on engineering & MIS (ICEMIS), Agadir (pp. 1–4).
Radhakrishna, V., Kumar, P. V., Janaki, V., & Cheruvu, A. (2017i). A dissimilarity measure for mining similar temporal association patterns. IADIS International Journal on Computer Science and Information Systems, 12(1), 126–142.
Google Scholar
Sammulal, P., Usha Rani, Y., & Yepuri, A. (2017). A class based clustering approach for imputation and mining of medical records (CBC-IM). IADIS International Journal on Computer Science & Information Systems, 12(1), 61–74.
Google Scholar
SureshReddy, G., Rajinikanth, T. V., & Ananda Rao, A. (2014). Design and analysis of novel similarity measure for clustering and classification of high dimensional text documents. In B. Rachev & A. Smrikarov (Eds.), Proceedings of the 15th international conference on computer systems and technologies (CompSysTech’14) (pp. 194–201). New York, NY: ACM. http://dx.doi.org/10.1145/2659532.2659615.
Tatbul, N., & Zdonik, S. (2006). A subset-based load shedding approach for aggregation queries over data streams. In Proceedings of international conference on very large data bases (VLDB).
Tsai, S. C., Jiang, J. Y., Wu, C., & Lee, S. J. (2009). A fuzzy similarity-based approach for multi-label document classification. In 2009 second international workshop on computer science and engineering, Qingdao (pp. 59–63). https://doi.org/10.1109/wcse.2009.766.
Uguz, H. (2012). A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowledge-Based Systems, 24, 1024–1032.
Article Google Scholar
Unler, A., Murat, A., & Chinnam, R. (2011). mr2pso: A maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification. Information Sciences, 181, 4625–4641.
Article Google Scholar
Usha Rani, Y., & Sammulal, P. (2017). An approach for imputation of medical records using novel similarity measure. In R. Matoušek (Eds.), Recent advances in soft computing. ICSC-MENDEL 2016. Advances in Intelligent Systems and Computing (Vol. 5). Cham: Springer.
Usha Rani, Y., Sammulal, P., & Golla, M. (2018). An efficient approach for imputation and classification of medical data values using class-based clustering of medical records. Computers & Electrical Engineering, 66, 487–504.
Article Google Scholar
VinayKumar, K., Srinivasan, R., & Singh, E. B. (2015). A feature clustering approach for dimensionality reduction and classification. In R. Matoušek (Eds.), Mendel 2015. ICSC-MENDEL 2016. Advances in intelligent systems and computing (Vol. 378). Cham: Springer.
Xu, X., Liang, T., Zhu, J., Zheng, D., & Sun, T. (2018). Review of classical dimensionality reduction and sample selection methods for large-scale data processing. Neurocomputing. ISSN 0925-2312.

Download references

Author information

Authors and Affiliations

Department of CSE, Karunya Institute of Technology and Sciences (Deemed to be university), Coimbatore, Tamilnadu, India
Vinay Kumar Kotte
Department of CSE, Kakatiya Institute of Technology and Science, Warangal, Telangana, India
Vinay Kumar Kotte
Karunya Institute of Technology and Sciences (Deemed to be university), Coimbatore, Tamilnadu, India
Srinivasan Rajavelu & Elijah Blessing Rajsingh

Authors

Vinay Kumar Kotte
View author publications
You can also search for this author in PubMed Google Scholar
Srinivasan Rajavelu
View author publications
You can also search for this author in PubMed Google Scholar
Elijah Blessing Rajsingh
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Vinay Kumar Kotte.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kotte, V.K., Rajavelu, S. & Rajsingh, E.B. A Similarity Function for Feature Pattern Clustering and High Dimensional Text Document Classification. Found Sci 25, 1077–1094 (2020). https://doi.org/10.1007/s10699-019-09592-w

Download citation

Published: 09 March 2019
Issue Date: December 2020
DOI: https://doi.org/10.1007/s10699-019-09592-w

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Similarity Function for Feature Pattern Clustering and High Dimensional Text Document Classification

Abstract

Access this article

Similar content being viewed by others

Document Classification: An Approach Using Feature Clustering

A Hybrid Dimension Reduction Technique for Document Clustering

Text Dimensionality Reduction for Document Clustering Using Hybrid Memetic Feature Selection

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A Similarity Function for Feature Pattern Clustering and High Dimensional Text Document Classification

Abstract

Access this article

Similar content being viewed by others

Document Classification: An Approach Using Feature Clustering

A Hybrid Dimension Reduction Technique for Document Clustering

Text Dimensionality Reduction for Document Clustering Using Hybrid Memetic Feature Selection

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation