BY-NC-ND 3.0 license Open Access Published by De Gruyter October 13, 2015

Automatic Data Clustering Using Parameter Adaptive Harmony Search Algorithm and Its Application to Image Segmentation

  • Vijay Kumar, Jitender Kumar Chhabra and Dinesh Kumar

Abstract

In this paper, the problem of automatic data clustering is treated as a search for the optimal number of clusters such that the obtained partitions are optimized. The automatic data clustering technique uses a recently developed parameter adaptive harmony search (PAHS) as its underlying optimization strategy. It uses a real-coded, variable-length harmony vector, which is able to detect the number of clusters automatically. Newly developed concepts of "threshold setting" and "cutoff" are used to refine the optimization strategy. The assignment of data points to cluster centers is based on a newly developed weighted Euclidean distance instead of the Euclidean distance. The developed approach is able to detect any type of cluster irrespective of its geometric shape. It is compared with four well-established clustering techniques. It is further applied to the automatic segmentation of grayscale and color images, and its performance is compared with other existing techniques. For the real-life datasets, statistical analysis is performed. The technique demonstrates its effectiveness and usefulness.

1 Introduction

Data clustering is an unsupervised data mining technique that organizes a dataset into a number of groups (clusters) such that data points within a cluster are more similar to each other than to data points belonging to other clusters. It has been applied in a wide variety of fields including engineering, biology, social sciences, astronomy, geography, and computer science [18]. Many clustering algorithms have been developed to date. They are broadly classified into five categories: hierarchical clustering, partitional clustering, density-based clustering, grid-based clustering, and model-based clustering. In this paper, we focus on hard partitional clustering, as it is widely used in pattern recognition [19]. In partitional clustering algorithms, each cluster can be represented by its centroid, and a solution can be represented by the set of cluster centroids.

Most partitional clustering algorithms require a predefined number of clusters. However, in real-life applications, there is no prior information regarding the number of clusters, and the optimal number cannot be estimated beforehand. If the estimated number of clusters is smaller than the actual number present, two or more clusters are merged into one cluster. If the estimated number is larger than the actual number present, clusters are decomposed into smaller clusters. Therefore, the number of clusters is an important parameter for clustering, and it greatly affects the performance of clustering algorithms. Researchers have used metaheuristic techniques to solve this problem. Recently, techniques based on the genetic algorithm [6], differential evolution [13], tabu search [30], simulated annealing [7], particle swarm optimization [31], and gravitational search [22] have been developed for automatic clustering.

The main contribution of this article is the development of an automatic data clustering technique (ACPAHS) that uses the recently developed parameter adaptive harmony search (PAHS). The ACPAHS algorithm treats the number of clusters as a variable and evolves it to an optimal value. In addition to determining the number of clusters, it also explores the optimal cluster centroids. The statistical properties of the dataset are used to set the threshold and cutoff values that determine the number of clusters. A weighted Euclidean distance (incorporating the threshold values) is used for assigning data points to particular clusters. A novel fitness function has been developed to refine the search efficiently. ACPAHS has been evaluated on well-known real-life datasets and compared with well-known clustering techniques such as automatic clustering using modified differential evolution (ACDE), dynamic clustering using particle swarm optimization (DCPSO), and genetic clustering with an unknown number of clusters (GCUK). The proposed approach has further been applied to five well-known grayscale images and four color images for segmentation.

The rest of this paper is structured as follows. Section 2 defines the basic concepts used in clustering and gives a brief overview of previous work done in the field of metaheuristic-based clustering techniques. Section 3 outlines the proposed ACPAHS algorithm. Section 4 presents the real-life datasets, parameter setting, and experimentation results. Conclusions are given in Section 5.

2 Background

In this section, we first describe the basic concepts of the partitional clustering problem and the parameter adaptive harmony search algorithm. Thereafter, a brief overview of related work is given.

2.1 Basics of Clustering Problem

Let X={x1, x2, x3, …, xn} be a set of n data points and Xn×d be the data matrix with n rows and d columns. Each data point is described by d features, where xj=(xj1, xj2, …, xjd) is a vector representing the jth data point, and xji represents the ith feature of xj. The goal of a partitional clustering algorithm is to determine a partition C={C1, C2, …, CK} such that [36]

(1) $C_i \neq \phi,\quad i = 1, 2, \ldots, K$

(2) $\bigcup_{i=1}^{K} C_i = X$

(3) $C_i \cap C_j = \phi,\quad i \neq j$

Data points that belong to the same cluster should be as similar to each other as possible, while data points that belong to different clusters should be as dissimilar as possible. For this, an appropriate fitness function is required to judge the quality of a partition. A well-known fitness function is the mean-square error, which is defined as [5]:

(4) $f(X, C) = \sum_{i=1}^{n} \min\{\|x_i - C_l\|^2 \mid l = 1, 2, \ldots, K\}$

where $\|x_i - C_l\|$ denotes the similarity between data point xi and cluster center Cl. The Euclidean distance is the most commonly used similarity metric, which is defined as:

(5) $d(x_i, x_j) = \sqrt{\sum_{m=1}^{d} |x_{im} - x_{jm}|^2}$

The Euclidean distance is used as the similarity metric in this article. The clustering algorithm is thus modeled as an optimization problem.
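For concreteness, the following short Python sketch (using NumPy; the data points and candidate centers are purely illustrative and not taken from the paper) shows how the mean-square error criterion of Eq. (4) and the Euclidean distance of Eq. (5) can be computed.

```python
import numpy as np

def euclidean(x, c):
    """Euclidean distance between a data point x and a cluster center c (Eq. 5)."""
    return np.sqrt(np.sum((x - c) ** 2))

def mse_criterion(X, centers):
    """Mean-square error criterion (Eq. 4): each point contributes the squared
    distance to its nearest cluster center."""
    total = 0.0
    for x in X:
        total += min(np.sum((x - c) ** 2) for c in centers)
    return total

# Tiny illustrative example (hypothetical data, two candidate centers)
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.2]])
centers = np.array([[1.0, 1.0], [5.0, 5.0]])
print(mse_criterion(X, centers))
```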

2.2 Parameter Adaptive Harmony Search Algorithm

The concept of the harmony search (HS) algorithm was first presented by Geem et al. [16]. It is a metaheuristic algorithm that imitates the music improvisation process, in which musicians improvise the pitches of their instruments while searching for a perfect state of harmony. It has been successfully applied to a wide variety of optimization problems such as timetabling [2, 3], structure design [25], vehicle routing [17], Sudoku puzzle solving [15], and tour planning [17]. Variants of HS have been reported in the literature [4, 26, 35].

To enrich the search behavior and to avoid becoming trapped in a local optimum, Kumar et al. [21] proposed the parameter adaptive harmony search (PAHS) algorithm. In PAHS, two control parameters, the harmony memory consideration rate (HMCR) and the pitch adjustment rate (PAR), are allowed to change dynamically. They explored different cases of linear and exponential change. The computational procedure of the PAHS algorithm can be summarized as follows [21]:

  1. Initialization of the optimization problem and algorithm parameters: The optimization problem can be defined as minimize (or maximize) f(x) such that xi∈[LBi, UBi], i=1, 2, …, n, where f(x) is the objective function, x=(x1, x2, …, xn) is the set of decision variables, and n is the number of decision variables. LBi and UBi are the lower and upper bounds of decision variable xi, respectively. The parameters of PAHS are the harmony memory size (HMS), the range of the harmony memory consideration rate (HMCRmin, HMCRmax), the range of the pitch adjustment rate (PARmin, PARmax), the range of the distance bandwidth (BWmin, BWmax), and the number of improvisations (NI).

  2. Initialization of harmony memory (HM): The HM consists of HMS harmony vectors. It is filled with randomly generated solution vectors and sorted by the values of the objective function f(x).

  3. Improvisation of a new harmony: A new harmony vector $x' = (x'_1, x'_2, \ldots, x'_n)$ is generated using three rules: memory consideration, pitch adjustment, and random selection, as follows:

    • For each i∈[1, n] do

      $\mathrm{HMCR} = \mathrm{HMCR}_{min} + \frac{\mathrm{HMCR}_{max} - \mathrm{HMCR}_{min}}{NI} \times gn$

      $\mathrm{PAR} = \mathrm{PAR}_{max} \times e^{\left(\frac{\ln(\mathrm{PAR}_{min}/\mathrm{PAR}_{max})}{NI} \times gn\right)}$

      $\mathrm{BW} = \mathrm{BW}_{max} \times e^{\left(\frac{\ln(\mathrm{BW}_{min}/\mathrm{BW}_{max})}{NI} \times gn\right)}$

      where gn denotes the current improvisation number.

    • If U(0, 1) ≤ HMCR then /* memory consideration */

      begin

      x′i = xi^l, where l ∼ U(1, 2, …, HMS)

      If U(0, 1) ≤ PAR then /* pitch adjustment */

      begin

      x′i = x′i ± BW × Rand, Rand ∼ U(0, 1)

      endif

      else /* random selection */

      x′i = LBi + (UBi − LBi) × Rand

      endif

      done

  4. Update of the HM: If the newly generated harmony vector $x' = (x'_1, x'_2, \ldots, x'_n)$, evaluated in terms of the objective function value, is better than the worst harmony vector in the HM, it replaces the worst harmony vector. This is the step of the algorithm where the decision is taken whether or not the new harmony vector is to be included in the HM.

  5. Checking the termination criterion: If the maximum number of improvisation steps is reached, the computation is terminated, and the algorithm returns the best harmony vector. Otherwise, Steps 3 and 4 are repeated. A short Python sketch of this improvisation loop is given after these steps.
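The sketch below illustrates the PAHS procedure summarized above for a generic minimization problem. It is a minimal illustration under stated assumptions, not the authors' reference implementation: the sphere objective, default parameter ranges, and all function and variable names are our own choices.

```python
import numpy as np

def pahs_minimize(f, lb, ub, hms=30, ni=200,
                  hmcr=(0.7, 0.99), par=(0.01, 0.99), bw=(0.001, 0.01)):
    """Minimal sketch of parameter adaptive harmony search (PAHS) for minimization.
    f: objective function; lb, ub: bound arrays of length n."""
    n = len(lb)
    hm = lb + (ub - lb) * np.random.rand(hms, n)        # harmony memory
    fit = np.array([f(v) for v in hm])
    for gn in range(1, ni + 1):
        # control parameters change dynamically with the improvisation number gn
        hmcr_g = hmcr[0] + (hmcr[1] - hmcr[0]) / ni * gn
        par_g = par[1] * np.exp(np.log(par[0] / par[1]) / ni * gn)
        bw_g = bw[1] * np.exp(np.log(bw[0] / bw[1]) / ni * gn)
        new = np.empty(n)
        for i in range(n):
            if np.random.rand() <= hmcr_g:              # memory consideration
                new[i] = hm[np.random.randint(hms), i]
                if np.random.rand() <= par_g:           # pitch adjustment
                    new[i] += np.random.choice([-1, 1]) * bw_g * np.random.rand()
            else:                                       # random selection
                new[i] = lb[i] + (ub[i] - lb[i]) * np.random.rand()
        new = np.clip(new, lb, ub)
        worst = np.argmax(fit)                          # replace worst vector if improved
        new_fit = f(new)
        if new_fit < fit[worst]:
            hm[worst], fit[worst] = new, new_fit
    best = np.argmin(fit)
    return hm[best], fit[best]

# Usage on a toy sphere function (assumed example)
sol, val = pahs_minimize(lambda x: np.sum(x ** 2), np.full(3, -5.0), np.full(3, 5.0))
```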

2.3 Related Works

Recently, researchers have tried to develop automatic clustering techniques using metaheuristic algorithms. Lee and Antonsson [23] utilized an evolutionary strategy-based method for clustering datasets dynamically. They applied variable-length genomes to search for both the cluster centroids and the optimal number of clusters. Bandyopadhyay and Maulik [6] developed a variable string-length genetic algorithm for automatic clustering, which does not require any assumption about the shape of the dataset. Their algorithm was called genetic clustering for unknown K (GCUK). Each chromosome encoded the centers of a number of clusters, which could vary from chromosome to chromosome. They used real-valued cluster center encoding along with modified versions of the crossover and mutation operations. The DB cluster validity index was used to compute the fitness of the chromosomes. For each chromosome, the cluster centers were extracted, and a partition was then obtained by assigning the data points to clusters based on the minimum distance criterion. The cluster centers encoded in the chromosome were then replaced by the mean points of the respective clusters.

Ye and Chen [37] used a hybridization of PSO and the K-means algorithm for automatically determining the cluster centroids of geometrically structured datasets. Omran et al. [29] came up with an automatic hard clustering scheme called DCPSO. The DCPSO approach automatically determined the optimal number of clusters and cluster centers with minimal user interference. The algorithm started off by partitioning the dataset into a relatively large number of clusters to reduce the effect of initialization. Binary PSO was used to select the optimal number of clusters. Finally, the centers of the chosen clusters were refined through the K-means algorithm. The binary PSO was then applied again on the cluster centers to find a new optimal number of clusters in the dataset. This process was repeated until convergence was reached. The algorithm was applied to the segmentation of natural, synthetic, and multispectral images.

Abdule-Wahab et al. [1] used a scatter search for automatic clustering. Jarboui et al. [20] used combinatorial particle swarm optimization (CPSO) to solve partitional clustering problems. Each particle represented a candidate solution to the clustering problem. A swarm of particles was initialized and flew through the solution space, targeting the optimal solution. Pan and Cheng [30] proposed an evolution-based tabu search approach (ETSA) that treated the number of clusters as a variable and evolved it to an optimal number of clusters. The ETSA consisted of two stages: quantizing the given dataset into a number of user-specified clusters and searching for the clustering structure with the best cluster validity.

Das et al. [13] presented a new DE-based strategy, called ACDE. The conventional DE algorithm was modified to improve its convergence properties. A new representation scheme for the search variables was also developed to find the optimal number of clusters. Two different sets of experiments with two different fitness functions, CS and DB measures, were conducted. Das et al. [12] used the modified version of classical PSO. Their algorithm employed a kernel-induced similarity measure instead of the sum-of-squares distance. A new particle representation scheme was used for selecting the optimal number of clusters. The kernel function was used to cluster the data that is linearly nonseparable in the original input space.

Das and Konar [10] proposed an evolutionary-fuzzy clustering algorithm for automatically grouping the pixels of an image. They incorporated the fuzzy concept in ACDE for image segmentation. Das et al. [14] utilized the bacterial evolutionary algorithm for automatic clustering by using a population of variable-length chromosomes to encode an entire grouping of the data. They incorporated a new operation named chromosome repair and modified both mutation and gene transfer operations to handle variable-size chromosomes. Das and Sil [11] used a modified differential evolution for clustering the pixels of a grayscale image. They used the same concept of ACDE with a kernel-induced similarity measure. Quadfel et al. [31] presented a new automatic clustering algorithm based on a modified version of particle swarm optimization. Their algorithm used variable-length particles, which evolve a number of clusters dynamically using mutation operators.

Saha et al. [32, 33] proposed variable string-length-based automatic clustering techniques that utilize symmetry-based distances. Lee and Chen [24] developed an improved DE algorithm with an oscillation strategy for automatic data clustering. Cai and Gong [9] applied differential evolution based on the point symmetry-based cluster validity index for automatic data clustering. The validity of the corresponding partitioning was measured through the point symmetry-based cluster validity index, and a Kd-tree-based nearest neighbor search was used to reduce the complexity of finding the closest symmetric point. Masoud et al. [27] proposed a modified version of combinatorial PSO (CPSO-II) for automatically finding the best number of clusters while simultaneously categorizing the data points. CPSO-II used a renumbering procedure as a preprocessing step, and several extended PSO operators were used to increase the population diversity and remove redundant particles. The renumbering procedure increased the diversity of the population, the speed of convergence, and the quality of solutions. Recently, Kumar et al. [22] proposed a novel automatic clustering technique using the gravitational search algorithm. They used the variance present in the dataset for clustering.

The hybridization of metaheuristic algorithms has also been used for clustering problems. Niknam and Amiri [28] proposed a cluster optimization algorithm based on the combination of PSO, ant colony optimization, and K-means, and Supratid and Kim [34] did the same using a combination of GA, ACO, and fuzzy C-means. However, the parameter adaptive harmony search algorithm has not yet been applied to automatic data clustering of real-life datasets or to image segmentation.

3 Proposed Automatic Data Clustering Using PAHS

3.1 Mathematical Foundation

The ACPAHS is inspired by the well-developed ACDE [13]. The following issues with ACDE motivated the development of a new algorithm for automatic data clustering:

  1. The threshold setting corresponding to each cluster center does not consider the variance of the data points belonging to that cluster center.

  2. It uses a fixed value of the threshold cutoff despite the fact that this value is dataset dependent.

  3. The effect of the threshold value is not considered in the refinement of the cluster centers.

Here, we investigate these issues and propose solutions for the automatic data clustering approach.

  1. Threshold Setting: Das et al. [13] did not consider the variation present in the dataset for threshold setting, which decides the actual number of clusters. Based on this observation, a new threshold setting concept for cluster centers is proposed. The threshold value for each cluster center is set to the corresponding within-cluster variation. Hence, the proposed selection threshold measures the relevance of each cluster center.

    (6) $Th_l = \sqrt{\frac{1}{n_l}\sum_{i=1}^{n_l}\left(x_i^l - C_l\right)^2}$, where $l = 1, 2, \ldots, K_{max}$

    Here, Thl denotes the selection threshold corresponding to cluster l, xil represents the ith data point belonging to cluster Cl, and nl denotes the number of data points belonging to cluster Cl.

  2. Cutoff Threshold: Das et al. [13] fixed the value of the cutoff threshold at 0.5. However, it cannot be fixed to a particular value, because it decides the optimal number of clusters present in the given dataset. Hence, it should depend on the dataset.

    Based on this fact, a novel cutoff threshold is proposed. The cutoff threshold value is set to the mean of the selection thresholds over all cluster centers. The cutoff threshold (Thcutoff) is mathematically formulated as:

    (7) $Th_{cutoff} = \mathrm{mean}(Th_{total})$

    where Thtotal is the sum of the selection thresholds over all cluster centers and is defined as:

    (8) $Th_{total} = \sum_{l=1}^{K_{max}} Th_l$
  3. Weighted Cluster Center Computation: Das et al. [13] did not consider the effect of the threshold value during cluster center refinement. To include this effect, a weighted Euclidean distance that accounts for the threshold is used in the proposed approach. It is mathematically formulated as:

    (9) $d_w(x_i, C_j) = \sqrt{\sum_{m=1}^{d} w_j^2 \left|x_{im} - C_{jm}\right|^2}$

    where wj is the threshold assigned to cluster center Cj. The larger the value of wj, the better the jth cluster center is for clustering. A short Python sketch of these computations follows this list.
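The following Python sketch illustrates how the selection thresholds (Eq. 6), the cutoff threshold (Eqs. 7 and 8), and the weighted Euclidean distance (Eq. 9) could be computed. It assumes the equation reconstructions above and NumPy; the function and variable names are ours, not the authors'.

```python
import numpy as np

def selection_thresholds(X, centers, labels):
    """Eq. (6): threshold of each cluster = within-cluster variation of the
    points currently assigned to that cluster center."""
    th = np.zeros(len(centers))
    for l, c in enumerate(centers):
        members = X[labels == l]
        if len(members) > 0:
            th[l] = np.sqrt(np.mean(np.sum((members - c) ** 2, axis=1)))
    return th

def cutoff_threshold(th):
    """Eqs. (7)-(8): cutoff = mean of the selection thresholds."""
    return np.sum(th) / len(th)

def weighted_distance(x, center, w):
    """Eq. (9): weighted Euclidean distance with cluster weight w."""
    return np.sqrt(np.sum((w ** 2) * (x - center) ** 2))
```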

3.2 Proposed Automatic Data Clustering Algorithm

The steps of the automatic data clustering using PAHS (ACPAHS) are shown as a flow chart in Figure 1. The pseudo-code of ACPAHS is given in Figure 2, and the steps of the algorithm are given below.

Figure 1: Flow Chart of the ACPAHS Algorithm.

Figure 2: Pseudo-Code of the ACPAHS Algorithm.

Algorithm: ACPAHS

  1. Initialize the algorithm parameters, such as HMS, HMCR, PAR, BW, maximum number of improvisation steps, and maximum number of clusters (Kmax).

  2. Initialize each harmony vector such that it contains Kmax number of randomly selected cluster centers and corresponding values of activation thresholds (see Section 3.2.1).

  3. Repeat the following steps until the maximum number of improvisation steps is reached:

    1. Find out the active cluster centers in each harmony vector with the help of the rule mentioned in Section 3.2.3.

    2. For each data point, calculate its weighted Euclidean distance from all the active cluster centers of the ith harmony vector.

    3. Assign each data point to the active cluster center to which its distance is minimum.

    4. If the number of data points belonging to any cluster is less than two, reinitialize the cluster centers of that harmony vector by the average computation described in Section 3.2.4.

    5. Update the harmony vectors using the PAHS algorithm outlined in Section 2.2. The fitness of the harmony vectors is used to guide the search process mentioned in Section 3.2.5.

  4. The best harmony vector will yield the optimal cluster centers and the optimal number of clusters at the final improvisation step.

These steps of the ACPAHS algorithm are described in the following subsections.

3.2.1 Solution Representation

In ACPAHS, for n data points, each having d dimensions, and a user-specified maximum number of clusters Kmax, a harmony vector consists of real numbers of dimension Kmax+Kmax×d. The first Kmax entries are threshold values corresponding to the cluster centers, which decide whether each cluster is activated or not. The remaining entries encode the Kmax cluster centers, each having d dimensions. As an illustration, consider the following example.

Example 1. Let Kmax=4 and d=3, i.e. the user-specified maximum number of clusters is four and the data space is three dimensional. Then, the harmony vector is (0.3, 0.7, 0.4, 0.8, 4.9, 3.2, 1.6, 5.7, 4.4, 1.0, 6.9, 3.0, 4.9, 7.7, 3.1, 2.4).

The first four entries (0.3, 0.7, 0.4, 0.8) represent selection thresholds corresponding to the clusters, and the remaining entries indicate four cluster centers, i.e. (4.9, 3.2, 1.6), (5.7, 4.4, 1.0), (6.9, 3.0, 4.9), and (7.7, 3.1, 2.4).
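As a small illustration of this encoding, the following Python sketch decodes such a harmony vector into its thresholds and cluster centers. The values are taken from Example 1; the helper function name is ours.

```python
import numpy as np

def decode_harmony(vector, k_max, d):
    """Split a harmony vector into selection thresholds and cluster centers."""
    thresholds = np.asarray(vector[:k_max])
    centers = np.asarray(vector[k_max:]).reshape(k_max, d)
    return thresholds, centers

vec = [0.3, 0.7, 0.4, 0.8,
       4.9, 3.2, 1.6,  5.7, 4.4, 1.0,  6.9, 3.0, 4.9,  7.7, 3.1, 2.4]
th, centers = decode_harmony(vec, k_max=4, d=3)
# th -> [0.3 0.7 0.4 0.8]; centers -> four centers of dimension 3
```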

3.2.2 Harmony Memory Initialization

The Kmax cluster centers encoded in each harmony vector are initialized to Kmax randomly chosen data points from the dataset. The selection thresholds are randomly generated in the range [0, 1]. This process is repeated for each of the HMS harmony vectors, where HMS is the size of the harmony memory.
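As a minimal sketch of this initialization (assumed structure, not the authors' code; names are ours):

```python
import numpy as np

def init_harmony_memory(X, hms, k_max):
    """Each harmony vector: k_max random thresholds in [0, 1] followed by
    k_max cluster centers drawn randomly from the dataset X (n x d)."""
    n, d = X.shape
    memory = []
    for _ in range(hms):
        thresholds = np.random.rand(k_max)
        idx = np.random.choice(n, size=k_max, replace=False)
        centers = X[idx].ravel()
        memory.append(np.concatenate([thresholds, centers]))
    return np.array(memory)       # shape: (hms, k_max + k_max * d)
```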

3.2.3 Active Cluster Center Extraction

A cluster center in a harmony vector is active (selected) depending on its corresponding threshold value. If the threshold value is greater than the cutoff threshold value, the corresponding cluster center in the harmony vector is activated for partitioning the associated dataset. The rule for active cluster center extraction is given below:

  • If Thi,j > Thcutoff then the jth cluster center is active

  • Else the jth cluster center is inactive

Here, Thi,j denotes the selection threshold of the jth cluster center in the ith harmony vector. Thcutoff is the cutoff threshold, whose computation is described in Section 3.1.

When harmony vectors are updated, there is a chance that none of the thresholds is greater than the cutoff threshold value. To eliminate this problem, we randomly select two thresholds and reinitialize them to values greater than the cutoff threshold, ensuring that the minimum possible number of clusters is two.

Example 2. The selection thresholds in the harmony vector considered in Example 1 are (0.3, 0.7, 0.4, 0.8). Let the cutoff threshold be 0.6. Then, according to the activation rule above, only two thresholds are higher than 0.6 (i.e. 0.7 and 0.8), and the corresponding second (5.7, 4.4, 1.0) and fourth (7.7, 3.1, 2.4) cluster centers are activated for partitioning the dataset.
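A short Python sketch of this activation rule is given below. It is illustrative only; the repair applied when fewer than two thresholds exceed the cutoff is simplified, and the names are ours.

```python
import numpy as np

def active_centers(thresholds, centers, th_cutoff):
    """Return only the cluster centers whose selection threshold exceeds the cutoff.
    If fewer than two are active, two thresholds are re-initialized above the cutoff
    (simplified in-place repair)."""
    active = thresholds > th_cutoff
    if active.sum() < 2:
        idx = np.random.choice(len(thresholds), size=2, replace=False)
        thresholds[idx] = th_cutoff + np.random.rand(2)
        active = thresholds > th_cutoff
    return centers[active]

# Example 2: thresholds (0.3, 0.7, 0.4, 0.8) with cutoff 0.6 activate centers 2 and 4
th = np.array([0.3, 0.7, 0.4, 0.8])
centers = np.array([[4.9, 3.2, 1.6], [5.7, 4.4, 1.0], [6.9, 3.0, 4.9], [7.7, 3.1, 2.4]])
print(active_centers(th, centers, 0.6))
```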

3.2.4 Cluster Center Validation

There is a possibility that the number of data points assigned to a cluster center is less than two. This may be attributed to the fact that the selected cluster center(s) lie outside the boundary of the distribution of data points. To eliminate this problem, the cluster center positions of that harmony vector are reinitialized by the average computation method [13].

3.2.5 Fitness Function

A specified clustering criterion function is optimized for the partitioning of a dataset. A large number of clustering criterion functions have been reported in the literature [18]. Most of these criteria are based on the within-cluster and between-cluster scatter matrices. In this article, we have used trace($S_W^{-1}S_B$), which considers both within- and between-cluster scatter. The within-cluster scatter matrix ($S_W$) measures how scattered the data points are around their cluster centers. It is mathematically defined as:

(10) $S_W = \sum_{j=1}^{K}\sum_{i=1}^{n} \nu_{ij}\,(x_i - C_j)(x_i - C_j)^T$

where νij is the partition matrix, with νij=1 if xi belongs to cluster j, and 0 otherwise.

The between-cluster scatter matrix ($S_B$) measures how scattered the cluster centers are around the mean of the whole dataset. It is mathematically defined as:

(11) $S_B = \sum_{i=1}^{K} n_i\,(C_i - \bar{x})(C_i - \bar{x})^T$

In trace($S_W^{-1}S_B$), $S_B$ is normalized by $S_W$. A larger value of this clustering criterion indicates a higher-quality clustering solution. However, this criterion is biased toward increasing the number of clusters. To counteract this tendency, we introduce a new penalty term, and the refined fitness function is defined as:

(12) $\text{fitness} = \text{trace}(S_W^{-1}S_B) \times \frac{K_{max} - K}{K - 1}$
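The following Python sketch shows one way this fitness could be computed, assuming the reconstructions of Eqs. (10)-(12) above. The small ridge term that keeps $S_W$ invertible is our addition, not part of the paper.

```python
import numpy as np

def fitness(X, centers, labels, k_max):
    """trace(Sw^-1 Sb) weighted by the penalty term as reconstructed in Eq. (12)."""
    d = X.shape[1]
    x_bar = X.mean(axis=0)
    sw = np.zeros((d, d))
    sb = np.zeros((d, d))
    k = len(centers)
    for j, c in enumerate(centers):
        members = X[labels == j]
        diff = members - c
        sw += diff.T @ diff                      # Eq. (10)
        cb = (c - x_bar).reshape(-1, 1)
        sb += len(members) * (cb @ cb.T)         # Eq. (11)
    sw += 1e-9 * np.eye(d)                       # ridge term (our addition) to keep Sw invertible
    crit = np.trace(np.linalg.inv(sw) @ sb)
    return crit * (k_max - k) / (k - 1)          # penalty for large K, as in Eq. (12)
```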

4 Experimental Results and Discussions

In this section, the performance of the ACPAHS algorithm is compared with three well-established automatic partitional clustering algorithms, namely ACDE, GCUK, and DCPSO, using five benchmark datasets of varying levels of complexity.

4.1 The Datasets Used

In order to validate the performance of ACPAHS, we have carried out experiments on five real-life datasets: Iris, Wine, Glass, Breast Cancer, and Vowel. These datasets were obtained from the UCI machine learning repository [8]. A description of the datasets is given in Table 1.

Table 1

Datasets Used.

Dataset name     Instances   Features   Classes
Iris             150         4          3
Wine             178         13         3
Glass            214         9          6
Breast cancer    683         9          2
Vowel            871         3          6

4.2 Parameter Setting

The experiments were run for different values of the parameters used in ACPAHS, i.e. the harmony memory size, the maximum number of improvisation steps, and the ranges of HMCR, BW, and PAR. Thereafter, these parameters were fixed as follows: the size of the harmony memory is set to 30; the ranges of HMCR and PAR are set to [0.7, 0.99] and [0.01, 0.99], respectively; BW is assigned the range [0.001, 0.01]; and the maximum number of improvisation steps is 200. ACPAHS is run 40 times for a fair comparison.

4.3 Results and Discussions

The performance of ACPAHS has been compared with ACDE [13], DCPSO [29], GCUK [6], and classical DE [13]. The results have been computed in terms of mean and standard deviation over 40 independent runs in each case.

Table 2 shows the optimal number of clusters obtained through the above-mentioned clustering techniques. For the Iris dataset, ACPAHS produces three clusters in most of the runs. For the Wine dataset, ACPAHS, ACDE, and DCPSO produce three clusters. For the Glass dataset, ACPAHS and ACDE provide six clusters in almost every run. For the Breast Cancer dataset, ACDE and GCUK produce two clusters in every run, and ACPAHS also produces two clusters in most of the runs. For the Vowel dataset, ACPAHS generates six clusters in almost every run, whereas DCPSO, GCUK, and classical DE are not able to produce the optimal number of clusters.

Table 2

Mean and Standard Deviation of 40 Runs of Numbers of Clusters Obtained Over Five Real-Life Datasets.

Dataset   ACPAHS        ACDE          DCPSO         GCUK          DE
Iris      3.07±0.449    3.25±0.038    2.23±0.044    2.35±0.098    2.50±0.047
Wine      3.03±0.379    3.25±0.039    3.05±0.035    2.95±0.011    3.50±0.001
Glass     6.03±0.428    6.05±0.015    5.95±0.075    5.85±0.035    5.60±0.075
Cancer    2.10±0.262    2.00±0.000    2.25±0.063    2.00±0.008    2.25±0.026
Vowel     5.83±0.390    5.75±0.075    7.25±0.018    5.05±0.007    7.50±0.056

In addition to the number of clusters, the proposed approach was also compared in terms of the inter-intra cluster ratio and the number of fitness function evaluations. The former measures the separation and compactness of the clusters; the latter measures the computational effort involved in determining the number of clusters. Figure 3A shows the ratio of inter-cluster to intra-cluster distance obtained by the clustering techniques on the five real-life datasets. The results reveal that ACPAHS produces compact and well-separated clusters compared to the other existing clustering techniques. Figure 3B shows the number of fitness function evaluations for the above-mentioned clustering techniques. The numbers of fitness function evaluations for ACDE, DCPSO, GCUK, and classical DE are quoted from [13]. The results illustrate that ACPAHS is faster than the other clustering techniques. Besides the fitness function evaluations, the execution time was also computed for all the datasets.

Figure 3: Performance Comparison of ACPAHS Over Other Well-known Clustering Techniques in Terms of (A) Ratio of Inter- to Intra-cluster Distance; (B) Fitness Function Evaluations.

4.4 Statistical Evaluation

Here, we have performed statistical tests to establish the superiority of the proposed approach, ACPAHS. Unpaired t-tests were used to determine whether the improvements of the proposed approach are statistically significant. We have taken 40 as the sample size for the unpaired t-tests. Table 3 shows the results of the unpaired t-tests based on the numbers of clusters presented in Table 2. As can be seen from Table 3, ACPAHS is statistically significant compared to the other techniques for all the datasets except the Glass dataset.

Table 3

Unpaired t-Test Between the Best and the Second Best Performing Algorithms for Each Dataset Based on the Data Presented in Table 2.

Dataset   Standard error   t        95% Confidence interval      Two-tailed P   Significance
Iris      0.071            2.526    −0.32184 to −0.03816         0.0135         Statistically significant
Wine      0.041            2.453    0.01886 to 0.18114           0.0164         Statistically significant
Glass     0.068            0.295    −0.15481 to 0.11481          0.7685         Not significant
Cancer    0.041            2.414    0.01753 to 0.18247           0.0181         Statistically significant
Vowel     0.063            12.450   0.65527 to 0.90473           <0.0001        Extremely significant

5 Application to Image Segmentation

5.1 Image Segmentation as a Clustering Problem

Image segmentation is the process of partitioning an image into non-overlapping, meaningful, homogeneous regions. The problem of image segmentation is often posed as clustering in the intensity space. The automatic clustering technique has been applied to grayscale images for segmentation. Here, each pixel corresponds to a pattern, and an image region corresponds to a cluster. Image segmentation is mathematically defined as follows:

Let I be the set of all image pixels. By applying segmentation to I, different non-overlapping regions R1, R2, …, Rn are formed such that

(13) $\bigcup_{i=1}^{n} R_i = I$, where $R_i \cap R_j = \phi$ for $i \neq j$

Every pixel of the image must belong to one and only one segmented region. The proposed automatic clustering technique has been applied to grayscale images for segmentation.
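For illustration, the following Python sketch (using NumPy and Pillow; the file name, cluster centers, and helper function are assumptions, not outputs of the paper) shows how a grayscale image can be flattened into one-dimensional intensity data points for clustering and how the resulting labels map back to a segmented image.

```python
import numpy as np
from PIL import Image

# Load a grayscale image; each pixel intensity becomes a 1-D data point
img = np.array(Image.open("clouds.png").convert("L"), dtype=float)
X = img.reshape(-1, 1)                           # a 256x256 image -> 65,536 points

def segment(X, centers):
    """Assign each pixel to its nearest cluster center and return the labels."""
    dists = np.abs(X - centers.reshape(1, -1))   # 1-D intensities: |x - c|
    return np.argmin(dists, axis=1)

# Hypothetical cluster centers found by the clustering stage
centers = np.array([40.0, 110.0, 180.0, 240.0])
labels = segment(X, centers).reshape(img.shape)  # segmented region map
```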

5.2 Experimentation 1: Grayscale Image Segmentation

The developed technique has been applied to five well-known grayscale images: Clouds, Peppers, Science Magazine, Mumbai City, and Robot. The size of each image is 256×256. The intensity of each pixel serves as the feature for clustering. Therefore, the data points are one dimensional, and the number of data points is 65,536. The same parameter setting is used as mentioned in Section 4.2; however, Kmax is set to 25, as is commonly used. The results are compared with ACDE, DCPSO, GCUK, and classical DE. Figure 4 shows the original images and the segmented images obtained from ACPAHS. The results, in terms of means and standard deviations of the numbers of clusters computed over 40 runs, are tabulated in Table 4.

Figure 4: (A) The Original Cloud Image; (B) Segmented Image Obtained from ACPAHS (K=4); (C) The Original Peppers Image; (D) Segmented Image Obtained from ACPAHS (K=6); (E) The Original Science Magazine Image; (F) Segmented Image Obtained from ACPAHS (K=4); (G) The Original Mumbai City Image; (H) Segmented Image Obtained from ACPAHS (K=6); (I) The Original Robot Image; (J) Segmented Image Obtained from ACPAHS (K=4).

Table 4

Mean and Standard Deviation of 40 Runs of Automatic Clustering Using ACPAHS Over Five Grayscale Images.

Image      Optimal cluster range   ACPAHS       ACDE         DCPSO        GCUK         DE
Clouds     3–4                     3.67±0.75    4.15±0.21    4.50±0.13    4.75±0.43    3.00±0.00
Peppers    4–8                     6.03±0.51    7.05±0.04    6.85±0.06    3.90±0.45    8.50±0.07
Magazine   2–4                     3.73±0.71    4.05±0.77    3.25±0.08    6.35±0.09    3.50±0.06
Mumbai     3–6                     6.06±0.61    6.10±0.08    4.65±0.67    7.45±0.04    5.25±0.01
Robot      3–4                     3.80±0.42    4.25±0.43    2.30±0.01    3.35±0.98    3.00±0.01

For the Clouds image, ACPAHS produces four clusters, and the corresponding segmented image is shown in Figure 4B. DCPSO and GCUK provide five clusters, which is not the optimal number of clusters. For the Peppers, Magazine, Mumbai City, and Robot images, ACPAHS produces a number of clusters that falls in the optimal range in almost every run, and the corresponding segmented images are shown in Figure 4D, F, H, and J, respectively. The results reveal that ACPAHS is able to detect the number of classes in grayscale images efficiently.

Table 5 shows the results of the unpaired t-tests between the two best-performing algorithms based on the numbers of clusters in Table 4. The results reveal that ACPAHS is statistically significant compared to the other techniques for image segmentation.

Table 5

Unpaired t-Test Between the Best and the Second Best Performing Algorithms for Each Image Based on the Data Presented in Table 4.

Image      Standard error   t        95% Confidence interval    Two-tailed P   Significance
Clouds     0.123            3.898    −0.7252 to −0.2348         0.0002         Extremely significant
Peppers    0.081            10.099   −0.9816 to −0.6584         <0.0001        Extremely significant
Magazine   0.113            4.249    0.2551 to 0.7049           <0.0001        Extremely significant
Mumbai     0.143            9.842    1.1248 to 1.6952           <0.0001        Extremely significant
Robot      0.169            2.669    0.1144 to 0.7856           0.0092         Statistically significant

5.3 Experimentation 2: Color Image Segmentation

ACPAHS has also been applied to four well-known color images: Lena, Mandrill, Jet, and Peppers. The size of each image is 256×256; hence, the number of data points for each image is 65,536. DCPSO is used for comparison. The parameter settings of the proposed and compared algorithms are the same as those used for the grayscale images. Figure 5 shows the original images and the labeled images formed from the cluster labels generated by ACPAHS.

Figure 5: (A) The Original Lena Image; (B) Labeled Image Obtained from ACPAHS (K=6); (C) The Original Mandril Image; (D) Labeled Image Obtained from ACPAHS (K=7); (E) The Original Jet Image; (F) Labeled Image Obtained from ACPAHS (K=5); (G) The Original Peppers Image; (H) Labeled Image Obtained from ACPAHS (K=6).

The optimal numbers of clusters for the above-mentioned color images are given in [22, 37]. The numbers of clusters obtained from ACPAHS are depicted in Table 6. ACPAHS generates six and seven clusters for the Lena and Mandrill images, respectively. For the Jet image, ACPAHS produces five clusters, and for the Peppers image, six clusters. The results show that the numbers of clusters generated by both ACPAHS and DCPSO are in the optimal range. For the statistical significance of ACPAHS, the unpaired t-test is used. Table 7 shows the results of the unpaired t-test between ACPAHS and DCPSO for the above-mentioned color images. The results reveal that ACPAHS is statistically significant over DCPSO for image segmentation.

Table 6

Mean and Standard Deviation of 20 Runs of ACPAHS Over Four Color Images.

Image      Optimal cluster range   ACPAHS       DCPSO
Lena       5–10                    5.78±0.44    6.85±0.48
Mandrill   5–10                    7.00±0.00    6.25±0.43
Jet        5–7                     5.00±0.00    5.30±0.46
Peppers    6–10                    6.00±0.20    6.00±0.00
Table 7

Unpaired t-Test Between the ACPAHS and DCPSO Performing Algorithms for Each Image Based on the Data Presented in Table 6.

Image      Standard error   t       95% Confidence interval    Two-tailed P   Significance
Lena       0.416            7.348   −1.3648 to −0.7752         <0.0001        Extremely significant
Mandrill   0.096            7.800   0.5554 to 0.9446           <0.0001        Extremely significant
Jet        0.103            2.917   −0.5082 to −0.0918         0.0059         Statistically significant
Peppers    0.045            0.000   −0.091 to 0.091            1.000          Not significant

6 Conclusions

In this paper, the search ability of PAHS is utilized for automatically evolving the number of clusters and the cluster centers. ACPAHS utilizes the newly developed threshold setting and cutoff value, which enables the proposed approach to determine any type of cluster. In ACPAHS, the assignment of data points to clusters is made on the basis of the newly developed weighted Euclidean distance instead of the Euclidean distance. ACPAHS does not require any prior specification of the number of clusters. The effectiveness of ACPAHS is also revealed in segmenting grayscale as well as color images. Experimental results show that ACPAHS is not only able to find the number of clusters automatically, but its clustering results are also better than those of the other four well-established clustering techniques. The clusters produced by ACPAHS are compact and well separated.


Corresponding author: Vijay Kumar, Thapar University, Patiala, Punjab, India, e-mail:

About the authors

Vijay Kumar

Vijay Kumar received his B.Tech. from the M.M. Engineering College, Mullana. He received his M.Tech. from the Guru Jambheshwer University of Science and Technology, Hisar. He received his PhD from the National Institute of Technology, Kurukshetra. He has been an Assistant Professor at the Department of Computer Science and Engineering, Thapar University, Patiala, Punjab. He has more than 8 years of teaching and research experience. He has more than 35 research papers in international journals, book chapters, and conference proceedings. He is on the panel of reviewers of Elsevier and Springer journals. His main research focuses on soft computing, image processing, data clustering, and multiobjective optimization.

Jitender Kumar Chhabra

Jitender Kumar Chhabra received his B.Tech. and M.Tech. from the National Institute of Technology (formerly REC), Kurukshetra. He received his PhD from the GGS Indraprastha University, Delhi. He is currently working as a Professor at the National Institute of Technology, Kurukshetra, Haryana, India. He has more than 25 years of teaching and research experience. He has more than 85 research papers in international journals, book chapters, and conference proceedings. He is the author of three books from McGraw Hill including one in the Schaum Series International book from MC Graw Hill on “Programming With C”. He has delivered more than 20 expert talks and chaired many technical sessions in many national and international conferences of repute including of IEEE in the USA. He has visited many countries and presented his research work in USA, UK, Spain, France, Turkey, and Thailand. He is a reviewer of IEEE, Elsevier, Springer, and Wiley journals.

Dinesh Kumar

Dinesh Kumar received his B.Tech. and M.Tech. degrees from the National Institute of Technology (formerly REC), Kurukshetra. He received his PhD from the GGS Indraprastha University, Delhi. He is currently working as a Professor at Guru Jambheshwer University of Science and Technology, Hisar, Haryana, India. He has more than 21 years of teaching and research experience. He has more than 60 research papers in international journals, book chapters, and conference proceedings. He has delivered expert talks in workshops and refresher courses and chaired technical sessions in national and international conferences. He is on the panel of reviewers of IEEE, Elsevier, and Springer journals. His areas of interest are image processing, pattern recognition, soft computing, and data mining.

Bibliography

[1] R. S. Abdule-Wahab, N. Monmarché, M. Slimane, M. A. Fahdil and H. H. Saleh, A scatter search algorithm for the automatic clustering problem, in: Proceedings of the Industrial Conference on Data Mining, pp. 350–364, 2006. doi: 10.1007/11790853_28

[2] M. A. Al-Betar, A. T. Khader and T. A. Gani, A harmony search algorithm for university course timetabling, in: Proceedings of the 7th International Conference on the Practice and Theory of Automated Timetabling, Montreal, Canada, 2008.

[3] M. A. Al-Betar, A. T. Khader and I. Liao, A harmony search with multi-pitch adjusting rate for university course timetabling, in: Recent Advances in Harmony Search Algorithm, Z. Geem, ed., vol. 270, pp. 147–161, Springer, 2010. doi: 10.1007/978-3-642-04317-8_13

[4] O. M. Alia and R. Mandava, The variants of the harmony search algorithm: an overview, Artif. Intell. Rev. 36 (2011), 49–68. doi: 10.1007/s10462-010-9201-y

[5] B. Amiri, L. Hossain and S. E. Mosavi, Applications of harmony search algorithm on clustering, in: Proceedings of the World Congress on Engineering and Computer Science, pp. 460–465, 2010.

[6] S. Bandyopadhyay and U. Maulik, Genetic clustering for automatic evolution of clusters and application to image segmentation, Pattern Recognit. 35 (2002), 1197–1208. doi: 10.1016/S0031-3203(01)00108-X

[7] S. Bandyopadhyay, U. Maulik and M. K. Pakhira, Clustering using simulated annealing with probabilistic redistribution, Int. J. Pattern Recognit. Artif. Intell. 15 (2001), 269–285. doi: 10.1142/S0218001401000927

[8] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning (1998), http://www.ics.uci.edu/_mlearn/databases/.

[9] Z. Cai and W. Gong, A point symmetry-based clustering approach using differential evolution, J. Inf. Comput. Sci. 8 (2011), 1593–1608.

[10] S. Das and A. Konar, Automatic image pixel clustering with an improved differential evolution, Appl. Soft Comput. 9 (2009), 226–236. doi: 10.1016/j.asoc.2007.12.008

[11] S. Das and S. Sil, Kernel-induced fuzzy clustering of image pixels with an improved differential evolution algorithm, Inf. Sci. 180 (2010), 1237–1256. doi: 10.1016/j.ins.2009.11.041

[12] S. Das, A. Abraham and A. Konar, Automatic kernel clustering with a multi-elitist particle swarm optimization algorithm, Pattern Recognit. Lett. 29 (2008), 688–699. doi: 10.1016/j.patrec.2007.12.002

[13] S. Das, A. Abraham and A. Konar, Automatic clustering using an improved differential evolution algorithm, IEEE Trans. Syst. Man Cybern. A Syst. Hum. 38 (2008), 218–237. doi: 10.1109/TSMCA.2007.909595

[14] S. Das, A. Chowdhury and A. Abraham, A bacterial evolutionary algorithm for automatic data clustering, in: Proceedings of the IEEE Congress on Evolutionary Computation, pp. 2403–2410, Trondheim, Norway, 2009.

[15] Z. W. Geem, Harmony search algorithm for solving sudoku, in: Knowledge-Based Intelligent Information and Engineering Systems, B. Apolloni, R. J. Howlett, L. Jain, eds., vol. 4692, pp. 371–378, Springer, 2007. doi: 10.1007/978-3-540-74819-9_46

[16] Z. W. Geem, J. H. Kim and G. V. Loganathan, A new heuristic optimization algorithm: harmony search, Simulation 76 (2001), 60–68. doi: 10.1177/003754970107600201

[17] Z. W. Geem, K. S. Lee and Y. Park, Application of harmony search to vehicle routing, Am. J. Appl. Sci. 2 (2005), 1552–1557. doi: 10.3844/ajassp.2005.1552.1557

[18] A. K. Jain, M. N. Murty and P. J. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (1999), 264–323. doi: 10.1145/331499.331504

[19] A. K. Jain, R. P. W. Duin and J. Mao, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000), 4–37. doi: 10.1109/34.824819

[20] B. Jarboui, M. Cheikh, P. Sarry and A. Rebai, Combinatorial particle swarm optimization (CPSO) for partitional clustering problem, Appl. Math. Comput. 192 (2007), 337–345. doi: 10.1016/j.amc.2007.03.010

[21] V. Kumar, J. K. Chhabra and D. Kumar, Parameter adaptive harmony search algorithm for unimodal and multimodal optimization problems, J. Comput. Sci. 5 (2014), 144–155. doi: 10.1016/j.jocs.2013.12.001

[22] V. Kumar, J. K. Chhabra and D. Kumar, Automatic cluster evolution using gravitational search algorithm and its application to image segmentation, Eng. Appl. Artif. Intell. 29 (2014), 93–103. doi: 10.1016/j.engappai.2013.11.008

[23] C. Y. Lee and E. K. Antonsson, Dynamic partitional clustering using evolutionary strategies, in: Proceedings of the Asia–Pacific Conference on Simulated Evolution and Learning, IEEE Press, Nagoya, Japan, 2000.

[24] W. P. Lee and S. W. Chen, Automatic clustering with differential evolution using cluster number oscillation method, in: International Workshop on Intelligent Systems and Applications, pp. 1–4, Wuhan, 2010. doi: 10.1109/IWISA.2010.5473289

[25] K. S. Lee and Z. W. Geem, A new structural optimization method based on the harmony search algorithm, Comput. Struct. 82 (2004), 781–798. doi: 10.1016/j.compstruc.2004.01.002

[26] D. Manjarres, I. Landa-Torres, S. Gil-Lopez, J. Del Ser, M. N. Bilbao, S. Salcedo-Sanz and Z. W. Geem, A survey on applications of the harmony search algorithm, Eng. Appl. Artif. Intell. 26 (2013), 1818–1831. doi: 10.1016/j.engappai.2013.05.008

[27] H. Masoud, S. Jalili and S. M. H. Hasheminejad, Dynamic clustering using combinatorial particle swarm optimization, Appl. Intell. 38 (2013), 289–314. doi: 10.1007/s10489-012-0373-9

[28] T. Niknam and B. Amiri, An efficient hybrid approach based on PSO, ACO and K-means for cluster analysis, Appl. Soft Comput. 10 (2010), 183–197. doi: 10.1016/j.asoc.2009.07.001

[29] M. G. H. Omran, A. P. Engelbrecht and A. Salman, Dynamic clustering using particle swarm optimization with application in image segmentation, Pattern Anal. Appl. 8 (2006), 332–344. doi: 10.1007/s10044-005-0015-5

[30] S. M. Pan and K. S. Cheng, Evolution-based tabu search approach to automatic clustering, IEEE Trans. Syst. Man Cybern. C Appl. Rev. 37 (2007), 817–838. doi: 10.1109/TSMCC.2007.900666

[31] S. Quadfel, M. Batouche and A. Taleb-Ahmed, A modified particle swarm optimization algorithm for automatic image clustering, in: Proceedings of the IEEE International Conference on Digital Information Management, pp. 546–551, 2010. doi: 10.1109/ICDIM.2010.5664657

[32] S. Saha and S. Bandyopadhyay, A symmetry based multiobjective clustering technique for automatic evolution of clusters, Pattern Recognit. 43 (2010), 738–751. doi: 10.1016/j.patcog.2009.07.004

[33] S. Saha and U. Maulik, A new line symmetry distance based automatic clustering technique: application to image segmentation, Imaging Syst. Technol. 21 (2011), 86–100. doi: 10.1002/ima.20243

[34] S. Supratid and H. Kim, Modified fuzzy ants clustering approach, Appl. Intell. 31 (2009), 122–134. doi: 10.1007/s10489-008-0117-z

[35] C.-M. Wang and Y.-F. Huang, Self-adaptive harmony search algorithm for optimization, Exp. Syst. Appl. 37 (2010), 2826–2837. doi: 10.1016/j.eswa.2009.09.008

[36] R. Xu and D. C. Wunsch II, Clustering, John Wiley and Sons, USA, 2009. doi: 10.1002/9780470382776

[37] F. Ye and C. Chen, Alternative KPSO-clustering algorithm, J. Sci. Eng. 8 (2005), 165–174.

Received: 2015-1-8
Published Online: 2015-10-13
Published in Print: 2016-10-1

©2016 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
