Abstract

The clustering of mixed-attribute data is a vital and challenging issue. The density peaks clustering algorithm offers a simple and efficient solution, but it mainly focuses on numerical-attribute data and is not adaptive. In this paper, we studied adaptive improvements of this algorithm and proposed an adaptive mixed-attribute data clustering method based on density peaks, called AMDPC. In this algorithm, we used a unified distance metric for mixed-attribute data to construct the distance matrix, calculated the local density based on K-nearest neighbors, and proposed an automatic cluster-center determination method based on three inflection points. Experimental results on real University of California-Irvine (UCI) datasets showed that the proposed AMDPC algorithm realized adaptive clustering of mixed-attribute data, automatically obtained the correct number of clusters, and improved the clustering accuracy over the traditional K-prototype algorithm on all datasets, by 22.58%, 24.25%, 28.03%, 22.5%, and 10.12% for the Heart, Cleveland, Credit, Acute, and Adult datasets, respectively. It also outperformed a modified density peaks clustering algorithm for mixed-attribute data (DPC_M).

1. Introduction

Clustering analysis has been widely used in statistics, machine learning, pattern recognition, and image processing, for example, image inpainting [1, 2] and image super-resolution reconstruction [3]. Mixed-attribute data clustering is one of the research hotspots in data mining. There are many solutions to mixed-attribute data clustering, including attribute-conversion methods, clustering-ensemble methods, prototype-based methods, hierarchical clustering methods, and density clustering methods [4]. The K-prototypes algorithm proposed by Huang [5] and the iterative clustering learning based on an object-cluster similarity metric (OCIL) algorithm proposed by Cheung and Jia [6] are both typical prototype-based methods. The similarity-based agglomerative clustering (SBAC) algorithm proposed by Li and Biswas [7] is a well-known agglomerative hierarchical clustering method. Density clustering algorithms include the relative density-based clustering algorithm for mixture datasets (RDBC_M) proposed by Huang and Li [8] and the density-based clustering algorithm for mixed data with mixed distance measure methods (MDCDen) proposed by Chen and He [9]. However, these state-of-the-art methods require user intervention and parameter tuning, so they cannot realize adaptive clustering.

The density peaks clustering (DPC) algorithm proposed by Rodriguez and Laio [10] has attracted a great deal of attention from researchers in recent years [11–14]. In this algorithm, a decision graph is constructed by calculating a local density ρi and a relative distance δi, and the number of clusters is determined by manually selecting the cluster centers in the decision graph. The remaining data points are then assigned to the cluster of the nearest higher-density neighbor. Theoretically, the algorithm can cluster data of arbitrary shape and type and automatically identify outliers. It is efficient and has only one parameter, the cutoff distance dc, which determines the local density calculation. The input of the DPC algorithm is the distance matrix between data points, so as long as the distance-measurement problem for mixed-attribute data is solved, the algorithm can be applied directly to cluster mixed-attribute data. Therefore, Liu et al. [15] defined a distance-measurement method for mixed attributes and improved the DPC algorithm to a modified DPC algorithm for mixed-attribute data (DPC_M), which was successfully applied to mixed-attribute data clustering. Du et al. [16] defined a distance-measurement method for mixed-attribute data points by referring to the similarity in the OCIL algorithm and used the DPC algorithm to perform clustering analysis on numerical-attribute, categorical-attribute, and mixed-attribute data. These two algorithms demonstrate the feasibility of the density peaks algorithm for clustering mixed-attribute data, but they are not adaptive and need manual intervention in the clustering process.

Adaptive algorithms are one of the most popular research fields [17–19]. There are also many studies on adaptive improvement of the DPC algorithm, but they mainly focus on the clustering of numerical-attribute datasets, which will be detailed in the next section. To realize the adaptive clustering of mixed-attribute data, we proposed an adaptive mixed-attribute data clustering method based on the DPC algorithm, called AMDPC. Experimental results showed that the proposed AMDPC algorithm had a better clustering effect, automatically determined the cluster number, and realized adaptive clustering of mixed-attribute datasets without any parameters.

For this paper, the main contributions are as follows:
(1) The distance-measurement method for mixed-attribute data is studied, and a unified distance-measurement method is used to construct the distance matrix between data points of mixed-attribute data, which solves the problem of using the DPC algorithm to cluster mixed-attribute data.
(2) The adaptive improvement of the DPC algorithm is studied, and a new method to determine the cluster centers is proposed. Because a cluster center is usually a data point with both a large local density and a large relative distance, after calculating γi = ρi×δi, the cluster centers are determined by calculating the inflection points of the sorted γi, ρi, and δi sequences.
(3) An adaptive local density calculation method based on K-nearest neighbors (KNN) is used to improve the robustness of the algorithm, without manually determining the cutoff distance dc or other parameters.

2.1. Density Peaks Clustering

The DPC algorithm is based on the following two assumptions: a cluster center has a higher local density and is surrounded by neighbor points with lower local density, and a cluster center is relatively far from other denser data points. Therefore, the DPC algorithm constructs a decision graph by calculating a local density ρi and a relative distance δi to find the cluster centers of a dataset. Each remaining data point in the dataset is then assigned to the cluster of its nearest neighbor with a higher local density.

Suppose that X = {X1, X2, …, Xn} is the dataset to be clustered, consisting of n data points, and let dij denote the distance between the data points Xi and Xj. A cutoff distance dc is defined in the DPC algorithm, and the local density ρi and the relative distance δi of each data point are defined as shown in equations (1) and (2):

ρi = Σj≠i χ(dij − dc), (1)

where χ(x) = 1 when x < 0 and χ(x) = 0 otherwise, and

δi = min{dij : ρj > ρi} if ρi is not the maximum local density; otherwise, δi = maxj dij. (2)

When the local density of Xi is not the maximum, the relative distance is the minimum value of the distance from this point to all points with higher density; otherwise, the relative distance is the maximum distance from this point to all other points.

When the dataset has few data points, the local density is generally calculated using a Gaussian kernel, as shown in

ρi = Σj≠i exp(−(dij/dc)²). (3)

Based on the local density ρi and relative distance δi for each data point, users can explicitly choose the cluster centroids on the decision graph. Once the center point is determined, each remaining data point can be classified into the same cluster as its nearest neighbor with a higher density.
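To make these decision-graph quantities concrete, the following is a minimal NumPy sketch of equations (1)–(3), assuming a precomputed, symmetric distance matrix with a zero diagonal; the function name and structure are ours and are not part of the original DPC code.

```python
import numpy as np

def dpc_rho_delta(dist, dc, gaussian=True):
    """Local density rho (equation (1) or (3)) and relative distance delta
    (equation (2)) from a symmetric distance matrix with zero diagonal."""
    n = dist.shape[0]
    if gaussian:
        # Gaussian kernel density, equation (3); subtract the self term exp(0) = 1
        rho = np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0
    else:
        # cutoff density, equation (1); subtract the self term (d_ii < dc)
        rho = (dist < dc).sum(axis=1) - 1.0
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size:
            delta[i] = dist[i, higher].min()   # nearest point of higher density
        else:
            delta[i] = dist[i].max()           # densest point: maximum distance
    return rho, delta
```

The cluster centers are then the points that are simultaneously large in both rho and delta on the decision graph.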

2.2. Adaptive Improvement of the DPC Algorithm

The original DPC algorithm has some drawbacks: the cutoff distance dc and cluster centroids selected manually have great influence on the clustering results, and the original method of local density calculation is not effective for data with different density clusters or different shapes. At the same time, the original sample allocation strategy will create a domino effect. Once a sample is misallocated, it will lead to a series of sample allocation errors, resulting in incorrect clustering results and a reduction in the reliability of the clustering results [20, 21]. Therefore, many adaptive improvement methods of the original DPC clustering algorithm have been proposed. The research on the adaptive improvement of the density peaks algorithm mainly focuses on the automatic determination of the cutoff distance dc, the calculation of adaptive local density, the design of adaptive distance measurement, and the automatic determination of cluster number (the selection of cluster centroids). Most of these studies are focused on the clustering of numerical attribute data.

2.2.1. Adaptive Improvement for Selection of Cutoff Distance and Local Density Calculation

Wang et al. [22] proposed a method to automatically extract the optimal value of the threshold for different kernel functions and different datasets from the original dataset by using the potential entropy of the data field. According to the characteristics of the dataset, Jiang et al. [23] used the change of the nearest-neighbor distance curve to automatically determine the density threshold dc and used this method to guide the merging of clusters after the first clustering with DPC; this was done to solve the problem that the DPC algorithm divides a cluster into multiple clusters when there are two or more density peaks within it. Sun et al. [24] proposed an ADPC method with a Fisher linear discriminant, in which the Pearson correlation coefficient is first introduced as a weight and a kernel-density-estimation function based on the weighted Euclidean distance is then used to calculate the local density between the samples. Lotfi et al. [25] proposed a dynamic density peaks clustering method based on a density backbone and fuzzy neighborhood, called DPC-DBFN, in which a fuzzy kernel is used to compute the local densities of the data points. Parmar et al. adopted residual-error computation to measure the local density within a neighborhood region and proposed the residual-error-based density peak clustering algorithms REDPC [26, 27] and FREDPC [28].

Du et al. [29] proposed a DPC-KNN algorithm that introduced K-nearest-neighbor information into the local density calculation. In addition, they also proposed an improved principal component analysis- (PCA-) based algorithm named DPC-KNN-PCA for high-dimensional data clustering. Juanying et al. proposed the KNN-DPC [20] and FKNN-DPC [21] algorithms, which use a uniform local density metric based on KNNs or fuzzy KNNs, respectively, together with two new strategies for assigning the remaining points to their most likely clusters. Yaohui et al. [30] proposed an adaptive DPC algorithm (named ADPC-KNN), which introduced the idea of KNNs to calculate the global parameter dc and the local density ρi of each point, applied a new approach to automatically select the initial cluster centers, and finally aggregated the clusters if they were density-reachable. Shi et al. [31] presented the adaptive clustering algorithm based on KNN and density (ACND), which first determines the KNN of every data point and then redefines the similarity between pairs of points with shared nearest neighbors. It does not force the user to define parameter values, recognizes the core points and constructs clusters around them, and then attempts to detect the clustering boundary. It makes full use of KNN information, has low computational complexity, and can deal with different shapes as well as different data sizes with noise and outliers. Xu et al. [32] proposed extended adaptive density peaks clustering (EDAP) for overlapping community detection, in which the local density is calculated based on KNN. Jiang et al. [33] proposed a method called G-KNN-DPC to calculate the cutoff distance based on the Gini coefficient and KNN. Sun and Liu [34] proposed a new density formula combining the idea of gravitation with KNN that makes the local densities of sample points in dense and sparse areas more clearly separable. Fan et al. [35] proposed a new DPC algorithm by incorporating an improved mutual K-nearest-neighbor graph (Mk-NNG) into DPC.

In general, KNN is used for local density calculation in most of these improved algorithms. Letting d(Xi, Xj) be the Euclidean distance between the ith and jth data points in the dataset X = {X1, X2, …, Xn}, the local density defined by DPC-KNN is expressed by

ρi = exp(−(1/k) Σ_{Xj ∈ KNN(Xi)} d(Xi, Xj)²), (4)

and the local density defined in the KNN-DPC and FKNN-DPC algorithms is expressed as

ρi = Σ_{Xj ∈ KNN(Xi)} exp(−d(Xi, Xj)), (5)

where KNN(Xi) represents the set of K-nearest neighbors of data point Xi. In general, k takes a fixed value of 5 or 6 or is calculated as a percentage of the data points in the dataset; in most cases, k = ⌈p·N/100⌉, where the percentage p = 2, N is the total number of data points in the dataset, and ⌈·⌉ is the ceiling function. Most of these algorithms are based on equations (4) and (5) or variants thereof.
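As an illustration of equations (4) and (5), the sketch below computes both KNN-based densities from a distance matrix; it assumes a zero diagonal so that each point's nearest "neighbor" (itself) can be skipped, and the function name is ours.

```python
import numpy as np

def knn_local_densities(dist, k):
    """KNN-based local densities from a distance matrix with zero diagonal:
    the first return value follows equation (4) (DPC-KNN style), the second
    follows equation (5) (KNN-DPC/FKNN-DPC style)."""
    # k nearest neighbors of each point, skipping the point itself (column 0)
    knn_idx = np.argsort(dist, axis=1)[:, 1:k + 1]
    knn_dist = np.take_along_axis(dist, knn_idx, axis=1)   # shape (n, k)
    rho_eq4 = np.exp(-np.mean(knn_dist ** 2, axis=1))      # equation (4)
    rho_eq5 = np.sum(np.exp(-knn_dist), axis=1)            # equation (5)
    return rho_eq4, rho_eq5
```

For the percentage rule mentioned above, k would be computed as k = int(np.ceil(p * len(dist) / 100)) with p = 2.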

2.2.2. Automatic Determination of Cluster Number

To solve the problem that the density peaks algorithm must manually select the cluster centers, Ma et al. [36] introduced the weight of the cluster center. First, the products γ (γi = ρi×δi) of the normalized relative distance δi and the local density ρi are calculated; then the inflection point of γ is used to determine the cluster centers of the dataset, avoiding the subjective differences in users' selection of the cluster centers. Zhao [37] proposed an improved LDPC algorithm combined with a linear fitting method, in which sparse and dense points are separated by linear fitting and a residual sequence C is obtained by taking the difference between the original values γs and the fitted values γr; the average residual of the first 20 points is selected as the threshold, and the data points with residuals greater than the threshold are the center points. Du et al. [38] proposed a parameter-adaptive clustering algorithm named DDPA-DP. A data-driven idea runs through the design of DDPA-DP: first, a series of fitted curves is established to automatically detect points' roles from their density attributes instead of any artificial thresholds; meanwhile, a new point role, the "pending point," is defined, and the local field's radius is adaptively optimized according to the change in the number of pending points. García-García and García-Ródenas [39] proposed an optimization-based methodology for automatic parameter/center selection that uses internal/external cluster validity indexes as the objective function.

Wang et al. [40] proposed an efficient hierarchical clustering algorithm based on density peaks, which uses the step characteristics of the parameter γ to distinguish different levels of clustering and then constructs a hierarchical clustering tree from the intermediate result of DPC (NNeigh, a DPC array) to complete efficient hierarchical clustering and determine the cluster number automatically. Zhang and Li [41] extended the traditional DPC algorithm by using the CHAMELEON hierarchical clustering algorithm: the DPC algorithm was used for the initial clustering, and the hierarchical clustering algorithm was then used to merge the subclasses, which improved the clustering results. Bie et al. [42] proposed a fuzzy DPC algorithm called Fuzzy-CFSFDP that uses fuzzy rules to find all density peaks, treats each peak as a local cluster, and then merges close local clusters into global clusters to obtain the final clustering. Ding et al. [43] proposed an improved density peaks clustering based on a natural neighbor expanded group (DPC-NNEG): they first define the natural neighbor expanded (NNE) set and the natural neighbor expanded group (NNEG) and then divide all NNEGs into a target number of sets as the final clustering result according to the degree of closeness of the NNEGs. To describe the cluster centers more comprehensively, Diao et al. [44] redefined the local density and relative distance by fusing the distance attributes of two neighbor relationships (KNN and shared nearest neighbors, SNN); this method can detect low-density cluster centers. Mehrmohammadi et al. [45] proposed a better method for selecting centers based on the mutual KNN graph and the shortest path. Fang et al. [46] proposed adaptive core fusion-based density peak clustering (CFDPC) to detect clusters of any shape and density adaptively: an initial clustering based on the automatic finding of density peaks is performed first, an adaptive search approach is then used to find the core points, and a core fusion strategy based on within-cluster similarity is used to obtain the final clustering results.

In summary, the main ideas for the automatic determination of the cluster number fall into two directions. The first is to determine the cluster centers by taking larger values of γ, ρ, and δ, for example, by finding the inflection point of γ or by using curve fitting and residual analysis. The second is to adopt the idea of hierarchical clustering, initially selecting more cluster centers and then merging the close local clusters.

Notably, γ, ρ, and δ are discrete sequences, and to calculate the inflection point of a discrete sequence, Ma et al. [36] used the slope of the line segment between two points. The calculation formula is expressed as

K_i^m = (y_{i+m} − y_i)/m, (6)

where K_i^m represents the average change rate of the discrete sequence y in the interval [i, i + m], namely, the slope of y over the interval [i, i + m].

Based on the slope calculation, the inflection point is defined as

i* = arg max_i |K_i^1 / K_1^{i−1}|, (7)

where K_i^1 is the slope from the ith point to the (i + 1)th point, and K_1^{i−1} is the slope from the first point to the ith point, that is, the average change rate of the discrete sequence y in the interval [1, i]. In this case, the inflection point is the critical point with the fastest slope change.
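The following sketch implements the inflection-point rule as reconstructed in equations (6) and (7); the exact formulation in [36] may differ in detail, so this should be read as an illustration rather than a faithful reimplementation.

```python
import numpy as np

def inflection_index(y):
    """Inflection index of a descending-sorted sequence y, following the
    slope-ratio rule of equations (6)-(7) as reconstructed above: the point
    where the one-step slope is largest relative to the average slope
    accumulated from the first point."""
    y = np.asarray(y, dtype=float)
    best_i, best_ratio = 1, -np.inf
    for i in range(1, len(y) - 1):
        k_local = y[i + 1] - y[i]          # slope from point i to point i + 1
        k_avg = (y[i] - y[0]) / i          # average slope over the interval [1, i]
        if k_avg == 0.0:
            continue
        ratio = abs(k_local / k_avg)
        if ratio > best_ratio:
            best_ratio, best_i = ratio, i
    return best_i                           # 0-based position of the inflection point
```

The candidate points are then the entries of the sorted sequence that precede this index, that is, those whose value is larger than that of the inflection point.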

3. Adaptive Mixed-Attribute-Data Density Peaks Clustering

3.1. Definition of Unified Distance Metric of Mixed-Attribute Data

Suppose that DS = {X1, X2, …, Xn} is a mixed dataset with d dimensions and n instances, which contains dr-dimensional numerical attributes and dc-dimensional (dc = d − dr) categorical attributes. For two instances Xi and Xj in the dataset, their distance D(Xi, Xj) is defined as the sum of the numerical-attribute distance and the categorical-attribute distance:

D(Xi, Xj) = Dr(Xi, Xj) + Dc(Xi, Xj). (8)

Equations (9) and (10) illustrate the distance computation of the numerical attributes, Dr(Xi, Xj), and that of the categorical attributes, Dc(Xi, Xj), respectively:

Dr(Xi, Xj) = sqrt((1/dr) Σ_{t=1}^{dr} (xit − xjt)²), (9)

Dc(Xi, Xj) = Σ_{t=1}^{dc} wt·δ(xit, xjt), (10)

where Dr(Xi, Xj) denotes the normalized Euclidean distance of the numerical attributes of the data points Xi and Xj. Because the numerical attributes are normalized to [0, 1], the distance value of the numerical attributes is also ensured to lie in the interval [0, 1]. Regarding the distance of the categorical attributes, the matching method with the entropy weight wt is used. The matching distance of the data points in the tth categorical attribute is calculated by

δ(xit, xjt) = 0 if xit = xjt, and δ(xit, xjt) = 1 otherwise. (11)

The importance of a categorical attribute is quantified by the average entropy of its attribute values. The weight of each categorical attribute is then computed by

wt = Ht / Σ_{l=1}^{dc} Hl. (12)

Assume that the total number of distinct categorical values of the tth categorical attribute is nt and that the probability of occurrence of the sth (s = 1, 2, …, nt) value is ps. The entropy Ht can then be calculated using equation (13); it represents the average entropy of the values of the tth categorical attribute:

Ht = −(1/nt) Σ_{s=1}^{nt} ps·log(ps). (13)

Assume that the mixed-attribute weather dataset shown in Table 1 is given: the dataset DS = {X1, X2, X3, X4, X5} has five records X1–X5 and four attributes A1–A4. The four attributes represent weather, windy, temperature, and humidity; the first two attributes (weather and windy) are categorical, and the last two are numerical, so dr = 2 and dc = 2. Let us look at the calculation process of the unified distance metric.

First, the numerical attributes A3 and A4 must be normalized; the results are A3 = [1.0000, 0.9615, 0.9231, 0.6154, 0]T and A4 = [1, 0.5, 1, 1, 0]T. Then formulas (12) and (13) are used to calculate the entropy weights of the two categorical attributes A1 and A2. Finally, the distance between the first record and the other records can be calculated according to formula (8).
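As an illustration of the unified metric, the sketch below follows equations (8)–(13) as reconstructed above (the summation form of equation (8) and the normalization of the weights are our assumptions); cat holds the categorical columns, and the numerical columns are assumed to be already min-max normalized.

```python
import numpy as np

def entropy_weights(cat):
    """Entropy weight of each categorical attribute (equations (12)-(13) as
    reconstructed above); cat is an (n, d_c) array of category labels."""
    n, d_c = cat.shape
    h = np.zeros(d_c)
    for t in range(d_c):
        _, counts = np.unique(cat[:, t], return_counts=True)
        p = counts / n
        h[t] = -(p * np.log(p)).sum() / len(counts)   # average entropy of the values
    return h / h.sum()                                # normalized weights w_t

def mixed_distance(x_num, y_num, x_cat, y_cat, w):
    """Unified distance of equation (8) as reconstructed: normalized Euclidean
    distance of the (min-max normalized) numerical part plus the
    entropy-weighted matching distance of the categorical part."""
    d_r = np.sqrt(np.mean((x_num - y_num) ** 2))      # equation (9)
    d_c = np.sum(w * (x_cat != y_cat))                # equations (10)-(11)
    return d_r + d_c
```

With the normalized A3 and A4 above and the two categorical columns of Table 1, the distances D(X1, Xj) of the example can be obtained by calling mixed_distance row by row.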

3.2. Local Density Calculation Based on KNN

In a small dataset, the Gaussian kernel function is usually used to calculate the local density, which requires manually setting the cutoff distance parameter dc. As mentioned above, to calculate the local density adaptively, many studies have adopted KNN information. We adopt the idea of the DPC-KNN algorithm and use the KNN-based Gaussian kernel function to calculate the local density of each data point: using the KNN set of each data point, we compute the mean of the squared distances between the data point and its K-nearest neighbors, and equation (4) then gives its local density. With this calculation method, it is not necessary to set the cutoff distance parameter dc; only the nearest-neighbor number K must be determined, and the subsequent experiments show that K can be determined automatically from the number of data points in the dataset.

3.3. Automatically Determining the Cluster Number

To realize the automatic determination of the cluster number, the method of Ma et al. [36], which calculates only the inflection point of γ, is simple but not sufficiently accurate. In theory, the center points should have both a large local density ρi and a large relative distance δi, and a large product γi alone does not fully guarantee that both the local density and the relative distance are large.

From the sample dataset of the DPC algorithm and its decision graph [10] in Figure 1, points 1 and 10 in the upper right corner of the decision graph are cluster centers, for which both the local density and the relative distance are large. Points 26, 27, and 28, however, are treated as outliers, for which the relative distance is large but the local density is small. Therefore, we presented a three-inflection-point improvement method, which uses equations (4) and (2) to calculate the local density and relative distance of the data points. We then sorted the γ, ρ, and δ values of the data points in descending order, used equation (7) to calculate the inflection points of γ, ρ, and δ, and obtained three candidate sets Sγ, Sp, and Sd from the three inflection points, respectively. The candidate set Sγ contains the points whose γ values are larger than that of the inflection point of γ; similarly, the candidate set Sp contains the points whose ρ values are larger than that of the inflection point of ρ, and the candidate set Sd contains the points whose δ values are larger than that of the inflection point of δ. Then we calculated the intersection Sc = Sγ ∩ Sp ∩ Sd, and Sc is the set of cluster centers. Points whose relative distance is large but that are not cluster centers can be judged to be outliers, which are obtained by calculating So = Sd − Sc. Therefore, the improved method proposed in this paper can automatically identify the cluster centers and outlier points.

For example, the local density ρi, relative distance δi, and γi of some data points of a sample dataset are shown in Table 2. According to equation (7), we can calculate Sd = {1,2,3,4,5,6,8}, Sp = {1,2,3,4,5,6,7,10}, and Sγ = {1,2,3,4,5,6,7}, and then Sc = Sγ ∩ Sp ∩ Sd = {1,2,3,4,5,6}, which contains the cluster centers. At the same time, we obtain the outlier set So = Sd − Sc = {8}.

The three-inflection-point algorithm for determining the cluster centers is described in Algorithm 1.

   Input: rho, delta (represent local density vector ρ and relative distance vector δ)
   Output: Sc (set of cluster centers Sc)
(1)//Step 1. Calculate γi = ρi × δi.
(2)for i = 1 to length(rho) do
(3)  gamma(i) = rho(i) * delta(i)
(4)end
(5)//Step 2. Sort rho, delta, gamma (ρ, δ, γ) in descending order:
(6)Sorted_rho = sort (rho, “descend”);
(7)Sorted_delta = sort(delta, “descend”);
(8)Sorted_gamma = sort(gamma, “descend”);
(9)//Step 3. Calculate the inflection point of rho, delta, and gamma using equation (7) separately and construct the three candidate sets Sgamma, Sp, and Sd.
(10)Sp = calcinflection(Sorted_rho);
(11)Sd = calcinflection(Sorted_delta);
(12)Sgamma = calcinflection(Sorted_gamma);
(13)//Step 4. Calculate the intersections of the three sets and return the result Sc.
(14)Sc = intersection(Sp, Sd, Sgamma).
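A compact Python rendering of Algorithm 1 is sketched below; it reuses the inflection-point rule reconstructed in equation (7) (inlined here so the example is self-contained), and all function names are ours.

```python
import numpy as np

def _inflection(y):
    """Inflection index of a descending-sorted sequence (equation (7) as
    reconstructed above), inlined so the example is self-contained."""
    idx = [i for i in range(1, len(y) - 1) if y[i] != y[0]]
    ratios = [abs((y[i + 1] - y[i]) / ((y[i] - y[0]) / i)) for i in idx]
    return idx[int(np.argmax(ratios))] if ratios else 1

def _candidates(values):
    """Original indices of the points ranked above the inflection point of the
    descending-sorted sequence (Step 3 of Algorithm 1)."""
    order = np.argsort(values)[::-1]
    return set(order[:_inflection(values[order])].tolist())

def find_cluster_centers(rho, delta):
    """Three-inflection-point center selection (a sketch of Algorithm 1)."""
    gamma = rho * delta                                    # Step 1
    s_rho, s_delta, s_gamma = _candidates(rho), _candidates(delta), _candidates(gamma)
    centers = s_gamma & s_rho & s_delta                    # Step 4: Sc
    outliers = s_delta - centers                           # So: large delta, not a center
    return centers, outliers
```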
3.4. AMDPC Implementation

First, we used the unified distance measurement of mixed-attribute data to calculate the distance matrix of the mixed-attribute dataset according to equation (8). Then, we calculated the local density ρi of each data point using the KNN-based equation (4) and calculated the relative distance δi using equation (2); γi = ρi×δi was then computed, and the cluster centers were found using Algorithm 1. Finally, each remaining point was assigned by finding its nearest neighbor with a higher local density and setting its cluster label to that of this neighbor. The overall flow diagram of the AMDPC algorithm is shown in Figure 2.

The input of the algorithm is the mixed-attribute dataset (DS), and the output is the cluster label vector (CL). The detailed process of the AMDPC algorithm is given in Algorithm 2.

   Input: DS (the mixed-attribute dataset)
   Output: CL (cluster label vector)
(1)//Step 1. Load the dataset DS and separate it into the numerical subset Dr, the categorical subset Dc, and the true label subset.
(2)[Dr,Dc,truelabel] = loadseparate(DS);
(3)//Step 2. Calculate the distance and construct the distance matrix of the mixed-attribute dataset DS according to equation (8).
distmatrix = distamdpc(Dr,Dc);
(4)//Step 3. Calculate the local KNN density ρi of each data point according to equation (4) and calculate the relative distance δi according to equation (2).
(5)rho = kNNrho(distmatrix);
(6)delta = calcdelta(distmatrix);
(7)//Step 4. Run Algorithm 1 to obtain the cluster center points and set each point a different label.
(8)Sc = findClusterCenter(rho,delta);
(9)//Step 5. Assign the class label for center and non-center points using original DPC method according to the Sc.
(10)//Step 5.1. Initialize the class label vector CL.
(11)NCLUST = 0;
(12)for i = 1 to number of datapoints
(13)  CL(i) = −1;
(14)End
(15)//Step 5.2. Assign the class label for center points
(16)for j = 1 to sizeof(Sc)
(17)  NCLUST = NCLUST + 1;
(18)  CL(Sc(j)) = NCLUST;
(19)End
(20)//Step 5.3. Assign the class label for non-center points
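//Note: ordrho is the index array of the data points sorted in descending order of local density, and nneigh(i) is the nearest neighbor of point i with higher density (intermediate arrays produced during the DPC computation).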
(21)for k = 1 to number of datapoints
(22)  if (CL(ordrho(k)) == −1)
(23)    CL(ordrho(k)) = CL(nneigh(ordrho(k))); //assign the non-center data points to the cluster with the nearest local density that is higher than its own.
(24)end
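As a complement to the pseudocode, the following is a minimal Python sketch of Step 5 showing how the intermediate arrays ordrho and nneigh used above can be built; the function name and structure are ours, and it assumes that the densest point is among the selected centers, as in the original DPC.

```python
import numpy as np

def assign_labels(dist, rho, centers):
    """Step 5 of Algorithm 2: label the cluster centers, then assign every
    remaining point to the cluster of its nearest higher-density neighbor."""
    n = len(rho)
    ordrho = np.argsort(rho)[::-1]               # indices sorted by decreasing density
    nneigh = np.zeros(n, dtype=int)              # nearest neighbor with higher density
    for rank in range(1, n):
        i = ordrho[rank]
        higher = ordrho[:rank]
        nneigh[i] = higher[np.argmin(dist[i, higher])]
    nneigh[ordrho[0]] = ordrho[0]                # densest point has no such neighbor
    cl = -np.ones(n, dtype=int)                  # Step 5.1: initialize labels to -1
    for label, c in enumerate(sorted(centers), start=1):
        cl[c] = label                            # Step 5.2: label the center points
    for i in ordrho:                             # Step 5.3: sweep in density order, so the
        if cl[i] == -1:                          # higher-density neighbor is already labeled
            cl[i] = cl[nneigh[i]]                # (assuming the densest point is a center)
    return cl
```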
3.5. Complexity Analysis

For a dataset with n data points, the space complexity of the algorithm mainly comes from the storage of the distance matrix. According to the input format of the DPC algorithm, the distances are stored as an n(n − 1)/2 × 3 matrix, in which columns 1 and 2 are the data point numbers and column 3 is the distance between the two data points, requiring O(n²) storage space. In addition, the algorithm requires three arrays of length n to store the local density ρ, the relative distance δ, and their product γ, so the space complexity is O(n²).

The time complexity of the AMDPC algorithm is mainly derived from the distance calculation in Step 2 and the local density computation in Step 3. The time complexity of the distance computation and of the local density and γ calculation is O(n²). The sorting time complexity in Step 4 (Algorithm 1) depends on the sorting algorithm, being at least O(n log n) and at most O(n²), so the total complexity is no more than O(n²). The time complexity of the data point allocation in Step 5 is O(n). Therefore, the overall complexity of the algorithm is O(n²), the same as that of the DPC algorithm.

4. Experimental Analysis

To verify the effectiveness of the AMDPC algorithm in this paper, we used several mixed datasets from the University of California-Irvine (UCI) for experimental study. We compared the clustering results of the AMDPC algorithm with those of the K-prototype and DPC_M algorithms.

We implemented the three algorithms in MATLAB 2015a (MathWorks, USA) running on Windows 10 on a laptop with Intel Core i5-5200u model CPU and 4 GB of DDR3 memory.

4.1. Experimental Datasets

In this study, we investigated five mixed-attribute datasets from the UCI machine-learning repository, namely, Statlog Heart, Cleveland Heart Disease, Statlog Credit Approval, Acute Inflammations, and Adult. Brief information describing these datasets is shown in Table 3.

The Acute Inflammations dataset contains pathological and physiological indicators for 120 patients with acute inflammation. There is one numerical attribute (body temperature) and five categorical attributes (different symptoms) used to determine whether each patient has cystitis and nephritis. There are two class labels representing the two diseases; we used the first one (cystitis) as the prediction target in our experiments. The deletion of instances with missing data does not affect the clustering analysis; therefore, we eliminated 6 instances with missing values from the Cleveland dataset and 37 instances with missing values from the Credit dataset before the experiments. The Adult dataset was extracted from the census bureau database and contains 30162 training instances; we selected 3000 of them by random sampling. In addition, we normalized the numerical attributes using the maximum-minimum normalization method.
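This preprocessing can be sketched as follows; the file name, the missing-value code, and the column handling are hypothetical, since each UCI file has its own format.

```python
import pandas as pd

# Hypothetical file and column handling; the actual UCI files have their own formats.
df = pd.read_csv("cleveland.csv", na_values="?")   # missing values coded as "?" here
df = df.dropna()                                   # drop instances with missing values

num_cols = df.select_dtypes("number").columns
# maximum-minimum normalization of the numerical attributes
df[num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())

# for the Adult data: random sample of 3000 training instances
# adult_sample = adult_df.sample(n=3000, random_state=0)
```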

4.2. Effectiveness Analysis

We used the K-prototype algorithm, DPC_M, and the proposed AMDPC algorithm to separately cluster the datasets described in Section 4.1. According to the research in [5], the parameter γ of the K-prototype algorithm was set to 1/2σ (σ represents the average standard deviation of the numerical attributes). The K-prototype algorithm was run 100 times, and the clustering results were averaged. In the DPC_M algorithm, the percentage parameter p = 2, as described in [15]. When the AMDPC algorithm calculated the local density, the parameter K was set to ⌈0.1n⌉, where n is the number of data points; that is, 10% of the data points were taken as the nearest neighbors.

Because the UCI datasets have real class labels, the clustering accuracy (ACC) can be used as a validity index. We also used the normalized mutual information (NMI), Rand index (RI), adjusted Rand index (ARI), and F-score as validity indexes. For all indexes, the higher the value, the better the clustering effect. The optimal results are indicated in bold in Tables 4–8.

Accordingly, we observed that the performance of the AMDPC algorithm was much better than that of the traditional K-prototype algorithm: AMDPC improved the clustering accuracy on all datasets, by 22.58%, 24.25%, 28.03%, 22.5%, and 10.12% for the Heart, Cleveland, Credit, Acute, and Adult datasets, respectively. It also outperformed the DPC_M algorithm on the first four datasets, as shown in Tables 4–7. On the Adult dataset, the clustering accuracy of the AMDPC algorithm was 0.43% lower than that of the DPC_M algorithm, but AMDPC was better on the NMI and ARI indexes. The F-score takes both precision and recall into account; its values show that the algorithms perform differently on different experimental datasets, and the proposed AMDPC algorithm achieved the best F-score on the Credit dataset.

As shown in Table 7, for the first four indexes, the DPC_M algorithm had two different results because of the different selection of center points. The DPC_M2 was worse than the AMDPC algorithm in the clustering effect, whereas the DPC_M1 was better. This showed that the selection of the center point in the DPC algorithm had a significant influence on the clustering effect, whereas the AMDPC algorithm automatically determined the center point of the clustering, and the algorithm ran more stably. The decision graphs for center selection by the two aforementioned clustering types using the DPC_M algorithm (that is, DPC_M1 and DPC_M2) are shown in Figure 3.

4.3. Parameter-Adjustment Experiment

The AMDPC algorithm uses the improved Gaussian kernel function of the KNN information to calculate the local density of each data point. To understand how the parameter K affects the effectiveness of the algorithm, we conducted a series of experiments and found that the best effect was obtained when K was approximately 10% of the data instances in the dataset.

Taking the Heart dataset as an example, there are 270 data points in total. We took K as 1%–20% of the data points and calculated the clustering accuracy of the AMDPC algorithm. The results are presented in Table 9 (optimal results are indicated in bold).

As shown in Table 9, when K was 10% of the data points (K = 27), the clustering accuracy reached its best value. As shown in Figure 4, on the Heart and Cleveland datasets, setting K to 10% of the data points achieved the best effect. On the Credit and Acute datasets, some values of K led to an incorrect cluster number, so the corresponding clustering accuracy (ACC = 0) is marked as "not available" in the graph. For the Acute dataset, K = 4 or 5 was best, and 10% was also good. Therefore, we set the value of K in the AMDPC algorithm to 10% of the number of data points in the dataset.

4.4. Computational Complexity Experiment

To verify the time complexity of the proposed algorithm, we calculated the running time of the above three algorithms. The K-prototype algorithm was run 100 times and the running times were averaged. The running times are shown in Table 10.

As shown above, the K-prototype algorithm is the most efficient; the proposed AMDPC needs more time to calculate the distances and compute the local densities. With an increase in data volume, the time consumption of the AMDPC algorithm and that of the DPC_M algorithm maintain a roughly linear relationship, and the time complexities of the two algorithms are of the same order of magnitude, which is consistent with the previous theoretical analysis. As shown in Table 11, for the 120 points of the Acute dataset, the clustering time of AMDPC is 3.6 times that of the DPC_M algorithm; when the data amount increases to 653 points (Credit), it is about 6 times that of DPC_M; and when the data amount increases to 3000 points (Adult), it is less than 2 times that of DPC_M.

5. Conclusion

The DPC algorithm is simple and efficient. As long as the distance-measurement problem for data points in a mixed-attribute dataset is solved, the DPC algorithm can also be used for efficient clustering of mixed-attribute data. In this paper, we studied clustering methods for mixed-attribute data, focusing on the DPC algorithm and its adaptive improvement. Accordingly, we proposed an adaptive mixed-attribute data clustering algorithm based on DPC, called AMDPC, which adopts a unified mixed-attribute distance-measurement method and a KNN-based adaptive local density calculation method. We used three inflection points to calculate the cluster center set and automatically determine the cluster number, which realizes adaptive clustering of mixed-attribute datasets. The analysis of the experimental results showed that the proposed algorithm was significantly superior to the traditional K-prototype and DPC_M algorithms. On all five datasets, the clustering accuracy of the AMDPC algorithm improved significantly over that of the K-prototype algorithm, by 10.12% to 28.03%, and it also improved slightly over the DPC_M algorithm except on the Adult dataset. In addition, AMDPC implements adaptive clustering without manual adjustment of any parameters.

The AMDPC algorithm realizes adaptive clustering of mixed-attribute data well. When we used KNN to calculate the local density of the data points, the determination of K differed from the values used in previous research [10, 14], and the value of K also had a significant influence on the clustering effect. According to the experimental analysis, the effect was optimal when K was 10% of the data points, but there is still room to tune the value of K on different datasets, which requires further research. Many problems in adaptive clustering of mixed-attribute data remain to be studied, such as mixed-attribute data clustering on datasets containing a huge number of objects or attributes, or on datasets with arbitrary shapes, different sizes, variable density, and overlapping clusters.

Data Availability

Data used to support the findings of this study are available from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/index.php, https://github.com/milaan9/Clustering-Datasets.

Conflicts of Interest

The author declares no conflicts of interest.

Acknowledgments

The author thanks LetPub (https://www.letpub.com) for its linguistic assistance during the preparation of this manuscript. This research was supported by the Qingshan Lake Science and Technology City Joint Fund of Zhejiang Provincial Natural Science Foundation of China under Grant no. LQY19F020001 and the Wenzhou Polytechnic Major Scientific Research Projects (Grant no. WZYSDCY2018002).