Abstract

Clustering is one of the important research topics in the field of machine learning. Neutrosophic clustering is a generalization of fuzzy clustering and has been applied to many fields. This paper presents a new neutrosophic clustering algorithm built on regularization. First, a regularization term is introduced into the FC-PFS algorithm to induce sparsity, which reduces the complexity of the algorithm on large data sets. Second, we propose a method that simplifies the process of determining the regularization parameter. Finally, experiments show that, on both artificial and real data sets, the clustering results of the proposed algorithm are better than those of other clustering algorithms in most cases.

1. Introduction

With the rapid development of information technology, the dimensionality of data on the Internet has grown exponentially. For example, the dimensionality of documents, multimedia, and gene expression data can reach hundreds of thousands. Facing such data, scholars have proposed many data processing methods [1–3].

In 1965, Zadeh [4] proposed the concept of fuzzy sets. Fuzzy theory has been applied in many areas, such as multiattribute decision-making [5–7], image processing [8], and cluster analysis [9]. In particular, fuzzy clustering has made considerable progress in the past few decades. Based on fuzzy sets, the FCM algorithm [10] was proposed. The quality of its clustering results is good, but it still has difficulties with uncertain data. Therefore, in recent years, scholars have proposed a variety of methods to improve various aspects of the fuzzy c-means algorithm. Hwang et al. [11] combined type-2 fuzzy sets with FCM (T2-FCM) and improved the handling of the uncertainty that affects the final assignment to the c classes. Linda et al. [12] proposed the general type-2 fuzzy c-means (GT2-FCM) algorithm via the α-plane representation theorem, described ambiguity in linguistic terms, and transformed linguistic uncertainty into uncertain fuzzy positions of the extracted clusters. The algorithm in [12] works well when there are noisy samples or insufficient training samples. Both T2-FCM and GT2-FCM address the uncertainty of the fuzzy c-means algorithm.

In 1986, Atanassov [13] proposed the concept of intuitionistic fuzzy sets, which overcomes some drawbacks of traditional fuzzy sets and is more capable of processing uncertain information. Chaira et al. [14] introduced intuitionistic fuzzy entropy into the traditional fuzzy c-means algorithm; the resulting algorithm was used to cluster partial CT brain scan images and can identify brain abnormalities. Bukiewicz et al. [15] introduced a variable to deal with uncertainty and similarity measurement between intuitionistic fuzzy sets in the fuzzy c-means algorithm and proposed a fuzzy clustering method for data sets based on intuitionistic fuzzy set theory. Zhao et al. [16] constructed a λ-cutting matrix by calculating correlation coefficients on intuitionistic fuzzy sets and then clustered on the cutting matrix. Cuong [17] proposed the concept of the picture fuzzy set (PFS), which is a direct extension of fuzzy sets and intuitionistic fuzzy sets. Thong [18] proposed a picture fuzzy clustering algorithm based on picture fuzzy sets. The algorithms proposed in [14–18] have better clustering performance than traditional algorithms, but they have certain limitations in application: the generated membership matrix is not sparse, which increases the amount of computation.

In view of the limitations of intuitionistic fuzzy sets, Smarandache [19] proposed neutrosophic set theory. The basic idea is that everything can be described by three degrees: truth, indeterminacy, and falsity. Each object has three membership functions, each of which takes values in the standard or nonstandard subsets of $]0^{-}, 1^{+}[$. Neutrosophic set theory can not only describe uncertainty problems better but also solve problems that arise when applying fuzzy theory. Therefore, scholars have studied neutrosophic sets in depth [20–24] and proposed many neutrosophic clustering algorithms. Ye [25] proposed a single-valued neutrosophic minimum spanning tree (SVNMST) clustering algorithm, which shows great advantages in the clustering of single-valued neutrosophic observation data. In the same year, Ye [26] proposed single-valued neutrosophic clustering methods based on similarity measures between single-valued neutrosophic sets (SVNSs). Guo [27] proposed the neutrosophic c-means clustering algorithm (NCM). The NCM algorithm can quantify both certainty and uncertainty, and its membership function is not affected by noise. Nowadays, neutrosophic clustering has been applied to many fields such as image segmentation and biology [28–32]. The PFS is a standardized form of the neutrosophic set, so the FC-PFS algorithm proposed in [18] is in fact a neutrosophic-set-type algorithm. However, the algorithm needs to compute three matrices of the same size, and the membership matrix is not sparse, which affects the clustering effect to a certain extent.

To solve the abovementioned problems, this paper proposes a new sparse neutrosophic clustering algorithm (SNCM). The main idea is to introduce a regularization term into the FC-PFS algorithm. The new algorithm produces sparsity, since it reduces the number of nonzero membership values of each sample, and thus SNCM reduces the complexity of the model. Experiments show that the performance of the proposed algorithm is better than that of several other clustering algorithms, and the experiments indeed produce a sparse membership matrix, which reflects the effectiveness of the algorithm. The rest of this article is arranged as follows.

The second section introduces the related basic concepts and algorithms; the third section presents the new algorithm and its solution process; the fourth section verifies the effectiveness of the proposed algorithm through experiments; and the fifth section gives the conclusions.

2. Preliminaries

In this paper, the data set $X = \{x_1, \ldots, x_n\}$ contains n data points, each of which is a d-dimensional feature vector; the purpose of clustering is to obtain c clusters. The following introduces two clustering algorithms, FCM and FC-PFS.

2.1. FCM Algorithm

The FCM algorithm, proposed in 1984, is a very well-known algorithm. It is not only used in fuzzy engineering but is also popular in fields such as medical diagnosis and communication. The FCM algorithm assigns each data point to the clusters with membership values $u_{ij}$, where $u_{ij}$ denotes the degree to which the i-th data point belongs to the j-th cluster. The center of the j-th cluster is denoted $v_j$, and the objective function of the FCM algorithm is

$$J = \sum_{i=1}^{n}\sum_{j=1}^{c} u_{ij}^{m}\,\|x_i - v_j\|^2, \tag{1}$$

where m is a fuzzy parameter. The constraint condition of formula (1) is as follows:

$$u_{ij} \in [0,1], \quad \sum_{j=1}^{c} u_{ij} = 1, \quad i = 1, \ldots, n. \tag{2}$$

Using the Lagrangian multiplier method, the iterative updates of the membership degrees and cluster centers are obtained:

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \|x_i - v_j\| / \|x_i - v_k\| \right)^{2/(m-1)}}, \tag{3}$$

$$v_j = \frac{\sum_{i=1}^{n} u_{ij}^{m} x_i}{\sum_{i=1}^{n} u_{ij}^{m}}. \tag{4}$$

The iteration terminates when the number of iterations reaches the maximum value or $|J^{(t)} - J^{(t-1)}| < \varepsilon$, where $J^{(t)}$ and $J^{(t-1)}$ are the objective function values of the t-th and (t − 1)-th iterations and $\varepsilon$ is the termination threshold, generally in the range (0, 0.1). According to the fuzzy membership values, if $u_{ij} = \max_k u_{ik}$, then $x_i$ is assigned to the j-th cluster. It can be proved that the algorithm finally converges to a local optimum or a saddle point of the objective function.
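For concreteness, the following is a minimal sketch of the FCM iteration in Python/NumPy under the formulas above (our illustration, not code from the original paper; the tolerance `eps` and the random initialization scheme are assumptions):

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, eps=1e-4, seed=0):
    """Minimal FCM sketch: X is an (n, d) data matrix, c the number of clusters."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random membership matrix satisfying constraint (2): rows sum to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    J_old = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # Cluster centers, equation (4): membership-weighted means.
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Squared Euclidean distances d_ij = ||x_i - v_j||^2 (guarded away from 0).
        D = np.maximum(((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2), 1e-12)
        # Membership update, equation (3), written with squared distances.
        W = D ** (-1.0 / (m - 1.0))
        U = W / W.sum(axis=1, keepdims=True)
        J = ((U ** m) * D).sum()
        if abs(J_old - J) < eps:  # termination threshold
            break
        J_old = J
    return U, V

# Each point goes to the cluster with the largest membership value:
# labels = fcm(X, c=3)[0].argmax(axis=1)
```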

2.2. FC-PFS Algorithm

Definition 1. A picture fuzzy set A on a nonempty set X is

$$A = \{ (x, \mu_A(x), \eta_A(x), \nu_A(x)) \mid x \in X \},$$

where $\mu_A(x)$ is the degree of positive membership of each $x$ in A, $\eta_A(x)$ is the degree of neutral membership of $x$ in A, and $\nu_A(x)$ is the degree of negative membership of $x$ in A, and it satisfies the following conditions:

$$\mu_A(x), \eta_A(x), \nu_A(x) \in [0,1], \quad \mu_A(x) + \eta_A(x) + \nu_A(x) \le 1.$$

The refusal degree of an element is calculated as

$$\xi_A(x) = 1 - \left( \mu_A(x) + \eta_A(x) + \nu_A(x) \right).$$
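As a small numeric illustration of Definition 1 (a sketch we add, not part of the original), the refusal degree follows directly from the conditions above:

```python
def refusal_degree(mu, eta, nu):
    """Refusal degree 1 - (mu + eta + nu) of a picture fuzzy element."""
    for v in (mu, eta, nu):
        assert 0.0 <= v <= 1.0
    assert mu + eta + nu <= 1.0, "picture fuzzy condition violated"
    return 1.0 - (mu + eta + nu)

# Example: positive 0.5, neutral 0.2, negative 0.2 -> refusal degree 0.1.
print(refusal_degree(0.5, 0.2, 0.2))
```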

Definition 2. Let X be an object (point) set with generic element x. A neutrosophic set A on X can be expressed as

$$A = \{ \langle x, T_A(x), I_A(x), F_A(x) \rangle \mid x \in X \},$$

where $T_A(x)$ is the truth membership degree, $I_A(x)$ is the indeterminacy membership degree, and $F_A(x)$ is the falsity membership degree, each of which belongs to the standard or nonstandard subsets of $]0^{-}, 1^{+}[$, i.e., $T_A(x), I_A(x), F_A(x) : X \rightarrow\; ]0^{-}, 1^{+}[$. Because there is no restriction on the sum of $T_A(x)$, $I_A(x)$, and $F_A(x)$, we have $0^{-} \le \sup T_A(x) + \sup I_A(x) + \sup F_A(x) \le 3^{+}$.
From the abovementioned two definitions, it can be seen that the picture fuzzy set is in fact the standard form of the neutrosophic set. Therefore, the FC-PFS algorithm proposed by Thong and Son is based on the neutrosophic set. The objective function of the algorithm is

$$J = \sum_{i=1}^{n}\sum_{j=1}^{c} \left( u_{ij}(2 - \xi_{ij}) \right)^{m} \|x_i - v_j\|^2 + \sum_{i=1}^{n}\sum_{j=1}^{c} \eta_{ij} \left( \log \eta_{ij} + \xi_{ij} \right), \tag{8}$$

where $u_{ij}$, $\xi_{ij}$, and $\eta_{ij}$ are the true membership degree, refusal membership degree, and neutral membership degree of the i-th data point belonging to the j-th cluster, respectively. The constraints of formula (8) are

$$u_{ij}, \eta_{ij}, \xi_{ij} \in [0,1], \quad \sum_{j=1}^{c} u_{ij}(2 - \xi_{ij}) = 1, \quad \sum_{j=1}^{c} \left( \eta_{ij} + \frac{\xi_{ij}}{c} \right) = 1. \tag{9}$$

Using the Lagrangian multiplier method, the update formulas of $u_{ij}$, $\eta_{ij}$, $\xi_{ij}$, and $v_j$ are obtained and applied iteratively (see [18]). The iteration terminates when the number of iterations reaches the maximum or $|J^{(t)} - J^{(t-1)}| < \varepsilon$.

3. Sparse Neutrosophic Clustering Algorithm

3.1. Determining the Objective Function

In traditional k-means clustering, each row of the membership matrix U contains a single 1, and the remaining c − 1 elements of the row are 0, so each row sum of U is 1 and each column sum equals the number of sample points in the corresponding cluster; the fuzzy c-means algorithm, in turn, needs to choose an appropriate fuzzy degree m. Different from the abovementioned clustering algorithms, the algorithm in this paper relaxes each element of U to a nonnegative value less than 1 under the constraint conditions and presets the ambiguity m = 1. Our goal is to obtain a sparse U, so we introduce a regularization term to obtain the objective function of the new algorithm:

$$J = \sum_{i=1}^{n}\sum_{j=1}^{c} u_{ij}(2 - \xi_{ij}) \|x_i - v_j\|^2 + \sum_{i=1}^{n}\sum_{j=1}^{c} \eta_{ij}\left( \log \eta_{ij} + \xi_{ij} \right) + \gamma \sum_{i=1}^{n}\sum_{j=1}^{c} \left( u_{ij}(2 - \xi_{ij}) \right)^2. \tag{11}$$

The abovementioned formula satisfies the following constraints:

$$u_{ij}, \eta_{ij}, \xi_{ij} \in [0,1], \quad \sum_{j=1}^{c} u_{ij}(2 - \xi_{ij}) = 1, \quad \sum_{j=1}^{c}\left( \eta_{ij} + \frac{\xi_{ij}}{c} \right) = 1. \tag{12}$$

We can see that if the sample point $x_i$ is assigned to a single cluster j, then $u_{ij}(2 - \xi_{ij})$ is equal to 1; otherwise, it is a nonnegative value less than 1.

The new algorithm considers the sparsity of the membership degrees assigned to the different clusters for each sample point during clustering. In the process of minimizing equation (11), the importance of each part is controlled by the parameter $\gamma$. If $\gamma$ is zero, the membership vector of each sample has only a single nonzero element. If $\gamma$ is adjusted, the sparsity of the membership vector changes accordingly: as $\gamma$ gradually increases, the membership vector contains more and more nonzero elements, and when the maximum value is reached, all elements of the membership vector are nonzero, so the membership vector is no longer sparse. Therefore, this parameter controls the sparsity of the membership vector. We give a method to determine an appropriate value of $\gamma$ in Section 3.3 so as to obtain more accurate clustering results.

3.2. The Proposed Model and Solutions

We solve the abovementioned model using an alternating iteration method. First, fix the variables U, $\eta$, and $\xi$ and find the cluster centers V. The derivative of (11) with respect to $v_j$ is

$$\frac{\partial J}{\partial v_j} = -2 \sum_{i=1}^{n} u_{ij}(2 - \xi_{ij})\,(x_i - v_j).$$

By setting $\partial J / \partial v_j = 0$, we have

$$v_j = \frac{\sum_{i=1}^{n} u_{ij}(2 - \xi_{ij})\, x_i}{\sum_{i=1}^{n} u_{ij}(2 - \xi_{ij})}. \tag{16}$$

Next, solve U with V, $\xi$, and $\eta$ fixed. To facilitate the solution, we rewrite the objective function as

$$\min_{S}\; \sum_{i=1}^{n}\sum_{j=1}^{c} s_{ij} d_{ij} + \gamma \sum_{i=1}^{n}\sum_{j=1}^{c} s_{ij}^2, \quad \text{s.t.}\;\; s_{ij} \ge 0, \;\; \sum_{j=1}^{c} s_{ij} = 1, \tag{17}$$

where $s_{ij} = u_{ij}(2 - \xi_{ij})$ is an element of the matrix S, $s_i$ is the i-th row of S, and $d_{ij} = \|x_i - v_j\|^2$ is an element of the distance matrix D. For each i, problem (17) can be divided into n independent subproblems:

$$\min_{s_i \ge 0,\; s_i \mathbf{1} = 1}\; \sum_{j=1}^{c} \left( s_{ij} d_{ij} + \gamma s_{ij}^2 \right). \tag{18}$$

Then, (18) is written in the following vector form:

$$\min_{s_i \ge 0,\; s_i \mathbf{1} = 1}\; \left\| s_i + \frac{d_i}{2\gamma} \right\|_2^2. \tag{19}$$

By solving problem (19), the solution S is obtained, and the update formula of U follows:

$$u_{ij} = \frac{s_{ij}}{2 - \xi_{ij}}. \tag{20}$$
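Problem (19) is the Euclidean projection of $-d_i/(2\gamma)$ onto the probability simplex, so a standard sorting-based projection routine solves it exactly. Below is a sketch (an implementation we assume, not the authors' code):

```python
import numpy as np

def project_simplex(a):
    """Euclidean projection of a vector a onto {s : s >= 0, sum(s) = 1}."""
    c = a.shape[0]
    u = np.sort(a)[::-1]                               # sort descending
    css = np.cumsum(u)
    # Largest rho such that u_rho + (1 - sum of the first rho entries)/rho > 0.
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, c + 1) > 0)[0][-1] + 1
    tau = (1.0 - css[rho - 1]) / rho
    return np.maximum(a + tau, 0.0)

def solve_subproblem(d_i, gamma):
    """Solve (19) for one sample: min ||s + d_i/(2 gamma)||^2 on the simplex."""
    return project_simplex(-d_i / (2.0 * gamma))
```

Consistent with the discussion in Section 3.1, a smaller γ pushes the projected vector toward a single nonzero entry, while a larger γ flattens it, so more entries survive the thresholding.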

The specific solution process for problem (19) is given in Section 3.3. Next, fixing the variables U, V, and $\xi$, we use the Lagrange multiplier method to solve for $\eta$:

$$L(\eta, \lambda) = \sum_{i=1}^{n}\sum_{j=1}^{c} \eta_{ij}\left( \log \eta_{ij} + \xi_{ij} \right) - \sum_{i=1}^{n} \lambda_i \left( \sum_{j=1}^{c} \left( \eta_{ij} + \frac{\xi_{ij}}{c} \right) - 1 \right).$$

Setting the derivative of L with respect to $\eta_{ij}$ equal to zero, we obtain

$$\eta_{ij} = \frac{e^{-\xi_{ij}}}{\sum_{k=1}^{c} e^{-\xi_{ik}}} \left( 1 - \frac{1}{c} \sum_{k=1}^{c} \xi_{ik} \right). \tag{27}$$

Finally, using a technique similar to Yager's generating operators [33], we modify the hesitation degree of the intuitionistic fuzzy set to obtain the refusal degree of each element, replacing the intuitionistic membership sum with $u_{ij} + \eta_{ij}$, as follows:

$$\xi_{ij} = 1 - \left( u_{ij} + \eta_{ij} \right) - \left( 1 - \left( u_{ij} + \eta_{ij} \right)^{\alpha} \right)^{1/\alpha}, \quad \alpha \in (0, 1]. \tag{28}$$
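A compact sketch of these two updates, assuming the reconstructed forms of equations (27) and (28) above (illustrative code, not the authors' implementation):

```python
import numpy as np

def update_eta(Xi):
    """Equation (27): softmax-like row update of the neutral membership eta."""
    W = np.exp(-Xi)
    scale = 1.0 - Xi.mean(axis=1, keepdims=True)   # 1 - (1/c) * sum_k xi_ik
    return W / W.sum(axis=1, keepdims=True) * scale

def update_xi(U, Eta, alpha=0.9):
    """Equation (28): refusal degree via Yager's generator with exponent alpha."""
    t = np.clip(U + Eta, 0.0, 1.0)
    return 1.0 - t - (1.0 - t ** alpha) ** (1.0 / alpha)
```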

3.3. Optimization Method for Problem (19)

In practice, the regularization parameter $\gamma$ in problem (19) is difficult to determine, since its value can range from zero to infinity. In this section, a method for determining the regularization parameter is given. The Lagrangian function of problem (19) is

$$L(s_i, \lambda, \beta_i) = \frac{1}{2} \left\| s_i + \frac{d_i}{2\gamma_i} \right\|_2^2 - \lambda \left( s_i^{\top}\mathbf{1} - 1 \right) - \beta_i^{\top} s_i, \tag{29}$$

where $\lambda$ and $\beta_i \ge 0$ are Lagrange multipliers.

According to the KKT conditions, the optimal solution $s_{ij}$ has the following form:

$$s_{ij} = \left( -\frac{d_{ij}}{2\gamma_i} + \lambda \right)_{+}. \tag{30}$$

In practice, we usually obtain better performance if we focus on the locality of the data; therefore, it is preferable to learn a sparse $s_i$. Another advantage of learning a sparse matrix S is that it greatly reduces the computational burden of subsequent processing. Without loss of generality, assume that $d_{i1}, d_{i2}, \ldots, d_{ic}$ are sorted from small to large. If the optimal $s_i$ has only k nonzero elements, then according to equation (30), we know $s_{ik} > 0$ and $s_{i,k+1} = 0$. So, we have

$$-\frac{d_{ik}}{2\gamma_i} + \lambda > 0, \tag{31}$$

$$-\frac{d_{i,k+1}}{2\gamma_i} + \lambda \le 0. \tag{32}$$

According to equation (30) and the constraint $s_i^{\top}\mathbf{1} = 1$, we have

$$\sum_{j=1}^{k} \left( -\frac{d_{ij}}{2\gamma_i} + \lambda \right) = 1 \;\Longrightarrow\; \lambda = \frac{1}{k} + \frac{1}{2k\gamma_i} \sum_{j=1}^{k} d_{ij}. \tag{33}$$

According to equations (31)–(33), we have the following inequality for $\gamma_i$:

$$\frac{k}{2} d_{ik} - \frac{1}{2}\sum_{j=1}^{k} d_{ij} < \gamma_i \le \frac{k}{2} d_{i,k+1} - \frac{1}{2}\sum_{j=1}^{k} d_{ij}. \tag{34}$$

Therefore, in order to obtain an optimal solution of problem (19) with exactly k nonzero values, we can set

$$\gamma_i = \frac{k}{2} d_{i,k+1} - \frac{1}{2}\sum_{j=1}^{k} d_{ij}.$$

Taking the average of the $\gamma_i$ over all samples, the calculation formula of $\gamma$ is as follows:

$$\gamma = \frac{1}{n}\sum_{i=1}^{n} \left( \frac{k}{2} d_{i,k+1} - \frac{1}{2}\sum_{j=1}^{k} d_{ij} \right). \tag{35}$$

Equation (35) thus gives a method to determine the regularization parameter.

According to equations (31), (33), and (35), the following optimal solution can be obtained:

$$s_{ij} = \begin{cases} \dfrac{d_{i,k+1} - d_{ij}}{k\, d_{i,k+1} - \sum_{h=1}^{k} d_{ih}}, & j \le k, \\[1.5ex] 0, & j > k. \end{cases} \tag{36}$$
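The closed forms (35) and (36) can be computed directly from the distance matrix without any iterative solver; the following sketch (our assumed implementation) returns γ and the k-sparse rows of S:

```python
import numpy as np

def gamma_and_S(D, k):
    """D: (n, c) matrix of squared distances; k: nonzeros per row of S (k < c)."""
    n, c = D.shape
    assert 0 < k < c
    Ds = np.sort(D, axis=1)                 # each row sorted in ascending order
    head = Ds[:, :k].sum(axis=1)            # sum of the k smallest distances
    # Equation (35): average the per-sample gamma_i over all samples.
    gamma = np.mean(0.5 * k * Ds[:, k] - 0.5 * head)
    # Equation (36): entries beyond the k nearest clusters are clipped to zero.
    S = np.maximum(Ds[:, [k]] - D, 0.0)
    S /= np.maximum(k * Ds[:, [k]] - head[:, None], 1e-12)
    return gamma, S
```

Each row of S then sums to 1 and has exactly k nonzero entries (up to ties in the distances).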

3.3.1. Sparse Neutrosophic Clustering Algorithm
Input: data set X, number of clusters c, and parameters α and k
Output: clustering result y
(a) Initialization: set t = 0 and randomly initialize U, η, and ξ subject to the restriction condition (12)
(b) for t = 1, 2, …, maxSteps do
(c) Update V: calculate $v_j$ by equation (16)
(d) Update U: for i = 1, 2, …, n, calculate $s_i$ by equation (36), thereby solving problem (19); then obtain $u_{ij}$ by equation (20)
(e) Update η: calculate η by equation (27)
(f) Update ξ: calculate ξ by equation (28)
(g) If the maximum number of iterations is reached or $|J^{(t)} - J^{(t-1)}| < \varepsilon$, jump out of the iteration
(h) end for
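Putting the pieces together, the following is a runnable end-to-end sketch of the SNCM loop, reusing the helper functions `gamma_and_S`, `update_eta`, and `update_xi` sketched above; this is our illustration of the listed procedure under the reconstructed equations, not the authors' MATLAB code.

```python
import numpy as np

def sncm(X, c, k, alpha=0.9, max_steps=1000, eps=1e-5, seed=0):
    """Sparse neutrosophic clustering sketch; X is (n, d), returns labels y."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialization subject to restriction (12): rows of U*(2 - Xi) sum to 1
    # and rows of Eta + Xi/c sum to 1.
    Xi = np.full((n, c), 0.1)
    Eta = np.full((n, c), (1.0 - 0.1) / c)
    U = rng.random((n, c))
    U /= (U * (2.0 - Xi)).sum(axis=1, keepdims=True)
    J_old = np.inf
    for _ in range(max_steps):
        S = U * (2.0 - Xi)                          # working variable s_ij
        V = (S.T @ X) / np.maximum(S.sum(axis=0)[:, None], 1e-12)  # eq. (16)
        D = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)     # d_ij
        gamma, S = gamma_and_S(D, k)                # eqs. (35) and (36)
        U = S / (2.0 - Xi)                          # eq. (20)
        Eta = update_eta(Xi)                        # eq. (27)
        Xi = update_xi(U, Eta, alpha)               # eq. (28)
        # Monitor the S-dependent part of objective (11) for the stopping test.
        J = (S * D).sum() + gamma * (S ** 2).sum()
        if abs(J_old - J) < eps:
            break
        J_old = J
    return (U * (2.0 - Xi)).argmax(axis=1)          # hard labels from final S
```

For example, `y = sncm(X, c=3, k=5)` would cluster X into three groups with at most five nonzero memberships per sample.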

Below, we analyze the complexity of the algorithm. First, consider the time complexity. From the algorithm steps, the basic operations lie in the loop body that iteratively updates the variables, with an inner loop that computes U, so the time complexity of the algorithm is O(nt), where t is the number of iterations and n is the number of sample points. Second, the space complexity of the algorithm is determined by the data scale, so the space complexity is O(nm), where n is the number of sample points and m is the dimension.

4. Results and Discussion

In order to verify the feasibility of the proposed clustering algorithm SNCM, several classic clustering algorithms are selected as comparison algorithms: FCM [10], K-means [34], Ncut [35], Rcut [36], FC-PFS [18], and an effective clustering method based on data indeterminacy in the neutrosophic set domain (INCM) [37]. Evaluation indicators including accuracy (ACC) and normalized mutual information (NMI) are used to evaluate the clustering results.

In terms of parameters, due to the instability of K-means, FCM, and FC-PFS, their results are averaged over 50 runs. For Rcut and Ncut, the experiments use the widely used self-tuning Gaussian method to construct the affinity matrix (the scale value is self-tuned). The parameter α in the FC-PFS algorithm is set to 0.9. The parameter values in the INCM algorithm are the best values reported in [37]. The parameter α in the SNCM algorithm is 0.9, the value of the parameter k is self-adjusted, and maxSteps is 1000.

In terms of experimental environment, all experiments in this article are run on the Microsoft Windows 10 system; the processor is an Intel(R) Core(TM) i5-7200U CPU @ 2.50 GHz 2.70 GHz, the memory is 8.00 GB, and the programming software is MATLAB R2016a.

4.1. SNCM Algorithm Descriptions

First, we illustrate the process of the proposed SNCM algorithm clustering the WBC data set, for which n = 683 and c = 2. The initial membership matrix, uncertainty matrix, and rejection matrix are as follows:

The distribution of data points under these initializations is illustrated in Figure 1(a), in which the SNCM algorithm calculates the cluster centers using equation (16):

Then, we calculate the new membership matrix, uncertainty matrix, and rejection matrix:

According to the abovementioned matrices, the calculated value of $|J^{(1)} - J^{(0)}|$ is greater than $\varepsilon$, so the iteration continues. Figure 1(b) shows the distribution of clusters after the first iteration.

Through a similar process, we continue to calculate the cluster centers, membership degrees, uncertainty degrees, and rejection matrices until the stopping condition is met. The final membership degree, hesitation degree, and rejection degree matrices are as follows:

The calculated final cluster centers are expressed as follows, and the distribution of clusters and cluster centers is shown in Figure 1(c):

4.2. Verification of Sparsity

First of all, experiments are carried out on the artificial Aggregation data set and the real Wine data set. The Aggregation data set consists of 788 2-dimensional data points in 7 clusters. The Wine data set consists of 178 12-dimensional data points in 3 clusters. The parameter k satisfies $k < c$. The goal of the experiment is to show that the membership matrix generated by the SNCM algorithm is sparse compared with that of the FCM algorithm. Due to the large number of sample points, it is inconvenient to present the complete membership matrices in the article, so we select some sample points for display. Tables 1–4 show the membership matrices obtained by the SNCM algorithm and the FCM algorithm on the two data sets. It can be seen from the experimental results that the SNCM algorithm effectively reduces the complexity of the model.

Next, we perform experiments on two more artificial data sets. Figures 2(a) and 2(b) show the distributions of the two data sets, where data set (a) has four clusters and data set (b) has three clusters. Clustering is performed using the proposed algorithm, and the clustering results and the weighted connection graphs are shown in Figures 2(c)–2(f). Figures 2(d) and 2(f) use the final membership degree as the connection weight between each data point and the cluster centers. It can be seen that points within a cluster are closely connected to their cluster center, while points of other clusters are separated from it. Therefore, the proposed algorithm can effectively cluster the aforementioned data sets and can effectively separate clusters even when the number of categories is small.

4.3. Real Data Set

In addition, the WBC, Vote, Dermatology, Dnatest, Pima, Vowel, TOX-171, and Abalone data sets are used for the experiments. These data sets are from the UCI Machine Learning Repository. They cover the characteristics of various kinds of data sets, such as high- and low-dimensional data and large and small sample sizes. The information of the nine real data sets is shown in Table 5.

The experimental results on the real data sets are shown in Tables 6 and 7. The bold values represent the best results, followed by the italic ones. Table 6 shows the ACC comparison of the different algorithms on each data set, and Table 7 shows the NMI comparison. The experimental results show that, for different real data sets, the clustering algorithm SNCM is superior to the other clustering algorithms in most cases. Therefore, this also confirms the effectiveness of the clustering algorithm SNCM.

Furthermore, taking the average ACC value of SNCM, the average classification performance of the algorithm is 61.38%, which is higher than INCM (54.02%), FCM (53.30%), FC-PFS (53.23%), K-means (59.81%), Ncut (50.72%), and Rcut (41.39%). The specific situation is shown in Figure 3.

For the parameters, Figure 4 reports the algorithm verified under different values of the exponent α, and the average clustering accuracy of the proposed algorithm under each value is listed in the chart. We find that the clustering quality of SNCM is relatively stable. As the exponent increases, the accuracy of the SNCM algorithm also tends to increase. Therefore, the parameter value α = 0.9 is used in the experimental part to improve the clustering accuracy of the SNCM algorithm.

Finally, we test the convergence of SNCM on the data sets. The results are shown in Figure 5. It can be seen that the SNCM algorithm converges within a few iteration steps.

The SNCM algorithm improves the generalization ability of the base algorithm by introducing a regularization term, so that the membership matrix is sparse, and the calculation of the membership degrees takes the degree of sparsity k into account. Compared with the comparison algorithms, the result of this algorithm is better in most cases, as the experiments on multiple data sets illustrate; the parameter k, however, has a great influence on the results.

5. Conclusion

In this paper, we have proposed a novel method, called the neutrosophic clustering algorithm based on a sparse regular term constraint. Different from previous neutrosophic clustering algorithms, the algorithm proposed in this paper can handle the case of ambiguity m = 1 and is not limited to the condition m > 1. Furthermore, the regular term is introduced to make the algorithm sparse, thereby reducing its computational complexity. Moreover, we propose a method to simplify the process of determining the regularization parameter and improve the clustering effect. In addition, a large number of experiments show that the clustering results of the proposed algorithm on artificial and real data sets are mostly better than those of other clustering algorithms. However, the parameter k has a considerable impact on the clustering effect, so we will focus on this in future work.

Data Availability

The data in this article come from the data sets in the UCI Machine Learning Repository and are available in the official database.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (61976130), Shaanxi Provincial Key Research and Development Program (2018KW-021), Shaanxi Provincial Natural Science Foundation of China (2020JQ-923), and Shaanxi’s Scientific and Technological Commission (Project no. 2019KRM072).