Article

Fast Fusion Clustering via Double Random Projection

1 School of Statistics and Mathematics, Shandong University of Finance and Economics, Jinan 250014, China
2 School of Science, Guangxi University of Science and Technology, Liuzhou 545006, China
3 Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
4 Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB T6G 2G1, Canada
5 Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan 250100, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Entropy 2024, 26(5), 376; https://doi.org/10.3390/e26050376
Submission received: 12 March 2024 / Revised: 25 April 2024 / Accepted: 25 April 2024 / Published: 28 April 2024

Abstract: In unsupervised learning, clustering is a common starting point for data processing. The convex or concave fusion clustering method is a novel approach that is more stable and accurate than traditional methods such as k-means and hierarchical clustering. However, the optimization algorithm used with this method can be slowed down significantly by the complexity of the fusion penalty, which increases the computational burden. This paper introduces a random projection ADMM algorithm based on the Bernoulli distribution and develops a double random projection ADMM method for high-dimensional fusion clustering. These new approaches significantly outperform the classical ADMM algorithm: they increase computational speed by reducing the complexity of the fusion step and improve clustering accuracy by combining multiple random projections under a new evaluation criterion. We also demonstrate the convergence of our new algorithm and test its performance on both simulated and real data examples.

1. Introduction

Clustering is a pivotal technique in unsupervised learning, applied extensively across various scientific and technological fields that handle large datasets. Clustering also plays a crucial role in data labelling, which sets the stage for the application of artificial intelligence and machine learning models [1,2] to the organized data for predictive analytics and classification tasks. Traditional clustering algorithms such as k-means, Gaussian mixture models, and hierarchical clustering often face stability challenges because of their non-convex optimization formulations, which can lead to variability in results arising from factors such as initial conditions or data outliers [3,4,5]. Recent advancements in convex or concave fusion methods have shown promise in enhancing stability, achieving more consistent global or local optimality and reliable estimation of cluster centers and counts through sparsity-inducing penalties on pairwise differences of centers [6,7,8,9]. For clustering high-dimensional data, the data can be mapped into a high-dimensional feature space (kernel space) for processing [10], or clustering can be achieved by optimizing a smooth and continuous objective function based on robust statistics [11]. This paper provides a comprehensive empirical validation of these methods across simulation studies and real data analysis, detailing their improved stability over traditional methods and the practical implications of these advancements.
In fusion clustering, the p-dimensional observations $X_i$, $i = 1, \dots, n$, are each parameterized by their own centroid $\mu_i$. These centroids are estimated under the assumption that all observations can be grouped into $K$ clusters $G_1, \dots, G_K$, such that for $i \in G_k$, $\mu_i = \rho_k$, where $\rho_k$ represents the cluster center for observations in cluster $G_k$. Fusion clustering aims to concurrently estimate the cluster centroids $\rho_k$ and the partitions $G_k$ by minimizing the following objective
$$\frac{1}{2}\sum_{i=1}^{n}\|X_i - \mu_i\|^2 + \sum_{i<j} p_{\lambda}(\|\mu_i - \mu_j\|_{\tau}). \qquad (1)$$
The penalty function $p_{\lambda}(\|\cdot\|_{\tau})$ controls the complexity of the model and is governed by the tuning parameter $\lambda$, where $\|\cdot\|_{\tau}$ denotes the norm used. In fusion clustering, this penalty is typically used to encourage sparsity in the pairwise differences between the estimated cluster centroids.
Convex fusion clustering methods have been widely studied due to their computational simplicity and ability to find global optima. These methods often employ $\ell_1$, $\ell_2$, or $\ell_{\infty}$ penalties as the penalty function $p_{\lambda}(\|\cdot\|_{\tau})$ [12,13,14,15,16,17]. However, convex fusion can lead to biased estimates of the individual centroids, resulting in solutions with a large number of dense clusters [18,19]. To address this issue, researchers have proposed concave fusion clustering methods, such as those using the minimax concave penalty (MCP) [20], the truncated Lasso penalty (TLP) [8], and other concave penalties.
While robust, convex and concave fusion clustering methods are computationally demanding, with $O(n^2 p)$ complexity, which can limit their practicality in scenarios involving large sample sizes $n$ and high-dimensional datasets (large $p$). This article proposes a strategy for overcoming this limitation using random projection techniques [21,22,23,24]. The approach involves the construction of a random diagonal matrix whose diagonal elements are drawn from a Bernoulli distribution. This matrix is then applied to the pairwise component of the fusion method. By doing so, the number of pairwise differences between individual centroids, $\mu_i - \mu_j$, that must be considered is substantially reduced. This reduction not only decreases the computational load but also maintains the integrity of the clustering process, enhancing the algorithm's scalability without excessively increasing the operational overhead. We provide empirical evidence demonstrating that this method significantly reduces the computational time while preserving the clustering quality, as shown in our simulation section.
In unsupervised learning, rapid clustering processes are crucial for handling large datasets efficiently. Our study introduces a novel approach to fusion clustering that enhances computational speed without compromising accuracy. Our contributions are summarized as follows: (1) We propose using random projection techniques to simplify the fusion component of clustering, effectively reducing the number of pairwise centroid discrepancies and significantly boosting computational efficiency by minimizing the fusion step's complexity. (2) We develop a novel double recursive random projection ADMM method designed for efficient high-dimensional fusion clustering, which improves the accuracy of clustering.
In the remainder of this paper, the proposed new ADMM algorithm is described in Section 2, together with an analysis of its computational complexity and convergence and a strategy for improving cluster accuracy. The finite-sample properties of the proposed new ADMM algorithm are evaluated through simulation studies in Section 3, and the method is demonstrated on a real data example in Section 4. Concluding remarks are presented in Section 5, and technical proofs are provided in Appendix A and Appendix B.

2. Methodology

To improve the efficiency of convex or concave fusion clustering, we propose an extension of the classical ADMM algorithm based on random projection, which we call RP-ADMM. A random projection can significantly reduce the time and computational resources needed to analyze high-dimensional data, making it suitable for large datasets and real-time processing. In this section, we discuss the RP-ADMM algorithm's computational complexity and convergence.

2.1. Random Projection Based ADMM

Previous ADMM algorithms for convex or concave fusion clustering [6,8] have suffered from a high computational burden due to the need to consider all $n(n-1)/2$ pairwise differences between individual centroids. These pairs are represented by the fusion matrix $E = \{(e_i - e_j)^T, i < j\} \in \mathbb{R}^{\frac{n(n-1)}{2} \times n}$, where $e_i$ is the $i$th unit vector with a 1 in the $i$th position and 0s elsewhere, so that $e_i - e_j$ encodes the difference between the $i$th and $j$th individual centroids. The computational complexity of this approach is $O(n^2)$, which becomes infeasible for large sample sizes $n$.
  • Bernoulli distribution-based random projections ADMM  
It is worth noting that pairwise differences between individual centroids can be deduced from other differences. For example, if we know that $\mu_1 - \mu_2 = 0$ and $\mu_2 - \mu_3 = 0$, we can conclude that $\mu_1 - \mu_3 = 0$, so it may be unnecessary to consider the row $e_1 - e_3$ in $E$. To reduce the computational burden of convex or concave fusion clustering, we propose a random projection approach that considers only a small subset of the $n(n-1)/2$ pairwise differences between individual centroids. This is achieved by generating indicators $\pi_{ij}$ from a Bernoulli distribution with probability $\alpha$ and forming a random diagonal matrix $\Pi$ with diagonal elements $(\pi_{12}, \dots, \pi_{1n}, \pi_{23}, \dots, \pi_{2n}, \dots, \pi_{(n-1)n})^T$. If $\pi_{ij} = 1$, the difference between $\mu_i$ and $\mu_j$ is taken into account; if $\pi_{ij} = 0$, it is not considered. The probability $\alpha$ controls the size of the subset of pairwise differences considered. The matrix $\Pi E$ can be seen as a Bernoulli-distribution-based projection of $E$ onto a sparse matrix, with about $n(n-1)(1-\alpha)/2$ rows being zero vectors and about $n(n-1)\alpha/2$ rows being nonzero vectors. Finally, we form a new fusion matrix $\Omega$ by deleting the rows of zero vectors in $\Pi E$. The new fusion matrix is given by $\Omega = (\Omega_1, \dots, \Omega_{\kappa})^T$, where $\Omega_j$, $j = 1, \dots, \kappa$, denotes the $j$th row vector of $\Omega$.
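To make the construction concrete, the following minimal sketch (our own illustration in NumPy, not the authors' code; the function name `build_reduced_fusion_matrix` is an assumption) builds the reduced fusion matrix $\Omega$ directly by drawing a Bernoulli indicator for each pair $(i, j)$, which is equivalent to applying $\Pi$ to $E$ and deleting the zero rows.

```python
import numpy as np

def build_reduced_fusion_matrix(n, alpha, rng=None):
    """Form the reduced fusion matrix Omega by Bernoulli row selection.

    Each of the n(n-1)/2 rows of E encodes a pairwise difference e_i - e_j;
    a row is kept with probability alpha, so Omega has roughly
    n(n-1)*alpha/2 rows.
    """
    rng = np.random.default_rng(rng)
    rows = []
    for i in range(n - 1):
        for j in range(i + 1, n):
            if rng.random() < alpha:          # pi_ij ~ Bernoulli(alpha)
                row = np.zeros(n)
                row[i], row[j] = 1.0, -1.0    # e_i - e_j
                rows.append(row)
    # Fall back to at least one row so downstream code never sees an empty Omega.
    if not rows:
        row = np.zeros(n)
        row[0], row[1] = 1.0, -1.0
        rows.append(row)
    return np.asarray(rows)                   # shape: (kappa, n)

# Example: keep about alpha = c*log(n)/n of the pairs, as in the simulations.
n = 60
alpha = 4 * np.log(n) / n
Omega = build_reduced_fusion_matrix(n, alpha, rng=0)
print(Omega.shape)
```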
We just consider $\tau = 2$ in (1) for simplicity and propose the random projection-based fusion criterion
$$\ell_p(\mu; \lambda) = \frac{1}{2}\sum_{i=1}^{n}\|X_i - \mu_i\|^2 + \sum_{i<j}\pi_{ij}\, p_{\lambda}(\|\mu_i - \mu_j\|), \qquad (2)$$
where $\mu = (\mu_1, \dots, \mu_n)^T \in \mathbb{R}^{n \times p}$. Furthermore, the objective function in (2) is equivalent to
$$\tilde{\ell}_p(\mu, \phi; \lambda) = \frac{1}{2}\|X - \mu\|_F^2 + \sum_{j=1}^{\kappa} p_{\lambda}(\|\phi_j\|), \quad \text{subject to } \Omega\mu - \phi = 0, \qquad (3)$$
where $X = (X_1, \dots, X_n)^T$ and $\phi = (\phi_1, \dots, \phi_{\kappa})^T \in \mathbb{R}^{\kappa \times p}$. Under the constraint in (3), the augmented Lagrangian $Q(\mu, \phi, \eta; \lambda)$ has the form
$$\tilde{\ell}_p(\mu, \phi; \lambda) + \sum_{j=1}^{\kappa}\eta_j^T(\mu^T\Omega_j - \phi_j) + \frac{\varphi}{2}\|\Omega\mu - \phi\|_F^2, \qquad (4)$$
where the dual variables $\eta = (\eta_1, \dots, \eta_{\kappa})^T \in \mathbb{R}^{\kappa \times p}$ are Lagrange multipliers and $\varphi$ is a tuning parameter. Given the iterative values $\mu^{(m)}$ and $\eta^{(m)}$ at the $m$th step, the Bernoulli distribution-based random projection ADMM (RP-ADMM) iterative algorithm computes the estimates of $(\phi, \eta, \mu)$ as follows:
$$\phi^{(m+1)} = \arg\min_{\phi} L(\phi, \mu^{(m)}, \eta^{(m)}; \lambda), \qquad (5)$$
$$\eta^{(m+1)} = \eta^{(m)} + \varphi\big(\Omega\mu^{(m)} - \phi^{(m+1)}\big), \qquad (6)$$
$$\mu^{(m+1)} = \arg\min_{\mu} Q(\mu, \phi^{(m+1)}, \eta^{(m+1)}; \lambda), \qquad (7)$$
where $L(\phi, \mu^{(m)}, \eta^{(m)}; \lambda)$ equals
$$\frac{\varphi}{2}\big\|\Omega\mu^{(m)} - \phi + \varphi^{-1}\eta^{(m)}\big\|_F^2 + \sum_{j=1}^{\kappa} p_{\lambda}(\|\phi_j\|), \qquad (8)$$
and $Q(\mu, \phi^{(m+1)}, \eta^{(m+1)}; \lambda)$ equals
$$\tilde{\ell}_p(\mu, \phi^{(m+1)}) + \frac{\varphi}{2}\big\|\Omega\mu - \phi^{(m+1)}\big\|_F^2 + \sum_{j=1}^{\kappa}\eta_j^{(m+1)T}\big(\mu^T\Omega_j - \phi_j^{(m+1)}\big). \qquad (9)$$
Ma and Huang (2017) [18] argued that, under (8), the element $\phi_j^{(m+1)}$ of $\phi^{(m+1)}$ is the minimizer of $\frac{\varphi}{2}\|\zeta_j^{(m)} - \phi_j\|^2 + p_{\lambda}(\|\phi_j\|)$, where $\zeta_j^{(m)} = \Omega_j^T\mu^{(m)} + \varphi^{-1}\eta_j^{(m)}$. For different thresholding operators induced by $p_{\lambda}(\cdot)$, the estimate $\phi_j^{(m+1)}$ takes different forms. For example:
  • For the Lasso penalty [25],
$$\phi_j^{(m+1)} = S(\zeta_j^{(m)}, \lambda/\varphi), \quad \text{where } S(w, t) = \begin{cases} (1 - t/\|w\|)\,w, & \text{if } t/\|w\| < 1;\\ 0, & \text{otherwise.}\end{cases}$$
  • For the SCAD penalty [26] with $a > 1/\varphi + 1$,
$$\phi_j^{(m+1)} = \begin{cases} S(\zeta_j^{(m)}, \lambda/\varphi), & \text{if } \|\zeta_j^{(m)}\| \le \lambda + \lambda/\varphi;\\ \zeta_j^{(m)}, & \text{if } \|\zeta_j^{(m)}\| > a\lambda;\\ \dfrac{S\big(\zeta_j^{(m)}, a\lambda/((a-1)\varphi)\big)}{1 - 1/((a-1)\varphi)}, & \text{otherwise.}\end{cases}$$
  • For the MCP [27] with $a > 1/\varphi$,
$$\phi_j^{(m+1)} = \begin{cases} \dfrac{S(\zeta_j^{(m)}, \lambda/\varphi)}{1 - 1/(a\varphi)}, & \text{if } \|\zeta_j^{(m)}\| \le a\lambda;\\ \zeta_j^{(m)}, & \text{otherwise.}\end{cases}$$
  • For the TLP [8] with $a > 1$,
$$\phi_j^{(m+1)} = \begin{cases} S(\zeta_j^{(m)}, \lambda/\varphi), & \text{if } \|\zeta_j^{(m)}\| \le a\lambda;\\ \zeta_j^{(m)}, & \text{otherwise.}\end{cases}$$
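As an illustration of these groupwise updates, the following sketch (our own, with hypothetical function names) implements the group soft-thresholding operator $S(w, t)$ together with the MCP and TLP rules displayed above; the default value $a = 3$ is an assumption for illustration only.

```python
import numpy as np

def soft_threshold(w, t):
    """Group soft-thresholding S(w, t) = (1 - t/||w||)_+ * w."""
    norm = np.linalg.norm(w)
    if norm == 0.0 or norm <= t:
        return np.zeros_like(w)
    return (1.0 - t / norm) * w

def mcp_update(zeta, lam, varphi, a=3.0):
    """phi_j update for the MCP penalty (requires a > 1/varphi)."""
    if np.linalg.norm(zeta) <= a * lam:
        return soft_threshold(zeta, lam / varphi) / (1.0 - 1.0 / (a * varphi))
    return zeta.copy()

def tlp_update(zeta, lam, varphi, a=3.0):
    """phi_j update for the truncated Lasso penalty (requires a > 1)."""
    if np.linalg.norm(zeta) <= a * lam:
        return soft_threshold(zeta, lam / varphi)
    return zeta.copy()
```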
Through some algebra, the problem in (9) is equivalent to the minimization of the function $h(\mu, \phi^{(m+1)}, \eta^{(m+1)})$, which has the form
$$\frac{1}{2}\|X - \mu\|_F^2 + \frac{\varphi}{2}\big\|\Omega\mu - \phi^{(m+1)} + \varphi^{-1}\eta^{(m+1)}\big\|_F^2.$$
Given the values of $\phi^{(m+1)}$ and $\eta^{(m+1)}$, the updated $\mu^{(m+1)}$ is
$$\mu^{(m+1)} = (\varphi\Omega^T\Omega + I_n)^{-1}\Big(X + \varphi\Omega^T\big(\phi^{(m+1)} - \varphi^{-1}\eta^{(m+1)}\big)\Big),$$
where $I_n$ is the $n \times n$ identity matrix. $\mu^{(m+1)}$ and $\phi^{(m+1)}$ are updated according to the random projection ADMM iterations (5)–(7) until some convergence criterion is met; in practice, we stop when both the dual and primal residuals are close to zero [28]. The convergence time of ADMM is highly sensitive to the penalty parameter $\varphi$: a poor selection of $\varphi$ can result in slow convergence of the ADMM algorithm [29], and thus of RP-ADMM. In this paper, we fix $\varphi = 1$ throughout for simplicity.
To facilitate the updates of $(\phi^{(m+1)}, \eta^{(m+1)}, \mu^{(m+1)})$ at the $(m+1)$th step in (5)–(7) of the RP-ADMM iterative algorithm, we need to specify a proper initial value (warm start). Here, we set $\eta^{(0)} = 0$ and $\phi^{(0)} = \Omega\mu^{(0)}$, and obtain the initial estimator $\mu^{(0)} = (\lambda\Omega^T\Omega + I_n)^{-1}X$ as the minimizer of the ridge fusion criterion
$$\frac{1}{2}\|X - \mu\|_F^2 + \frac{\lambda}{2}\|\Omega\mu\|^2.$$
We summarize the above analysis in Algorithm 1.
Algorithm 1 RP-ADMM for fusion clustering
Input: data $X_1, \dots, X_n$; initial values $\mu^{(0)}, \eta^{(0)}$; tuning parameter $\lambda$
Output: an estimate of $\mu$
for $m = 0, 1, 2, \dots$ do
  compute $\phi^{(m+1)}$ using (5)
  compute $\eta^{(m+1)}$ using (6)
  compute $\mu^{(m+1)}$ using (7)
  if the convergence criterion is met then
    stop and denote the last iterate by $\hat{\mu}(\lambda)$
  else
    set $m = m + 1$
  end if
end for
In practice, to save computing time, we do not run the RP-ADMM updates all the way to convergence in the first iterations. Another useful trick is to initialize each subsequent convex relaxation at the optimal values from the previous relaxed convex problem, which significantly reduces the number of RP-ADMM iterations.
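For concreteness, the sketch below assembles the pieces into a minimal RP-ADMM loop: the ridge warm start, the MCP-based $\phi$-update, the dual update (6), and the linear-system $\mu$-update, with the system matrix factorized once and reused. It is our own simplified rendering (with $\varphi = 1$, a fixed iteration budget, and a crude stopping rule rather than the primal/dual residual criterion), and it reuses the `build_reduced_fusion_matrix` and `mcp_update` helpers sketched earlier.

```python
import numpy as np

def rp_admm(X, lam, alpha, a=3.0, max_iter=100, tol=1e-6, rng=None):
    """Minimal RP-ADMM sketch with varphi fixed at 1.

    Assumes build_reduced_fusion_matrix and mcp_update from the earlier sketches.
    """
    n, p = X.shape
    Omega = build_reduced_fusion_matrix(n, alpha, rng=rng)      # (kappa, n)
    kappa = Omega.shape[0]

    # Ridge warm start: mu(0) = (lam * Omega^T Omega + I_n)^{-1} X.
    mu = np.linalg.solve(lam * Omega.T @ Omega + np.eye(n), X)
    eta = np.zeros((kappa, p))
    phi = Omega @ mu

    # Factorize (Omega^T Omega + I_n) once; it is reused in every mu-update.
    chol = np.linalg.cholesky(Omega.T @ Omega + np.eye(n))

    for _ in range(max_iter):
        mu_old = mu
        # phi-update (5): groupwise MCP thresholding of zeta_j = Omega_j mu + eta_j.
        zeta = Omega @ mu + eta
        phi = np.vstack([mcp_update(zeta[j], lam, 1.0, a) for j in range(kappa)])
        # eta-update (6) with varphi = 1.
        eta = eta + (Omega @ mu - phi)
        # mu-update (7): solve (Omega^T Omega + I_n) mu = X + Omega^T (phi - eta).
        rhs = X + Omega.T @ (phi - eta)
        mu = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))
        if np.linalg.norm(mu - mu_old) < tol * max(1.0, np.linalg.norm(mu_old)):
            break
    return mu, Omega
```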

2.2. Selection of Optimal Tuning Parameter

For a given $\lambda$, the converged value $\hat{\mu}(\lambda)$ of the above RP-ADMM procedure is defined as
$$\hat{\mu}(\lambda) = \arg\min_{\mu}\, \ell_p(\mu; \lambda),$$
where $\ell_p(\mu; \lambda)$ is defined in (2), and the optimal value of $\lambda$ can be selected via a properly constructed data-driven criterion. In particular, we partition the support of $\lambda$ into a grid $\lambda_{\min} = \lambda_0 < \lambda_1 < \dots < \lambda_J = \lambda_{\max}$, and for each $\lambda_j$ we compute a solution path $\hat{\mu}(\lambda_j)$ and obtain $\hat{K}(\lambda_j)$ distinct cluster centroids $\{\hat{\rho}_1(\lambda_j), \dots, \hat{\rho}_{\hat{K}(\lambda_j)}(\lambda_j)\}$. The optimal $\hat{\lambda}$ is selected by minimizing a data-driven BIC, i.e., $\hat{\lambda} = \arg\min_{\lambda_j,\, j = 1, \dots, J} \mathrm{BIC}(\lambda_j)$, where
$$\mathrm{BIC}(\lambda) = \log\Big(\frac{1}{np}\|X - \hat{\mu}(\lambda)\|_F^2\Big) + \big(\log(np) + 2\log(p)\big)\,\hat{K}(\lambda)/n. \qquad (12)$$
Subsequently, we obtain the estimator $\hat{\mu} = \hat{\mu}(\hat{\lambda})$, and the individuals can be separated into $\hat{K} = \hat{K}(\hat{\lambda})$ clusters accordingly, i.e., $\hat{G}_k = \{i : \hat{\mu}_i = \hat{\rho}_k,\ i = 1, \dots, n\}$, $k = 1, \dots, \hat{K}$.
Other methods for tuning parameter selection in clustering, such as generalized degrees of freedom with generalized cross-validation [8] and stability-based cross-validation [25,30], can provide good results but may require extensive computation or the specification of a hyperparameter perturbation size [8]. In contrast, the proposed BIC is easy to compute and performs well in estimating the cluster centroids and the true number of clusters $K$. Figure 1 shows the BIC values plotted against $\log(\lambda)$ and the estimated cluster number in the simulation. Across all cases with different values of $n$ and $p$, we observe that BIC($\lambda$) first decreases as $\log(\lambda)$ increases and reaches its minimum at the optimal $\hat{\lambda}$, which recovers the true cluster number $K = 3$. As $\log(\lambda)$ increases further, the cluster centroids continue to be merged and BIC($\lambda$) increases again. However, further research is needed to fully establish the consistency of the BIC in combination with the objective function (2).
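A minimal sketch of this BIC-based selection is given below; it is our own illustration, assuming the `rp_admm` sketch above and estimating $\hat{K}(\lambda)$ by counting distinct centroid rows after rounding (the rounding tolerance is an assumption, not part of the paper's procedure).

```python
import numpy as np

def bic_value(X, mu_hat, K_hat):
    """BIC(lambda) = log(||X - mu_hat||_F^2 / (n p)) + (log(np) + 2 log p) * K_hat / n, as in (12)."""
    n, p = X.shape
    rss = np.linalg.norm(X - mu_hat, "fro") ** 2
    return np.log(rss / (n * p)) + (np.log(n * p) + 2.0 * np.log(p)) * K_hat / n

def select_lambda(X, lambda_grid, alpha, round_decimals=3, rng=None):
    """Run RP-ADMM over a grid of lambda values and keep the BIC-minimizing fit."""
    best = None
    for lam in lambda_grid:
        mu_hat, _ = rp_admm(X, lam, alpha, rng=rng)
        # Estimate K(lambda) as the number of distinct (rounded) centroid rows.
        K_hat = np.unique(np.round(mu_hat, round_decimals), axis=0).shape[0]
        score = bic_value(X, mu_hat, K_hat)
        if best is None or score < best[0]:
            best = (score, lam, mu_hat, K_hat)
    return best  # (BIC, lambda_hat, mu_hat, K_hat)
```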

2.3. Recursive RP-ADMM and Cluster Matrix

In the above cluster analysis, the effect of the randomness of the projection on the clustering results was not considered. Empirical analysis has shown that the impact of this randomness on the estimated cluster centers and their number (i.e., the $\hat{\rho}_k$'s and $\hat{K}$) is minimal. However, the impact on the final partitioning results (i.e., which observations are grouped into a single cluster) can be significant. In response to this, we propose the Recursive RP-ADMM (RRP-ADMM) procedure, which performs multiple RP-ADMM cluster analyses by generating $M$ random matrices (i.e., $\Omega_m$, $m = 1, \dots, M$) and repeatedly conducting the analysis.
Once the multiple RP-ADMM cluster analyses have been completed, we must summarize the results. We define an $n \times n$ symmetric cluster matrix $C$, where $C_{ij} = 1$ denotes that the $i$th and $j$th observations belong to the same cluster and $C_{ij} = 0$ otherwise. Another $n \times n$ symmetric matrix $\hat{D}$ is introduced, with element $\hat{D}_{ij}$ representing the relative frequency with which the $i$th and $j$th observations belong to the same cluster over the $M$ independent RP-ADMM clustering procedures. The decision of whether the $i$th and $j$th observations should be grouped into a single cluster can then be treated as a classification problem, with the two possible class labels being 1 (belong to the same cluster) or 0 (do not belong to the same cluster). We use an indicator function to transform the relative frequency into class labels and generate an estimator of the cluster matrix, $\hat{C}$, i.e.,
$$\hat{C} = \big\{\hat{C}_{ij} : \hat{C}_{ij} = 1(\hat{D}_{ij} \ge 0.5)\big\}, \qquad (13)$$
where 1 ( · ) denotes the indicator function. We summarize the above procedure in Algorithm 2. This transformation can be understood as a voting-based aggregation strategy, similar to the one proposed by [31], which aims to reduce misclassification errors and improve the accuracy of the clustering. To evaluate the accuracy of the clustering results, we define a new measure called the similarity index (SI) between two data clusterings:
$$\mathrm{SI} = \frac{1}{n^2 - n}\|\hat{C} - C\|_1 = \frac{1}{n^2 - n}\sum_{i=1}^{n}\sum_{j=1}^{n}\big|\hat{C}_{ij} - C_{ij}\big|. \qquad (14)$$
Like the Rand Index (RI) [32], the newly introduced criterion can be viewed as a measure of the proportion of pairwise clustering decisions made by an algorithm; however, the SI counts disagreements with the true cluster matrix, so its values range from 0 to 1 with lower values indicating better algorithm performance.
Algorithm 2 RRP-ADMM for fusion clustering
Input: data $X_1, \dots, X_n$; recursion number $M$; initial values $\mu^{(0)}, \eta^{(0)}$; tuning parameter $\lambda$
Output: an estimate of the cluster matrix $\hat{C}$
for $m = 1, 2, \dots, M$ do
  compute $\hat{\mu}^{(m)}$ using RP-ADMM
end for
for all pairs $1 \le i, j \le n$ do
  compute $\hat{D}_{ij}$ and $\hat{C}_{ij}$ from (13)
end for
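The voting-based aggregation and the SI evaluation can be sketched as follows; this is again our own minimal illustration, which assumes that each RP-ADMM run has been converted into an integer label vector (e.g., by grouping identical estimated centroids).

```python
import numpy as np

def cluster_matrix(labels):
    """C_ij = 1 if observations i and j share a cluster label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def aggregate_runs(label_runs):
    """Co-clustering frequency D_hat over the M runs and the voted C_hat as in (13)."""
    D_hat = np.mean([cluster_matrix(lab) for lab in label_runs], axis=0)
    C_hat = (D_hat >= 0.5).astype(float)
    return D_hat, C_hat

def similarity_index(C_hat, C_true):
    """SI as in (14): normalized L1 distance between estimated and true cluster matrices."""
    n = C_true.shape[0]
    return np.abs(C_hat - C_true).sum() / (n ** 2 - n)

# Toy example with M = 3 hypothetical label vectors for n = 5 observations.
runs = [[0, 0, 1, 1, 1], [0, 0, 1, 1, 0], [0, 0, 1, 1, 1]]
D_hat, C_hat = aggregate_runs(runs)
print(similarity_index(C_hat, cluster_matrix([0, 0, 1, 1, 1])))
```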
The classical convex or concave fusion clustering procedure in (1) requires $O(n^2 p)$ operations and $O(n^2 p + np)$ of storage for a single round of ADMM updates with primal and dual residual calculations, because all pairs of centroids are shrunk together in this method.
The RP-ADMM algorithm significantly improves computational efficiency compared with the classical ADMM algorithm. It requires only $O(\kappa p + np)$ of storage, compared to $O(n^2 p + np)$ for the classical ADMM algorithm, because the variables $\eta$ and $\phi$ have only $\kappa$ rows rather than $n(n-1)/2$. Additionally, the RP-ADMM algorithm requires only $O(\kappa p)$ operations for its most computationally demanding step, in comparison to $O(n^2 p)$ for the classical ADMM algorithm. The RP-ADMM algorithm also requires only $O(\kappa n)$ operations for the Cholesky factorization, in comparison to $O(n^3)$ for the classical ADMM algorithm; this factorization is computed only once and reused across the repeated RP-ADMM updates.
At the end of this subsection, we will demonstrate the convergence of the RP-ADMM algorithm by showing that the sequence generated by the algorithm contains a subsequence that converges to a stationary point.
Lemma 1.
Let $\{\mu^{(m)}, \phi^{(m)}, \eta^{(m)}\}_{m=1}^{\infty}$ be the sequence generated by Algorithm 1. Then, for some constant $c > 0$,
$$Q(\mu^{(m+1)}, \phi^{(m+1)}, \eta^{(m+1)}) - Q(\mu^{(m)}, \phi^{(m)}, \eta^{(m)}) \le -\frac{c}{2}\big\|\mu^{(m+1)} - \mu^{(m)}\big\|^2 + \psi\big\|\eta^{(m+1)} - \eta^{(m)}\big\|^2.$$
In order to prove that the sequence $\{\mu^{(m)}, \phi^{(m)}, \eta^{(m)}\}_{m=1}^{\infty}$ is convergent, we need to assume that $\phi^{(m)}$ is bounded and that $\psi\|\eta^{(m+1)} - \eta^{(m)}\| \to 0$, both of which are often observed in numerical tests.
Theorem 1.
If $\{\phi^{(m)}\}_{m=1}^{\infty}$ is bounded and $\frac{\psi}{2}\|\nu^{(m+1)} - \nu^{(m)}\|_F + \psi^{-1}\|\eta^{(m+1)} - \eta^{(m)}\|_F \to 0$, then $\{\mu^{(m)}, \phi^{(m)}, \eta^{(m)}\}_{m=1}^{\infty}$ is bounded. Moreover, there exists a subsequence $\{\mu^{(k_j)}, \phi^{(k_j)}, \eta^{(k_j)}\}_{k_j=1}^{\infty}$ such that
$$\lim_{k_j \to \infty}\Big(\big\|\mu^{(k_j+1)} - \mu^{(k_j)}\big\| + \big\|\phi^{(k_j+1)} - \phi^{(k_j)}\big\| + \big\|\eta^{(k_j+1)} - \eta^{(k_j)}\big\|\Big) = 0,$$
and thus $\{\mu^{(m)}, \phi^{(m)}, \eta^{(m)}\}_{m=1}^{\infty}$ has a subsequence which converges to a stationary point.

3. Simulation

In this part of the study, simulation experiments were conducted to compare the extended and classical ADMM clustering algorithms in terms of computational time and clustering accuracy, using the evaluation criterion in (14). The Lasso-based fusion method imposes only a minor penalty on small differences in $\phi_j$ and therefore often leads to dense solutions with many spurious clusters that differ only slightly from one another [6]. In contrast, the concave penalty methods tend to produce a clear cluster structure and a well-defined number of clusters [8]. Therefore, in this study, we focus on the MCP-based fusion method [27] when comparing the clustering performance of the conventional ADMM algorithm with that of the proposed new ADMM algorithms.

3.1. Low-Dimensional Setting

In this part, we evaluated the clustering performance of the classical ADMM, RP-ADMM, and RRP-ADMM algorithms on low-dimensional synthetic data generated from three overlapping convex clusters with the same spherical shape for several choices of dimension $p$ and sample size $n$. The synthetic data were generated from three populations $P_k = N(\rho_k, \Sigma)$, $k = 1, \dots, K$, with $K = 3$, $\rho_1 = -3\cdot\mathbf{1}_p$, $\rho_2 = \mathbf{0}_p$, $\rho_3 = 3\cdot\mathbf{1}_p$, and $\Sigma = (\sigma_{kj})_{p \times p}$ with $\sigma_{jj} = 1$ and $\sigma_{kj} = 0.1^{|k-j|}$ for $k \neq j$. This setting was chosen deliberately to allow overlap between samples generated from clusters proximal to each other, thereby increasing the complexity of the clustering task. As illustrated in Figure 2c, the clustering performance using a single random projection (RP-ADMM) was suboptimal, indicating challenges with cluster separability under this setup. Conversely, Figure 2b demonstrates that recursive random projection (RRP-ADMM) significantly improved the clustering results. The number of recursions for the RRP-ADMM algorithm was set to $M = 10$.
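For reference, this low-dimensional setting can be generated as in the sketch below (our own rendering of the stated design; reading the first centroid as $-3\cdot\mathbf{1}_p$ is our interpretation of the symmetric three-cluster layout, and the helper name is hypothetical).

```python
import numpy as np

def make_clusters(n_per_cluster, p, centers, decay=0.1, rng=None):
    """Draw Gaussian clusters with covariance sigma_kj = decay^{|k - j|} (so sigma_jj = 1)."""
    rng = np.random.default_rng(rng)
    idx = np.arange(p)
    Sigma = decay ** np.abs(idx[:, None] - idx[None, :])
    X, labels = [], []
    for k, center in enumerate(centers):
        X.append(rng.multivariate_normal(center * np.ones(p), Sigma, size=n_per_cluster))
        labels.extend([k] * n_per_cluster)
    return np.vstack(X), np.asarray(labels)

# Three overlapping clusters as in Section 3.1: centers -3, 0, and 3 in every coordinate.
X, labels = make_clusters(n_per_cluster=20, p=5, centers=[-3.0, 0.0, 3.0], rng=0)
print(X.shape)  # (60, 5)
```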
To evaluate the accuracy of the RP-ADMM, relax-and-split [33] (RS-ADMM), and RRP-ADMM algorithms in recovering the true cluster matrix, we generated a random sample of $n = 60$ observations, with observations 1–20 drawn from $P_1$, 21–40 drawn from $P_2$, and 41–60 drawn from $P_3$, and set the number of dimensions to $p = 5$. The probability $\alpha$ of generating a 1 in the random matrix was set to $\alpha = c\log(n)/n$, where $c$ controls the probability size. The level plots in Figure 2 use colour to visualize the values of 1's and 0's in the cluster matrix. The results show that both RP-ADMM and RRP-ADMM can accurately recover the true cluster matrix, with RRP-ADMM reproducing the gradation of the true cluster matrix more accurately. A single random projection (RP-ADMM) can cause high variance in clustering outcomes due to the randomness of the sampling process. To mitigate this issue, we adopt the voting-based pooling technique [31], which reduces variance by averaging results over the recursive random projections (RRP-ADMM).
To further evaluate the performance of the algorithms, we calculated the values of the index SI defined in (14) after 100 replicates under different c choices. We depicted the results as boxplots in Figure 3. These results show that RRP-ADMM consistently improves clustering accuracy compared to RP-ADMM, as evidenced by the smaller median and standard error of SI values.
Next, we compare the performance of the classical ADMM and RRP-ADMM in terms of computation time per iteration and the SI over 100 trials. The sample size is varied over $n = 60, 150, 240, 360$ with $\alpha = 4\log(n)/n$, while $p = 2$ is kept constant. We limited the number of points to 360 because the classical ADMM algorithm requires a significant amount of computation time for a single realization with more points. We also compare the Similarity Index (SI) and the Rand Index (RI) in evaluating the clustering results, which requires recovering the partitioning structure of all points from the estimated cluster matrix graph. This is done greedily: we first identify the point $a_1$ with the most neighbours and aggregate the points connected to $a_1$ as cluster 1, then find the second point $a_2$ with the most edges among the remaining points to form cluster 2, and repeat this process until no points remain.
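The greedy extraction of a partition from the estimated cluster matrix described above can be sketched as follows (our own reading of the procedure: repeatedly seed a cluster at the unassigned point with the most remaining neighbours and absorb the points linked to it).

```python
import numpy as np

def partition_from_cluster_matrix(C_hat):
    """Greedy partition extraction from an estimated cluster matrix."""
    n = C_hat.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    while (labels < 0).any():
        unassigned = np.where(labels < 0)[0]
        # Count neighbours among the still-unassigned points only.
        degrees = C_hat[np.ix_(unassigned, unassigned)].sum(axis=1)
        seed = unassigned[np.argmax(degrees)]
        members = unassigned[C_hat[seed, unassigned] > 0]
        labels[members] = current
        labels[seed] = current
        current += 1
    return labels
```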
Table 1 shows the mean values of the SI, RI, and the consumed time in seconds for different sample sizes under different methods after 100 replicates. Based on the data in Table 1, we can observe the following: (i) The proposed RRP-ADMM significantly reduces the time required for convex or concave fusion clustering, especially when the sample size increases. (ii) RRP-ADMM produces smaller SI and larger RI values, possibly due to the voting-based pooling technique improving cluster accuracy. (iii) As the sample size increases, the SI and RI values decrease. The boxplots in Figure 4 and Figure 5 demonstrate the superiority of the RRP-ADMM algorithm over the classical ADMM algorithm in terms of both the SI values and the square root of run time, as seen in the results obtained from 100 replicates with four different sample sizes. These results further reinforce our belief in the effectiveness of the RRP-ADMM algorithm.

3.2. High-Dimensional Setting

In this part, we investigate the double random projection-based alternating direction method of multipliers (DRP-ADMM and DRRP-ADMM) algorithms for clustering high-dimensional data sets. We employ a recursive Gaussian distribution-based random projection strategy in the first step to mitigate the impact of randomness on the cluster results. Since the classical ADMM algorithm is computationally intensive in high-dimensional settings, we focus on evaluating the performance of the DRP-ADMM and DRRP-ADMM algorithms with recursion number $M = 9$, using three Gaussian random projections in the outer layer and three binary random projections in the inner layer. The simulated data sets consist of two overlapping convex clusters with the same spherical shape. They are generated from the populations $P_k = N(\rho_k, \Sigma)$, $k = 1, 2$, with $\rho_1 = -\mathbf{1}_p$ and $\rho_2 = \mathbf{1}_p$. Furthermore, $\Sigma = (\sigma_{kj})_{p \times p}$ with $\sigma_{jj} = 1$ and $\sigma_{kj} = 0.1^{|k-j|}$ for $k \neq j$. We consider four high-dimensional cases with $p = 1000, 2000, 3000, 5000$ and a fixed sample size of $n = 100$.
We evaluate the accuracy of the DRP-ADMM and DRRP-ADMM algorithms in recovering the true cluster matrix. To do this, we first generate a Gaussian random matrix $R$ with dimensions $p \times q$ for the first projection, whose elements are drawn from $N(0, 1/q)$. We set $q = \kappa\log(n)/(\varepsilon^2/2 - \varepsilon^3/3)$ with $\varepsilon = 1$ and $\kappa = 5$–$6$; see [21,23] for this choice of projection dimension. In the second step, we generate a diagonal binary random matrix with probability $\alpha = 4\log(n)/n$ of an entry equaling one. We then calculate the values of the SI index defined in Equation (14) and plot the results as boxplots in Figure 6 after 100 replicates for different values of $p$. The results show that the DRRP-ADMM algorithm consistently outperforms the DRP-ADMM algorithm in terms of the median and standard error of the SI values for all values of $p$, indicating that the DRRP-ADMM algorithm improves clustering accuracy.
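A sketch of the double projection is given below (our own illustration: an outer Gaussian projection reduces the feature dimension from $p$ to $q$ before the Bernoulli-based RP-ADMM of Section 2.1 is run on the projected data, and the runs are pooled with the voting scheme of Section 2.3; it reuses the `rp_admm` and `aggregate_runs` helpers sketched earlier, and the label-extraction step by rounding centroids is an assumption).

```python
import numpy as np

def gaussian_projection(X, q, rng=None):
    """Project n x p data to n x q using R with i.i.d. N(0, 1/q) entries."""
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    R = rng.normal(0.0, np.sqrt(1.0 / q), size=(p, q))
    return X @ R

def drrp_admm_cluster_matrix(X, lam, alpha, q, n_outer=3, n_inner=3, rng=None):
    """Double recursive random projection: Gaussian outer layer, Bernoulli inner layer."""
    rng = np.random.default_rng(rng)
    label_runs = []
    for _ in range(n_outer):
        Z = gaussian_projection(X, q, rng=rng)             # first (Gaussian) projection
        for _ in range(n_inner):
            mu_hat, _ = rp_admm(Z, lam, alpha, rng=rng)    # second (Bernoulli) projection inside
            labels = np.unique(np.round(mu_hat, 3), axis=0, return_inverse=True)[1]
            label_runs.append(labels)
    return aggregate_runs(label_runs)                       # (D_hat, C_hat)
```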

4. Real Data Analysis

In this study, we use the DrivFace dataset to demonstrate the effectiveness of our proposed clustering procedure. The DrivFace database consists of $n = 606$ images of $640 \times 480$ pixels each, captured from four drivers (two women and two men) over different days, with $p = 17$ facial features such as glasses and beards. The images of each driver, which contain similar facial features, can be grouped into one cluster, resulting in a total of $K = 4$ clusters, as shown in Figure 7a. This dataset is well suited for evaluating our proposed clustering method for two reasons. Firstly, we know the true labels: there are four clusters, and we know which observations belong to each cluster. Secondly, because the similarity among observations is very high across different clusters, separating them is challenging.
Due to the large sample size of the DrivFace dataset, we do not use the classical ADMM algorithm, which would require $606 \times (606-1) \times 17/2$ operations in a single ADMM iteration. Instead, we first scale the samples by each feature and apply the RP-ADMM procedure to estimate the individual centers over a grid of $\lambda$ values. We plot the fusiongrams of four selected variables in Figure 8; scrutiny of Figure 8a implies that some outlying (influential) points cause the clusters to be dense. We then remove these 55 points and plot a new fusiongram in Figure 8b. The optimal $\lambda$ value, as determined by the developed BIC criterion in Equation (12), is 1.38, and the corresponding estimated number of clusters is four, the same as the number of drivers. We then apply the proposed RRP-ADMM algorithm with the Bernoulli-distribution-based random projection procedure to further improve the clustering accuracy, using $\alpha = 10\log(n)/n$ and a recursion number of $M = 20$. Using the estimated optimal tuning parameter of 1.38, we obtain the estimated cluster matrix in Figure 7b, which closely resembles the true cluster matrix in Figure 7a. The calculated similarity index (SI) value is 0.098, and the Adjusted Rand Index (ARI) is 0.672.

5. Conclusions

We propose using the recursive random projection-based ADMM (RRP-ADMM) method to improve the speed and accuracy of convex and nonconvex fusion clustering. In simulations and real data examples, the RRP-ADMM method demonstrates superior performance in both computational speed and clustering accuracy. The RRP-ADMM algorithm is scalable and can be applied to heterogeneity issues in any setting that involves fusion techniques.
However, some challenges remain. One challenge is how to efficiently transform the cluster matrix graph into the target partitioning structure and determine the optimal number of clusters. Another is how to use prior information about which points are more likely to belong to a single cluster in order to reduce the number of pairwise comparisons. Additionally, further theoretical study is needed on how to choose the Bernoulli probability in the binary random projection so that the true clustering is recovered with probability one. Another future research direction is performing clustering simultaneously with feature selection, using techniques such as incorporating feature weights [34] or introducing sparsity [14].

Author Contributions

Conceptualization, Y.Z.; Methodology, L.K.; Software, H.W.; Formal analysis, J.Y.; Writing—original draft, X.Y.; Writing—review & editing, B.J.; Supervision, N.L. All authors have read and agreed to the published version of the manuscript.

Funding

Xiaodong Yan was supported by National Key R&D Program of China (No. 2023YFA1008701), the National Natural Science Foundation of China (No. 12371292), the National Statistical Science Research Project (No. 2022LY080) and Jinan Science and Technology Bureau (No. 2021GXRC056). Na Li was supported by grants from the National Natural Science Foundation of China (No. 12171279), and the China Academy of Engineering Science and Technology Development Strategy Shandong Research Institute Consulting Research Project (No. 202302SDZD04). Hongni Wang was supported by the State Scholarship Fund from China Scholarship Council (No. 202208370132). Bei Jiang and Linglong Kong were partially supported by grants from the Canada CIFAR AI Chairs program, the Alberta Machine Intelligence Institute (AMII), and the Natural Sciences and Engineering Research Council of Canada (NSERC), and Linglong Kong was also partially supported by grants from the Canada Research Chair program from NSERC.

Data Availability Statement

The DrivFace dataset is publicly available at UC Irvine Machine Learning Repository https://archive.ics.uci.edu/dataset/378/drivface, accessed on 11 March 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proof of Lemma 1

By the objective function,
$$Q(\mu^{(m+1)}, \phi^{(m+1)}, \eta^{(m+1)}) - Q(\mu^{(m+1)}, \phi^{(m+1)}, \eta^{(m)}) = \psi\big\|\eta^{(m+1)} - \eta^{(m)}\big\|^2 \qquad (A1)$$
and
$$Q(\mu^{(m+1)}, \phi^{(m+1)}, \eta^{(m)}) - Q(\mu^{(m+1)}, \phi^{(m)}, \eta^{(m)}) \le 0. \qquad (A2)$$
Moreover, $\mu \mapsto Q(\mu, \phi^{(m)}, \eta^{(m)})$ is strongly convex, as the Hessian matrix $(\psi\Omega^T\Omega + I_{np})$ is positive definite, so there exists a constant $c > 0$ such that the following inequality holds:
$$Q(\mu^{(m+1)}, \phi^{(m)}, \eta^{(m)}) - Q(\mu^{(m)}, \phi^{(m)}, \eta^{(m)}) \le -\frac{c}{2}\big\|\mu^{(m+1)} - \mu^{(m)}\big\|^2. \qquad (A3)$$
Summing (A1)–(A3), we obtain the result of Lemma 1.

Appendix B. Proof of Theorem 1

Since $\{\phi^{(m)}\}_{m=1}^{\infty}$ is bounded, $\mu^{(m)}$ is also bounded, so $Q(\mu^{(m)}, \phi^{(m)}, \eta^{(m)})$ and $\{\mu^{(m)}, \phi^{(m)}, \eta^{(m)}\}_{m=1}^{\infty}$ are bounded. For convenience, we write
$$L^{(m)} := Q(\mu^{(m)}, \phi^{(m)}, \eta^{(m)}), \quad y^{(m)} := \frac{c}{2}\big\|\mu^{(m+1)} - \mu^{(m)}\big\|^2, \quad z^{(m)} := \big\|\eta^{(m+1)} - \eta^{(m)}\big\|^2.$$
Since $L^{(m)}$ is bounded, there exists a subsequence $\{L^{(k_j)}\}$ such that
$$\lim_{k_j \to \infty} L^{(k_j)} = \liminf_{m \to \infty} L^{(m)}.$$
By Lemma 1 and $\lim_{m \to \infty} z^{(m)} = 0$, we have
$$\liminf_{k_j \to \infty} y^{(k_j)} \le \liminf_{k_j \to \infty}\big(L^{(k_j)} - L^{(k_j+1)} + z^{(k_j)}\big) = \liminf_{m \to \infty} L^{(m)} - \liminf_{k_j \to \infty} L^{(k_j+1)} \le 0.$$
As $y^{(k_j)} \ge 0$, $\liminf_{k_j \to \infty} y^{(k_j)} = 0$, which means
$$\liminf_{k_j \to \infty}\big\|\mu^{(k_j+1)} - \mu^{(k_j)}\big\| = 0;$$
together with $\|\eta^{(m+1)} - \eta^{(m)}\| \to 0$, we have
$$\liminf_{k_j \to \infty}\big\|\phi^{(k_j+1)} - \phi^{(k_j)}\big\| = 0.$$
The sequence $\{\mu^{(m)}, \phi^{(m)}, \eta^{(m)}\}_{m=1}^{\infty}$ therefore has a subsequence $\{\mu^{(k_j)}, \phi^{(k_j)}, \eta^{(k_j)}\}_{k_j=1}^{\infty}$ which converges to a point $\{\mu^*, \phi^*, \eta^*\}$. Then, we have
$$\Omega_j\mu^* - \phi_j^* = 0, \quad 1 \le j \le \kappa.$$
Moreover, the procedure for solving the objective function satisfies the following optimality system:
$$\mu^{(m+1)} - X + \psi\Omega^T\Big(\Omega\mu^{(m+1)} - \phi^{(m)} + \frac{\eta^{(m)}}{\psi}\Big) = 0, \qquad 0 \in \psi\Big(\Omega_j\mu^{(m+1)} - \phi_j^{(m+1)} + \frac{\eta_j^{(m+1)}}{\psi}\Big) + \frac{\partial p_{\lambda}(\|\phi_j\|)}{\partial \phi_j}\bigg|_{\phi_j = \phi_j^{(m+1)}}.$$
So,
$$\mu^* - X - \Omega^T\eta^* = 0, \qquad 0 \in \eta_j^* + \frac{\partial p_{\lambda}(\|\phi_j\|)}{\partial \phi_j}\bigg|_{\phi_j = \phi_j^*}.$$
Therefore, $\{\mu^*, \phi^*, \eta^*\}$ is a KKT point of the objective function, which completes the proof.

References

  1. Haq, M.A. CDLSTM: A novel model for climate change forecasting. Comput. Mater. Contin. 2022, 71, 2. [Google Scholar] [CrossRef]
  2. Haq, M.A. SMOTEDNN: A novel model for air pollution forecasting and AQI classification. Comput. Mater. Contin. 2022, 71, 1. [Google Scholar] [CrossRef]
  3. Van Der Kloot, W.A.; Spaans, A.M.J.; Heiser, W.J. Instability of hierarchical cluster analysis due to input order of the data: The PermuCLUSTER solution. Psychol. Methods 2005, 10, 468. [Google Scholar] [CrossRef] [PubMed]
  4. Xu, R.; Wunsch, D. Survey of clustering algorithms. IEEE Trans. Neural Netw. 2005, 16, 645–678. [Google Scholar] [CrossRef] [PubMed]
  5. Yang, X.; Yan, X.; Huang, J. High-dimensional integrative analysis with homogeneity and sparsity recovery. J. Multivar. Anal. 2019, 174, 104529. [Google Scholar] [CrossRef]
  6. Chi, E.C.; Lange, K. Splitting methods for convex clustering. J. Comput. Graph. Stat. 2015, 24, 994–1013. [Google Scholar] [CrossRef] [PubMed]
  7. Lindsten, F.; Ohlsson, H.; Ljung, L. Clustering using sum-of-norms regularization: With application to particle filter output computation. In Proceedings of the 2011 IEEE Statistical Signal Processing Workshop (SSP), Nice, France, 28–30 June 2011; pp. 201–204. [Google Scholar] [CrossRef]
  8. Pan, W.; Shen, X.; Liu, B. Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty. J. Mach. Learn. Res. 2013, 14, 1865. [Google Scholar]
  9. Yang, X.; Yan, X. Mechanism and a new algorithm for nonconvex clustering. J. Stat. Comput. Sim. 2020, 90, 719–746. [Google Scholar] [CrossRef]
  10. Paul, D.; Chakraborty, S.; Das, S.; Xu, J. Implicit annealing in kernel spaces: A strongly consistent clustering approach. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5862–5871. [Google Scholar] [CrossRef]
  11. Shah, S.A.; Koltun, V. Robust continuous clustering. Proc. Natl. Acad. Sci. USA 2017, 114, 9814–9819. [Google Scholar] [CrossRef]
  12. Hocking, T.D.; Joulin, A.; Bach, F.; Vert, J.P. Clusterpath an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning, Washington, DC, USA, 28 June–2 July 2011; p. 1. [Google Scholar]
  13. Radchenko, P.; Mukherjee, G. Convex clustering via l1 fusion penalization. J. R. Stat. Soc. B. 2017, 79, 1527–1546. [Google Scholar] [CrossRef]
  14. Wang, B.; Zhang, Y.; Sun, W.W.; Fang, Y. Sparse convex clustering. J. Comput. Graph. Stat. 2018, 27, 393–403. [Google Scholar] [CrossRef]
  15. Yan, X.; Yin, G.; Zhao, X. Subgroup analysis in censored linear regression. Stat. Sinica 2021, 31, 1027–1054. [Google Scholar] [CrossRef]
  16. Yan, X.; Wang, H.; Zhou, Y.; Yan, J.; Wang, Y.; Wang, W.; Xie, J.; Yang, S.; Zeng, Z.; Chen, X. Heterogeneous logistic regression for estimation of subgroup effects on hypertension. J. Biopharm. Stat. 2022, 32, 969–985. [Google Scholar] [CrossRef] [PubMed]
  17. Zhu, C.; Xu, H.; Leng, C.; Yan, S. Convex optimization procedure for clustering: Theoretical revisit. Adv. Neural Inf. Process. Syst. 2014, 1619–1627. [Google Scholar]
  18. Ma, S.; Huang, J. A concave pairwise fusion approach to subgroup analysis. J. Am. Stat. Assoc. 2017, 112, 410–423. [Google Scholar] [CrossRef]
  19. Ma, S.; Huang, J. Estimating subgroup-specific treatment effects via concave fusion. arXiv 2016, arXiv:1607.03717. [Google Scholar]
  20. Marchetti, Y.; Zhou, Q. Iterative subsampling in solution path clustering of noisy big data. arXiv 2014, arXiv:1412.1559. [Google Scholar] [CrossRef]
  21. Achlioptas, D. Database-Friendly Random Projections. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Santa Barbara, CA, USA, 21–23 May 2001; Association for Computing Machinery: New York, NY, USA, 2001; pp. 274–281. [Google Scholar] [CrossRef]
  22. Ailon, N.; Chazelle, B. The Fast Johnson–Lindenstrauss Transform and Approximate Nearest Neighbors. SIAM J. Comput. 2009, 39, 302–322. [Google Scholar] [CrossRef]
  23. Bingham, E.; Mannila, H. Random Projection in Dimensionality Reduction: Applications to Image and Text Data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 26–29 August 2001; Association for Computing Machinery: New York, NY, USA, 2001; pp. 245–250. [Google Scholar] [CrossRef]
  24. Kane, D.M.; Nelson, J. Sparser johnson-lindenstrauss transforms. J. ACM 2014, 61, 1–23. [Google Scholar] [CrossRef]
  25. Tibshirani, R.; Walther, G. Cluster validation by prediction strength. J. Comput. Graph. Stat. 2005, 14, 511–528. [Google Scholar] [CrossRef]
  26. Fan, J.; Lv, J. Nonconcave penalized likelihood with NP-dimensionality. IEEE T. Inform. Theory 2011, 57, 5467–5484. [Google Scholar] [CrossRef]
  27. Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef] [PubMed]
  28. Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn. 2011, 3, 1–122. [Google Scholar] [CrossRef]
  29. Ghadimi, E.; Teixeira, A.; Shames, I.; Johansson, M. Optimal parameter selection for the alternating direction method of multipliers (ADMM): Quadratic problems. IEEE Trans. Autom. Control 2014, 60, 644–658. [Google Scholar] [CrossRef]
  30. Liu, B.; Shen, X.; Pan, W. Integrative and regularized principal component analysis of multiple sources of data. Stat. Med. 2016, 35, 2235–2250. [Google Scholar] [CrossRef]
  31. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
  32. Rand, W.M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 1971, 66, 846–850. [Google Scholar] [CrossRef]
  33. Zheng, P.; Aravkin, A. Relax-and-split method for nonconvex inverse problems. Inverse Probl. 2020, 36, 095013. [Google Scholar] [CrossRef]
  34. Chakraborty, S.; Xu, J. Biconvex clustering. J. Comput. Graph. Stat. 2023, 32, 1524–1536. [Google Scholar] [CrossRef]
Figure 1. Plots of BIC values against $\log(\lambda)$ and the estimated cluster number of the simulation with different n, p and true cluster number K = 3.
Figure 2. The level plots of cluster matrix including the true one in the left panel, estimators calculated from RRP-ADMM and RP-ADMM in the middle and right panels, respectively.
Figure 3. Boxplots of SI values through RP-ADMM and RRP-ADMM algorithms, respectively, under four choices of c after 100 replicates.
Figure 4. Boxplots of SI values through classical ADMM and RRP-ADMM algorithms, respectively, under four choices of sample sizes n after 100 replicates.
Figure 5. Boxplots of the square root of the run time through classical ADMM and RRP-ADMM algorithms, respectively, under four choices of sample sizes n after 100 replicates.
Figure 6. Boxplots of SI values through DRP-ADMM and DRRP-ADMM algorithms, respectively, under four choices of dimensions p after 100 replicates.
Figure 7. True (a) and estimated (b) cluster matrix in DrivFace data.
Figure 8. Fusiongrams of 4 selected variables in the DrivFace data before (left panel) and after (right panel) deleting the influential points.
Table 1. The mean values of the Similarity Index (SI), Rand Index (RI), and run time in seconds against different sample sizes and different methods after 100 replicates.

              ADMM                  RRP-ADMM              RS-ADMM
Sample Size   SI     RI     Time    SI     RI     Time    SI     RI     Time
n = 60        0.081  0.921  7       0.059  0.933  2       0.080  0.925  10
n = 150       0.058  0.945  88      0.046  0.957  7       0.056  0.947  121
n = 240       0.049  0.962  352     0.045  0.974  17      0.047  0.966  551
n = 360       0.042  0.973  1582    0.040  0.986  41      0.042  0.978  1864

Note: 'SI' represents the similarity index defined in (14), 'RI' denotes the Rand Index [32], and 'Time' is the required time in seconds for a single round of ADMM.

