Consistent Model-based Clustering using the Quasi-Bernoulli Stick-breaking Process
Cheng Zeng, Jeffrey W. Miller, Leo L. Duan; 24(153):1–32, 2023.
Abstract
In applications of mixture modeling and clustering, the number of components and clusters is often unknown. One approach is to use a stick-breaking mixture model, such as the Dirichlet process mixture model, which assumes infinitely many components and shrinks the weights of unused components toward zero. However, this shrinkage is insufficient and can lead to an inconsistent estimate of the number of clusters, even when the component distribution is correctly specified. In this article, we propose a simple solution: when breaking the mixture weight stick into two pieces, the length of the second piece is multiplied by a quasi-Bernoulli random variable that takes value one or a small constant close to zero. This creates a soft truncation that further shrinks the weights of unused components. We show that if the data are generated from a finite mixture model and this small constant diminishes to zero at rate $o(1/n^{2})$, where $n$ is the sample size, then the posterior distribution converges to the true number of clusters. In comparison, we examine Dirichlet process mixture models with a concentration parameter that is either constant or rapidly diminishing to zero, both of which lead to inconsistency in estimating the number of clusters. Our proposed model is easy to implement, requiring only a small modification of a standard Gibbs sampler for mixture models. In simulations and an application to clustering brain networks, our method accurately recovers the true number of clusters and yields a small number of clusters.
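To illustrate the construction described above, here is a minimal sketch of sampling mixture weights from a truncated quasi-Bernoulli stick-breaking prior. The truncation level `K`, the `Beta(1, alpha)` stick proportions, the quasi-Bernoulli success probability `p`, the small constant `eps`, and the final renormalization are illustrative choices for this sketch, not the paper's exact specification.

```python
import numpy as np

def qb_stick_breaking(K, alpha=1.0, p=0.5, eps=1e-4, rng=None):
    """Simulate mixture weights from a truncated quasi-Bernoulli
    stick-breaking prior (illustrative sketch only).

    At each break, a Beta(1, alpha) proportion of the current stick
    becomes the next weight, and the remaining stick (the "second
    piece") is multiplied by a quasi-Bernoulli variable q that equals
    1 with probability p and the small constant eps otherwise. Once
    q = eps occurs, the remaining stick nearly vanishes, softly
    truncating all later components.
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = np.empty(K)
    remaining = 1.0
    for k in range(K - 1):
        v = rng.beta(1.0, alpha)              # usual stick-breaking proportion
        q = 1.0 if rng.random() < p else eps  # quasi-Bernoulli multiplier
        weights[k] = remaining * v            # first piece -> component weight
        remaining *= (1.0 - v) * q            # second piece, shrunk by q
    weights[K - 1] = remaining                # leftover mass on the last stick
    return weights / weights.sum()            # renormalize the finite truncation

# Example draw: mass concentrates on the first few components.
w = qb_stick_breaking(K=20, rng=np.random.default_rng(1))
print(np.round(w, 4))
```

Under the consistency condition in the abstract, `eps` would be taken to shrink with the sample size $n$ at rate $o(1/n^{2})$ rather than held fixed as in this sketch.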