Overcoming the Curse of Dimensionality in Bayesian Model-Based Clustering

Noirrit Kiran Chandra, Antonio Canale, David B. Dunson; 24(144):1–42, 2023.

Abstract

When clustering high-dimensional data, Bayesian mixture models are commonly used to provide uncertainty quantification. However, as the dimension of the observations increases, posterior inference often tends to favor either too many or too few clusters. In this article, we investigate this phenomenon by studying the random partition posterior in a non-standard setting with a fixed sample size and increasing data dimensionality. We establish conditions under which the finite-sample posterior tends either to assign every observation to its own cluster or to assign all observations to a single cluster as the dimension grows. Notably, these conditions do not rely on the choice of clustering prior, as long as all possible partitions of the observations into clusters have positive prior probability, and they hold regardless of the true data-generating model. To address this issue, we propose a class of latent mixtures for Bayesian clustering (Lamb), which operates on a set of low-dimensional latent variables that induce a partition on the observed data. The model facilitates scalable posterior inference and mitigates the challenges posed by high dimensionality under mild assumptions. Through simulation studies and an application to inferring cell types from single-cell RNA sequencing (scRNA-seq) data, we demonstrate the good performance of the proposed approach.
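The abstract's description suggests a factor-style structure in which each observation depends on a low-dimensional latent variable, for instance y_i = Λη_i + ε_i with η_i of dimension d much smaller than p, and the clustering performed on the η_i rather than the y_i. The sketch below makes this dichotomy concrete; it is a crude analogue, not the Lamb model itself: scikit-learn's BayesianGaussianMixture stands in for a Bayesian mixture posterior, PCA stands in for inference on the latent variables, and all dimensions, scales, and names (Lambda, eta, n_clusters) are illustrative assumptions introduced here.

```python
# Illustrative sketch (not the authors' implementation): clustering
# high-dimensional data directly vs. on low-dimensional latent variables.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Simulate n = 100 observations in p = 1000 dimensions whose 3 true
# clusters differ only through a d = 5 dimensional latent subspace.
n, p, d, k_true = 100, 1000, 5, 3
Lambda = rng.normal(size=(p, d))                   # loading matrix
centers = rng.normal(scale=3.0, size=(k_true, d))  # latent cluster means
labels = rng.integers(k_true, size=n)
eta = centers[labels] + rng.normal(scale=0.5, size=(n, d))  # latent variables
X = eta @ Lambda.T + rng.normal(size=(n, p))       # observed data

def n_clusters(Z):
    """Fit a truncated Bayesian mixture and count occupied components."""
    bgm = BayesianGaussianMixture(n_components=10, covariance_type="diag",
                                  max_iter=500, random_state=0).fit(Z)
    return len(np.unique(bgm.predict(Z)))

print("occupied components, ambient space:", n_clusters(X))
eta_hat = PCA(n_components=d).fit_transform(X)     # estimated latent space
print("occupied components, latent space: ", n_clusters(eta_hat))
```

The comparison is only meant to illustrate the abstract's point: in high dimensions, the ambient-space fit is the one prone to the degenerate partition behavior analyzed in the paper, while clustering a low-dimensional latent representation sidesteps it. The actual Lamb model places a mixture prior directly on the latent variables and performs full posterior inference jointly, rather than the two-stage PCA-then-cluster shortcut used here.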
