Selective Inference for K-means Clustering
Yiqun T. Chen, Daniela M. Witten; 24(152):1−41, 2023.
Abstract
We consider the problem of testing for a difference in means between clusters of observations identified through k-means clustering. In this setting, classical hypothesis tests have an inflated Type I error rate, because the same data are used both to form the clusters and to test for a difference between them. Recently, Gao et al. (2022) addressed a similar problem in the context of hierarchical clustering. However, their solution is tailored to hierarchical clustering and cannot be applied to k-means clustering. In this paper, we propose a p-value that conditions on all intermediate clustering assignments in the k-means algorithm. We show that this p-value controls the selective Type I error, in finite samples, for a test of the difference in means between a pair of clusters obtained via k-means clustering, and that it can be computed efficiently. We apply the proposed approach to hand-written digits data and single-cell RNA-sequencing data.
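The inflation of the Type I error described above can be illustrated with a small simulation. The sketch below is not the selective p-value proposed in the paper; it only shows the problem that motivates it. It assumes one-dimensional data drawn from a single N(0, 1) distribution (so there are no true clusters), a plain Lloyd's-algorithm implementation of k-means with k = 2, and a two-sample z-test with known unit variance standing in for a "classical" test. The function names `two_means_1d`, `naive_pvalue`, and `naive_type1_error` are hypothetical and chosen for this illustration.

```python
import math
import random


def two_means_1d(x, n_iter=50):
    """Lloyd's algorithm with k=2 on 1-D data; returns a list of 0/1 labels."""
    c0, c1 = min(x), max(x)  # assumption: initialize centroids at the extremes
    labels = None
    for _ in range(n_iter):
        new_labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in x]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        g0 = [v for v, lab in zip(x, labels) if lab == 0]
        g1 = [v for v, lab in zip(x, labels) if lab == 1]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return labels


def naive_pvalue(x):
    """Two-sided z-test (known variance 1) that ignores how the clusters were found."""
    labels = two_means_1d(x)
    g0 = [v for v, lab in zip(x, labels) if lab == 0]
    g1 = [v for v, lab in zip(x, labels) if lab == 1]
    diff = sum(g1) / len(g1) - sum(g0) / len(g0)
    se = math.sqrt(1.0 / len(g0) + 1.0 / len(g1))
    z = abs(diff) / se
    return math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(z))


def naive_type1_error(n_reps=500, n=30, alpha=0.05, seed=0):
    """Fraction of null data sets on which the naive test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no true clusters
        if naive_pvalue(x) < alpha:
            rejections += 1
    return rejections / n_reps


if __name__ == "__main__":
    print("naive rejection rate under the null:", naive_type1_error())
```

Because every data set here is generated without any cluster structure, a valid test at level 0.05 would reject about 5% of the time. The naive test rejects far more often, since k-means has already separated the observations into a low group and a high group before the test is run.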