Bayesian Data Selection
Eli N. Weinstein, Jeffrey W. Miller; 24(23):1−72, 2023.
Abstract
To gain insight into complex, high-dimensional data, it is important to identify features of the data that match, or fail to match, a given model of interest. To formalize this task, we introduce the "data selection" problem: finding a lower-dimensional statistic, such as a subset of variables, that is well fit by a given parametric model of interest. A fully Bayesian approach to data selection would be to parametrically model the value of the statistic, nonparametrically model the remaining "background" components of the data, and perform standard Bayesian model selection for the choice of statistic; however, fitting a nonparametric model to high-dimensional data tends to be statistically and computationally inefficient. We propose a novel score for data selection, the "Stein volume criterion (SVC)", which does not require fitting a nonparametric model. The SVC is a generalized marginal likelihood based on a kernelized Stein discrepancy rather than the Kullback-Leibler divergence. We prove that the SVC is consistent for data selection, and we establish the consistency and asymptotic normality of the corresponding generalized posterior on parameters. We apply the SVC to the analysis of single-cell RNA sequencing data sets using probabilistic principal components analysis and a spin glass model of gene regulation.
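The kernelized Stein discrepancy at the heart of the SVC compares a sample against a model using only the model's score function ∇_x log p(x), with no normalizing constant required. As a minimal illustration (not the paper's implementation, and with an arbitrarily chosen RBF bandwidth), the following NumPy sketch computes a V-statistic estimate of the squared KSD and shows that it is small for data drawn from the model and larger for mismatched data:

```python
import numpy as np

def ksd_squared(X, score, h=1.0):
    """V-statistic estimate of the squared kernelized Stein discrepancy
    between a sample X (n x d) and a model with score function
    score(x) = grad_x log p(x), using an RBF kernel with bandwidth h."""
    n, d = X.shape
    S = score(X)                              # (n, d) model scores at the samples
    diffs = X[:, None, :] - X[None, :, :]     # (n, n, d) pairwise x_i - x_j
    sq = np.sum(diffs ** 2, axis=-1)          # (n, n) squared distances
    K = np.exp(-sq / (2 * h ** 2))            # RBF kernel matrix
    # Stein kernel u_p(x, y) = s(x).s(y) k(x,y) + s(x).grad_y k
    #                          + s(y).grad_x k + trace(grad_x grad_y k)
    t1 = (S @ S.T) * K
    t2 = np.einsum('id,ijd->ij', S, diffs) / h ** 2 * K
    t3 = -np.einsum('jd,ijd->ij', S, diffs) / h ** 2 * K
    t4 = (d / h ** 2 - sq / h ** 4) * K
    return (t1 + t2 + t3 + t4).mean()

# Example: data drawn from the model (a standard normal) versus shifted data.
rng = np.random.default_rng(0)
X_match = rng.standard_normal((300, 2))
X_mismatch = X_match + 2.0
std_normal_score = lambda x: -x               # grad log density of N(0, I)
ksd_match = ksd_squared(X_match, std_normal_score)
ksd_mismatch = ksd_squared(X_mismatch, std_normal_score)
```

Because the Stein kernel is positive semidefinite, the V-statistic is nonnegative, and a larger value indicates a worse fit of the model to the data.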