Statistical Robustness of Empirical Risks in Machine Learning
Shaoyan Guo, Huifu Xu, Liwei Zhang; 24(125):1−38, 2023.
Abstract
This paper investigates the convergence of empirical risks in reproducing kernel Hilbert spaces (RKHS). Previous research has assumed that the empirical training data are generated by the unknown true probability distribution, but this assumption may not hold in practice. Consequently, existing convergence results may not guarantee the reliability of empirical risks when the data are potentially corrupted, that is, generated from a distribution perturbed away from the true one. In this paper, we address this gap from a robust statistics perspective (Krätschmer, Schied and Zähle, 2012, 2014; Guo and Xu, 2020). First, we establish moderate sufficient conditions under which the expected risk remains stable (continuous) with respect to small perturbations of the probability distribution of the underlying random variables, and we analyze how the cost function and kernel affect this stability. Second, we compare the statistical estimators of the expected optimal loss based on pure data and on contaminated data using the Prokhorov metric and the Kantorovich metric, and derive both asymptotic qualitative and non-asymptotic quantitative statistical robustness results. Third, we identify appropriate metrics under which the statistical estimators are uniformly asymptotically consistent. These results provide a theoretical foundation for analyzing asymptotic convergence and examining the reliability of statistical estimators in various regression models.
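For reference, the two probability metrics named in the abstract are standard; a common formulation, for probability measures $P$ and $Q$ on a separable metric space $(X,d)$, is sketched below (the notation is assumed here and the paper's precise definitions may differ in detail):

% Prokhorov metric (with A^eps the open eps-neighborhood of a Borel set A):
\[
  \pi(P,Q) \;=\; \inf\bigl\{\varepsilon > 0 \,:\, P(A) \le Q(A^{\varepsilon}) + \varepsilon
  \ \text{and}\ Q(A) \le P(A^{\varepsilon}) + \varepsilon \ \text{for all Borel sets } A \bigr\},
  \qquad A^{\varepsilon} = \{x \in X : d(x,A) < \varepsilon\}.
\]

% Kantorovich (Wasserstein-1) metric, written via Kantorovich--Rubinstein duality:
\[
  d_{K}(P,Q) \;=\; \sup\Bigl\{ \Bigl|\textstyle\int_{X} f\,\mathrm{d}P - \int_{X} f\,\mathrm{d}Q \Bigr|
  \,:\, f : X \to \mathbb{R},\ |f(x)-f(y)| \le d(x,y)\ \text{for all } x,y \in X \Bigr\}.
\]

Convergence in either metric metrizes weak convergence on suitable classes of distributions, which is why these metrics are natural tools for quantifying how estimators respond to perturbations of the data-generating distribution.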