Benign overfitting in ridge regression

Alexander Tsigler, Peter L. Bartlett; 24(123):1−76, 2023.

Abstract

Many modern applications of deep learning involve neural networks with far more parameters than training data points. This has led to a significant body of theoretical research on overparameterized models. One important phenomenon in this regime is the ability of the model to interpolate noisy data while still achieving test error lower than the noise level of the data. In previous work (arXiv:1906.11300), the authors characterized the conditions under which this phenomenon can occur in linear regression, considering the interpolating solution with minimum ℓ2-norm under the assumption of independent components in the data. They provided a sharp bound on the variance term and showed that it can be small if and only if the data covariance has high effective rank in a subspace of small co-dimension. In this paper, we strengthen and expand on their results by removing the independence assumption and providing sharp bounds for the bias term. Our results apply in a more general setting, including kernel regression, and explain not only how the noise is damped but also which part of the true signal is learned. Furthermore, we extend the analysis to ridge regression, where we present general sufficient conditions under which the optimal regularization is negative.
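The following is a minimal illustrative sketch, not the paper's method: it sets up an overparameterized linear regression with a decaying covariance spectrum, computes the minimum ℓ2-norm interpolator, and computes a ridge estimator in the same "kernel" form with a (possibly negative) regularization parameter. The sample sizes, spectrum, true signal, noise level, and the particular negative value of the regularization parameter are all assumptions chosen for illustration only.

```python
import numpy as np

# Illustrative setup (assumed, not from the paper): n samples, p >> n features,
# Gaussian data whose covariance has a decaying spectrum.
rng = np.random.default_rng(0)
n, p = 50, 1000
eigvals = 1.0 / np.arange(1, p + 1)                 # assumed decaying covariance spectrum
X = rng.standard_normal((n, p)) * np.sqrt(eigvals)  # rows have covariance diag(eigvals)
theta_star = np.zeros(p)
theta_star[0] = 1.0                                 # assumed true signal, for illustration
y = X @ theta_star + 0.1 * rng.standard_normal(n)   # noisy labels

# Minimum l2-norm interpolator: theta_hat = X^T (X X^T)^{-1} y.
gram = X @ X.T
theta_min_norm = X.T @ np.linalg.solve(gram, y)

# Ridge estimator in the same form: theta_hat = X^T (X X^T + lam * I)^{-1} y.
# The paper discusses conditions under which the optimal lam is negative;
# lam = -0.5 here is purely illustrative.
lam = -0.5
theta_ridge = X.T @ np.linalg.solve(gram + lam * np.eye(n), y)

# Training residual of the interpolator (should be ~0) and a simple excess-risk
# proxy: (theta_hat - theta_star)^T Sigma (theta_hat - theta_star).
print("train residual (min-norm):", np.linalg.norm(X @ theta_min_norm - y))
print("excess risk proxy (min-norm):", np.sum(eigvals * (theta_min_norm - theta_star) ** 2))
print("excess risk proxy (ridge):  ", np.sum(eigvals * (theta_ridge - theta_star) ** 2))
```

The ridge estimator is written in its n-by-n "kernel" form so that the same expression makes sense for negative regularization as long as the shifted Gram matrix remains invertible, which is the regime the abstract's last sentence refers to.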
