Implicit Bias of Gradient Descent for Mean Squared Error Regression with Two-Layer Wide Neural Networks
Hui Jin, Guido Montúfar; Volume 24, Issue 137, Pages 1-97, 2023.
Abstract
We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, we show that the solution of training a shallow ReLU network of width $n$ is within $n^{-1/2}$ of the function that fits the training data and whose difference from the function at initialization has the smallest 2-norm of the second derivative, weighted by a curvature penalty that depends on the probability distribution used to initialize the network parameters. We compute this curvature penalty function explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, so that the solution function is the natural cubic spline interpolation of the training data. We obtain the same implicit bias result for stochastic gradient descent, and a similar result for other activation functions. For multivariate regression, we show an analogous result in which the second derivative is replaced by the Radon transform of a fractional Laplacian. Initialization schemes that yield a constant penalty function lead to solutions that are polyharmonic splines. Moreover, we show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.
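In symbols, the univariate statement can be sketched as a variational problem. This is only a schematic reading of the abstract: the symbols $f_{\theta(0)}$ for the function at initialization, $\rho$ for the curvature penalty, and the exact way the penalty enters the weighted norm are illustrative placeholders rather than the paper's notation.
$$
f^{*} \;=\; \operatorname*{arg\,min}_{g\,:\,g(x_i)=y_i,\ i=1,\dots,m}\ \int \rho(x)\,\big(g''(x)-f_{\theta(0)}''(x)\big)^{2}\,dx ,
$$
with the trained width-$n$ network lying within $n^{-1/2}$ of $f^{*}$. When $\rho$ is constant, as under asymmetric uniform initialization, minimizing $\int (g''-f_{\theta(0)}'')^{2}$ subject to the interpolation constraints recovers the natural cubic spline. The trajectory statement can likewise be sketched as a family of smoothing-spline problems,
$$
g_{\lambda} \;=\; \operatorname*{arg\,min}_{g}\ \sum_{i=1}^{m}\big(g(x_i)-y_i\big)^{2} \;+\; \lambda \int \rho(x)\,\big(g''(x)-f_{\theta(0)}''(x)\big)^{2}\,dx ,
$$
with the regularization strength $\lambda$ decreasing as training proceeds.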