Adam's Implicit Bias: Unveiling the Truth (arXiv:2309.00079v1 [cs.LG])

Previous work used backward error analysis to derive ordinary differential equations (ODEs) that approximate the gradient descent trajectory. That work showed that finite step sizes implicitly regularize solutions, because terms appearing in the ODEs penalize the two-norm of the loss gradients. In this study, we prove that whether similar implicit regularization exists in RMSProp and Adam depends on their hyperparameters and the training stage, and that a different “norm” is involved: the corresponding ODE terms either penalize the (perturbed) one-norm of the loss gradients or, on the contrary, hinder its decrease (the latter case being typical). We also conduct numerical experiments and discuss how these proven results may affect generalization.
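
To make the gradient descent part of this statement concrete, here is a minimal sketch on a one-dimensional quadratic loss L(theta) = 0.5 * a * theta**2. It assumes the standard form of the modified loss from the backward error analysis literature, L + (h/4) * ||grad L||^2 for step size h, and compares one gradient descent step against the exact flows of the original and the modified loss. The helper `perturbed_one_norm` is a hypothetical illustration of what a "perturbed one-norm" could look like (sum of sqrt(g_i^2 + eps)); that definition is an assumption and is not taken from the text above. All function names and the test values of `a`, `theta0`, and `h` are illustrative choices.

```python
import math


def gd_step(theta, h, a):
    """One gradient descent step on L(theta) = 0.5 * a * theta**2."""
    return theta - h * a * theta


def gradient_flow(theta0, t, a):
    """Exact solution at time t of d(theta)/dt = -L'(theta)."""
    return theta0 * math.exp(-a * t)


def modified_flow(theta0, t, h, a):
    """Exact solution at time t of the flow on L + (h/4) * L'(theta)**2.

    For the quadratic loss, the extra term adds 0.5 * h * a**2 to the decay rate.
    """
    return theta0 * math.exp(-(a + 0.5 * h * a * a) * t)


def one_step_errors(h, theta0=1.0, a=2.0):
    """Return (error vs. plain flow, error vs. modified flow) after time h."""
    gd = gd_step(theta0, h, a)
    plain = abs(gd - gradient_flow(theta0, h, a))
    modified = abs(gd - modified_flow(theta0, h, h, a))
    return plain, modified


def squared_two_norm(grad):
    """Squared two-norm of a gradient, the quantity penalized for gradient descent."""
    return sum(g * g for g in grad)


def perturbed_one_norm(grad, eps=1e-8):
    """ASSUMED form of a perturbed one-norm: sum_i sqrt(g_i**2 + eps)."""
    return sum(math.sqrt(g * g + eps) for g in grad)


if __name__ == "__main__":
    for h in (0.1, 0.05, 0.025):
        plain, modified = one_step_errors(h)
        print(f"h={h}: plain flow error={plain:.2e}, modified flow error={modified:.2e}")
```

In this toy setting, halving h should shrink the error against the plain flow by roughly a factor of four and the error against the modified flow by roughly a factor of eight, which is the sense in which the modified ODE tracks the discrete iterates more closely.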
