Investigating the Role of Pre-training in Lifelong Learning: An Empirical Study

Sanket Vaibhav Mehta, Darshan Patil, Sarath Chandar, Emma Strubell; 24(214):1−50, 2023.

Abstract

The lifelong learning paradigm in machine learning is an appealing alternative to isolated learning schemes, as it resembles biological learning and has the potential to reduce energy waste by avoiding excessive model re-training. However, a major challenge in this paradigm is the occurrence of catastrophic forgetting. With the growing popularity and success of pre-trained models, we aim to investigate the role of pre-training in lifelong learning, specifically in relation to addressing catastrophic forgetting. We evaluate existing methods using large, pre-trained models and measure their performance on various text and image classification tasks. Additionally, we conduct a large-scale study using a novel dataset comprising 15 diverse NLP tasks. Across all scenarios, we observe that generic pre-training implicitly mitigates the effects of catastrophic forgetting when learning multiple tasks sequentially, outperforming randomly initialized models. Furthermore, we delve into the reasons behind the effectiveness of pre-training in mitigating forgetting. Through an analysis of the loss landscape, we discover that pre-trained weights lead to wider minima, which helps alleviate forgetting. Based on this insight, we propose a joint optimization approach that optimizes for current task loss and loss basin sharpness to explicitly encourage wider basins during sequential fine-tuning. Our results demonstrate that this optimization approach surpasses several state-of-the-art task-sequential continual learning algorithms in multiple scenarios, sometimes even without the need for a memory that scales with the number of tasks.
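The joint optimization described above pairs the current task loss with a sharpness penalty on the loss basin. The sketch below is not the authors' released implementation; it illustrates one common way such a sharpness-aware update can be realized (a SAM-style two-pass step), with `sharpness_aware_step`, `rho`, and the surrounding training objects all being hypothetical placeholders.

```python
# Minimal sketch (assumed, not the paper's code) of a sharpness-aware update
# for sequential fine-tuning: each step targets both the current task loss
# and the sharpness of the surrounding loss basin.
import torch

def sharpness_aware_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    inputs, targets = batch

    # First pass: gradient of the current task loss at the current weights.
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Perturb weights toward the locally sharpest direction:
    # eps = rho * g / ||g||, accumulated per parameter.
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
        scale = rho / (grad_norm + 1e-12)
        eps = []
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = p.grad * scale
            p.add_(e)
            eps.append(e)

    # Second pass: gradient at the perturbed point reflects basin sharpness.
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # Undo the perturbation, then update with the sharpness-aware gradient.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```

In a task-sequential setting, a loop of this form would simply be applied to each task's data in turn, so that the weights for every new task are driven toward wider minima rather than sharp ones.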
