One-Shot Neural Architecture Search (NAS) algorithms often rely on training a hardware-agnostic super-network for a specific task, and then extracting optimal sub-networks from the trained super-network for different hardware platforms. However, training super-networks from scratch can be very time-consuming and computationally intensive, especially for large models that require a two-stage training process of pre-training and fine-tuning. While state-of-the-art pre-trained models are available for a wide range of tasks, their large sizes limit their usability on various hardware platforms.
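For illustration, the following is a minimal sketch of the weight-sharing idea behind one-shot NAS: a layer is trained at its maximum width, and smaller sub-networks are extracted by slicing its shared weights. The ElasticLinear class, its dimensions, and the config values are illustrative assumptions, not the API of any specific NAS framework.

```python
# Minimal weight-sharing sketch: sub-networks reuse slices of a super-network layer.
import torch
import torch.nn as nn


class ElasticLinear(nn.Module):
    """Linear layer trained at maximum width; sub-networks use a prefix slice of it."""

    def __init__(self, in_features: int, max_out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out_features))

    def forward(self, x: torch.Tensor, out_features: int) -> torch.Tensor:
        # Extract a sub-network by keeping only the first `out_features` output units.
        return nn.functional.linear(x, self.weight[:out_features], self.bias[:out_features])


layer = ElasticLinear(in_features=768, max_out_features=3072)
x = torch.randn(4, 768)
full = layer(x, out_features=3072)   # super-network width
small = layer(x, out_features=1536)  # extracted sub-network width
print(full.shape, small.shape)
```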
To address these issues, we propose InstaTune, a method that leverages off-the-shelf pre-trained weights for large models and generates a super-network during the fine-tuning stage. This approach offers several advantages. First, because the super-network is generated during fine-tuning rather than trained from scratch, it reduces the overall time and compute resources required for NAS. Second, the sub-networks extracted by InstaTune are optimized for the target task, unlike prior methods that optimize them for the pre-training objective. Finally, InstaTune is easy to integrate into existing frameworks.
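To make the idea concrete, below is a minimal sketch of a fine-tuning loop in which a different sub-network width is sampled at every step, so the super-network is formed during fine-tuning itself. The ElasticMLP head, the random stand-in features and labels, and all hyperparameters are illustrative assumptions rather than the actual InstaTune implementation.

```python
# Sketch: sample a random sub-network at each fine-tuning step on the target task.
import random
import torch
import torch.nn as nn


class ElasticMLP(nn.Module):
    """Two-layer head whose hidden width is chosen per forward pass."""

    def __init__(self, in_dim=768, max_hidden=3072, num_classes=10):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(max_hidden, in_dim) * 0.02)
        self.b1 = nn.Parameter(torch.zeros(max_hidden))
        self.w2 = nn.Parameter(torch.randn(num_classes, max_hidden) * 0.02)
        self.b2 = nn.Parameter(torch.zeros(num_classes))

    def forward(self, x, hidden):
        h = torch.relu(nn.functional.linear(x, self.w1[:hidden], self.b1[:hidden]))
        return nn.functional.linear(h, self.w2[:, :hidden], self.b2)


model = ElasticMLP()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
widths = [768, 1536, 2304, 3072]  # elastic search space for the hidden width

# Each step trains a randomly sampled sub-network with the task loss,
# so elastic behavior is learned during fine-tuning rather than pre-training.
for step in range(100):
    x = torch.randn(32, 768)          # stand-in for features of a pre-trained backbone
    y = torch.randint(0, 10, (32,))   # stand-in for target-task labels
    hidden = random.choice(widths)    # sample one sub-network configuration per step
    loss = nn.functional.cross_entropy(model(x, hidden), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```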
We employ multi-objective evolutionary search along with lightly trained predictors to discover Pareto-optimal sub-networks that outperform their respective baselines in terms of both accuracy and MACs (multiply-accumulate operations). Our approach performs well across transformer-based architectures, both unimodal (ViT and BERT) and multi-modal (BEiT-3).
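The sketch below illustrates, under simplified assumptions, what such a search could look like: a lightly trained least-squares predictor estimates accuracy, a crude analytic proxy estimates MACs, and an evolutionary loop mutates the current Pareto front. The search space, predictor, MACs formula, measurement data, and all constants are illustrative only, not the paper's actual setup.

```python
# Sketch of multi-objective evolutionary search over per-layer hidden widths.
import random
import numpy as np

WIDTHS = [768, 1536, 2304, 3072]   # candidate hidden widths per layer (illustrative)
NUM_LAYERS = 12

# "Lightly trained" accuracy predictor: a least-squares fit on a handful of
# hypothetical (config, measured accuracy) pairs used as stand-in data.
measured_configs = [[random.choice(WIDTHS) for _ in range(NUM_LAYERS)] for _ in range(16)]
measured_acc = [0.70 + 0.05 * np.mean(c) / max(WIDTHS) + random.gauss(0, 0.005)
                for c in measured_configs]
X = np.hstack([np.array(measured_configs, dtype=float), np.ones((16, 1))])  # add bias term
coef, *_ = np.linalg.lstsq(X, np.array(measured_acc), rcond=None)

def predict_accuracy(config):
    return float(np.dot(list(config) + [1.0], coef))

def estimate_macs(config, tokens=197, in_dim=768):
    # Crude proxy: MACs of the MLP blocks only.
    return sum(2 * tokens * in_dim * w for w in config)

def mutate(config):
    child = list(config)
    child[random.randrange(NUM_LAYERS)] = random.choice(WIDTHS)
    return child

def pareto_front(population):
    # Keep candidates not dominated in (accuracy up, MACs down).
    scored = [(predict_accuracy(c), estimate_macs(c), c) for c in population]
    front = []
    for acc_c, macs_c, cand in scored:
        dominated = any(a >= acc_c and m <= macs_c and (a > acc_c or m < macs_c)
                        for a, m, _ in scored)
        if not dominated:
            front.append(cand)
    return front

# Evolutionary loop: mutate the current Pareto-optimal parents each generation.
population = [[random.choice(WIDTHS) for _ in range(NUM_LAYERS)] for _ in range(32)]
for generation in range(20):
    parents = pareto_front(population)
    children = [mutate(random.choice(parents)) for _ in range(32)]
    population = parents + children

for cfg in pareto_front(population)[:5]:
    print(cfg, round(predict_accuracy(cfg), 4), estimate_macs(cfg))
```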