We introduce a new defense mechanism against backdoor attacks on deep neural networks (DNNs), in which adversaries covertly implant malicious behaviors (backdoors) into victim models. Our approach belongs to the category of post-development defenses, which operate independently of how the model was generated. It is built on a novel reverse-engineering technique that directly extracts the backdoor functionality of a compromised model into a dedicated backdoor expert model.
The method is straightforward: fine-tuning the compromised model on a small set of deliberately mislabeled clean samples causes it to unlearn its normal functionality while retaining its backdoor functionality. The result is a backdoor expert model that recognizes only backdoor inputs. Leveraging this extracted backdoor expert, we show how to construct highly accurate detectors that filter out backdoor inputs during model inference. We further strengthen the defense by combining the detector, via an ensemble strategy, with a fine-tuned auxiliary model.
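To make the unlearning and detection steps concrete, here is a minimal PyTorch sketch. All names (`extract_backdoor_expert`, `is_backdoor_input`, `clean_loader`) and hyperparameters (learning rate, epochs, threshold) are illustrative assumptions, not a reference implementation; the essential ingredients are (i) fine-tuning on clean samples whose labels are deliberately shifted to wrong classes, and (ii) flagging an input when the expert remains confident in the deployed model's prediction.

```python
import copy
import torch
import torch.nn.functional as F


def extract_backdoor_expert(backdoored_model, clean_loader, num_classes,
                            lr=1e-4, epochs=1, device="cpu"):
    """Fine-tune a copy of the backdoored model on deliberately mislabeled
    clean samples so that it unlearns the normal (clean) functionality while
    retaining the backdoor functionality. Hyperparameters are assumptions."""
    expert = copy.deepcopy(backdoored_model).to(device)
    expert.train()
    optimizer = torch.optim.SGD(expert.parameters(), lr=lr)

    for _ in range(epochs):
        for x, y in clean_loader:
            x, y = x.to(device), y.to(device)
            # Shift every label by a random nonzero offset, guaranteeing
            # each training label is wrong for its clean sample.
            wrong_y = (y + torch.randint(1, num_classes, y.shape,
                                         device=device)) % num_classes
            optimizer.zero_grad()
            loss = F.cross_entropy(expert(x), wrong_y)
            loss.backward()
            optimizer.step()
    expert.eval()
    return expert


@torch.no_grad()
def is_backdoor_input(x, backdoored_model, expert, threshold=0.5):
    """Illustrative detection rule (threshold and scoring are assumptions):
    since the expert retains only the backdoor functionality, high expert
    confidence on the deployed model's predicted class suggests x is a
    backdoor input."""
    pred = backdoored_model(x).argmax(dim=1)
    expert_conf = F.softmax(expert(x), dim=1).gather(
        1, pred.unsqueeze(1)).squeeze(1)
    return expert_conf > threshold
```

The shifted-label trick ensures every fine-tuning label disagrees with the ground truth, which is intended to degrade clean accuracy without disturbing the trigger-to-target mapping; in the full defense, this detection score would additionally be ensembled with the fine-tuned auxiliary model mentioned above.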
Our defense, BaDExpert (Backdoor Input Detection with Backdoor Expert), effectively mitigates 16 state-of-the-art backdoor attacks with minimal loss of clean model utility. We verify the effectiveness of BaDExpert on multiple datasets (CIFAR10, GTSRB, and ImageNet) and across various model architectures (ResNet, VGG, MobileNetV2, and Vision Transformer).