Pre-training is known to produce universal representations for downstream tasks in large-scale deep learning, such as large language models. Existing literature, e.g., \cite{kim2020adversarial}, empirically observes that downstream tasks can inherit the adversarial robustness of the pre-trained model. We provide theoretical justification for this robustness inheritance phenomenon. Our results reveal that, in two-layer neural networks, feature purification plays an important role in connecting the adversarial robustness of the pre-trained model to that of the downstream tasks. Specifically, we show that (i) with adversarial training, each hidden node tends to pick only one (or a few) features; (ii) without adversarial training, the hidden nodes can be vulnerable to attacks. This observation holds for both supervised pre-training and contrastive learning. With purified nodes, clean training is enough to achieve adversarial robustness in downstream tasks.