Existing work has shown that fine-tuned textual transformer models achieve state-of-the-art prediction performance but are also vulnerable to adversarial text perturbations. Traditional adversarial evaluation is often performed \textit{only after} fine-tuning the models, ignoring the training data. In this paper, we show that there is also a strong correlation between training data and model robustness. To this end, we extract 13 different features representing a wide range of properties of the input fine-tuning corpora and use them to predict the adversarial robustness of the fine-tuned models. Focusing mainly on the encoder-only transformer models BERT and RoBERTa, with additional results for BART, ELECTRA, and GPT2, we provide diverse evidence to support our argument. First, empirical analyses show that (a) the extracted features can be used with a lightweight classifier such as Random Forest to effectively predict the attack success rate, and (b) the features with the greatest influence on model robustness have a clear correlation with that robustness. Second, our framework can serve as a fast and effective additional tool for robustness evaluation, since it (a) saves 30x-193x runtime compared to the traditional technique, (b) is transferable across models, (c) can be used under adversarial training, and (d) is robust to statistical randomness. Our code is publicly available at \url{https://github.com/CaptainCuong/RobustText_ACL2024}.