Recent developments in adversarial attacks on deep learning leave many mission-critical natural language processing (NLP) systems at risk of exploitation. To address the lack of computationally efficient adversarial defense methods, this paper presents a novel, universal technique that drastically improves the robustness of Bidirectional Encoder Representations from Transformers (BERT) by combining unitary weights with the multi-margin loss. We find that the marriage of these two simple ideas amplifies the protection against malicious interference. Our model, the unitary multi-margin BERT (UniBERT), boosts post-attack classification accuracies significantly, by 5.3% to 73.8%, while maintaining competitive pre-attack accuracies. Furthermore, the tradeoff between pre-attack and post-attack accuracy can be adjusted via a single scalar parameter to best fit the design requirements of the target applications.
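The two ingredients named above can be illustrated in isolation. The following is a minimal sketch, not the paper's implementation: it uses PyTorch's orthogonal parametrization as the real-valued analogue of a unitary weight constraint (so the layer preserves input norms) together with the built-in multi-margin loss; the layer sizes, batch, and margin value are illustrative assumptions.

```python
# Hedged sketch: an orthogonal (unitary-style) weight constraint combined
# with the multi-margin loss. Shapes and margin are illustrative only.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

torch.manual_seed(0)

# Dense layer whose weight is re-parametrized to remain orthogonal,
# the real-valued counterpart of a unitary matrix (W @ W.T == I).
layer = orthogonal(nn.Linear(8, 8, bias=False))

# Multi-margin loss: penalizes any wrong-class logit that comes within
# `margin` of the true-class logit, widening the decision margin.
criterion = nn.MultiMarginLoss(margin=1.0)

x = torch.randn(4, 8)                  # batch of 4 toy "embeddings"
logits = layer(x)                      # norm-preserving projection
targets = torch.tensor([0, 1, 2, 3])   # toy class labels
loss = criterion(logits, targets)

# The orthogonality error should be numerically zero.
W = layer.weight
orth_err = (W @ W.T - torch.eye(8)).abs().max().item()
```

Because the constrained weight is orthogonal, `logits.norm(dim=1)` matches `x.norm(dim=1)`, which is the norm-preservation property that limits how far an adversarial perturbation can be amplified through the layer.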