As machine learning (ML) systems are increasingly deployed in the real world to handle sensitive tasks and make decisions across many fields, the security and privacy of these models have become increasingly critical. In particular, Deep Neural Networks (DNNs) have been shown to be vulnerable to backdoor attacks, in which an adversary with access to the training data manipulates it by inserting carefully crafted samples into the training dataset. Although the NLP community has produced several studies on generating backdoor attacks that demonstrate the vulnerability of language models, to the best of our knowledge, no existing work defends against such attacks. To bridge this gap, we present RobustEncoder: a novel clustering-based technique for detecting and removing backdoor attacks in the text domain. Extensive empirical results demonstrate the effectiveness of our technique in detecting and removing backdoor triggers. Our code is available at https://github.com/marwanomar1/Backdoor-Learning-for-NLP