Adversarial attacks target machine learning models by deliberately modifying inputs so that the model makes incorrect predictions. Such attacks can have serious consequences, particularly in applications such as autonomous vehicles, medical diagnosis, and security systems. Prior work on the vulnerability of deep learning models has shown that it is very easy to craft samples that cause a model to produce unintended predictions. In this work, we analyze the impact of adversarial attacks on model interpretability in text classification problems. We develop an ML-based classification model for text data. We then introduce adversarial perturbations into the text data and measure classification performance after the attack. Finally, we analyze and interpret the model's explainability before and after the attack.
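The pipeline described above (train a text classifier, perturb the input, compare predictions and explanations) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the model, attack, or explanation method used in this work: the toy dataset, the Naive Bayes log-odds classifier, the character-swap perturbation, and the helper names `train_naive_bayes`, `explain`, `predict`, `perturb`, and `top_word` are all hypothetical.

```python
import math
from collections import Counter

# Toy labelled data (hypothetical; 1 = positive, 0 = negative, balanced classes).
TRAIN = [
    ("great movie loved it", 1),
    ("wonderful acting great plot", 1),
    ("loved the wonderful story", 1),
    ("terrible movie hated it", 0),
    ("awful acting boring plot", 0),
    ("hated the boring story", 0),
]

def train_naive_bayes(data):
    """Return per-word log-odds (positive vs. negative) with Laplace smoothing."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in data:
        counts[label].update(text.split())
    vocab = set(counts[0]) | set(counts[1])
    totals = {c: sum(counts[c].values()) + len(vocab) for c in (0, 1)}
    return {
        w: math.log((counts[1][w] + 1) / totals[1])
        - math.log((counts[0][w] + 1) / totals[0])
        for w in vocab
    }

def explain(weights, text):
    """Word-level attribution: each word's log-odds contribution (0.0 if unseen)."""
    return [(w, weights.get(w, 0.0)) for w in text.split()]

def predict(weights, text):
    """Predict 1 if the summed contributions are positive, else 0."""
    return 1 if sum(c for _, c in explain(weights, text)) > 0 else 0

def perturb(weights, text):
    """Swap the first two characters of the most influential word,
    which pushes it out of the model's vocabulary."""
    words = text.split()
    i = max(range(len(words)), key=lambda k: abs(weights.get(words[k], 0.0)))
    if len(words[i]) > 1:
        words[i] = words[i][1] + words[i][0] + words[i][2:]
    return " ".join(words)

def top_word(weights, text):
    """Word with the largest absolute attribution."""
    return max(explain(weights, text), key=lambda wc: abs(wc[1]))[0]

weights = train_naive_bayes(TRAIN)
original = "great acting awful plot"
adversarial = perturb(weights, original)

for label, text in (("before", original), ("after", adversarial)):
    print(label, repr(text), "prediction:", predict(weights, text),
          "top word:", top_word(weights, text))
```

On this toy data, the single character swap flips the predicted label and moves the highest-magnitude attribution from "great" to "awful", which is the kind of before-and-after shift in explanations that the analysis examines. A realistic study would substitute a trained model, an established attack, and a dedicated explanation method.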