Adversarial examples, inputs deliberately perturbed with imperceptible changes to induce model errors, have raised serious concerns about the reliability and security of deep neural networks (DNNs). While adversarial attacks have been studied extensively in continuous data domains such as images, the discrete nature of text presents unique challenges. In this paper, we propose Irony-based Adversarial Examples (IAE), a method that transforms straightforward sentences into ironic ones to create adversarial text. The approach exploits irony, a rhetorical device in which the intended meaning is the opposite of the literal one, so that detecting it requires a deeper understanding of context. Generating IAE is challenging because it requires accurately locating evaluation words, substituting them with appropriate collocations, and expanding the text with suitable ironic elements, all while maintaining semantic coherence. Our research makes the following key contributions: (1) We introduce IAE, a strategy for generating textual adversarial examples using irony. The method does not rely on pre-existing irony corpora, which makes it applicable to creating adversarial text for a variety of NLP tasks. (2) We demonstrate that the performance of several state-of-the-art deep learning models on sentiment analysis tasks deteriorates significantly under IAE attacks. This finding underscores the susceptibility of current NLP systems to adversarial manipulation through irony. (3) We compare the impact of IAE on human judgment with its impact on NLP systems, revealing that humans are less susceptible than NLP systems to the effects of irony in text.
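The three steps named in the abstract (locating evaluation words, substituting them, and expanding the text with ironic elements) can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon `EVALUATION_LEXICON`, the expansion table `IRONIC_EXPANSIONS`, and the functions `locate_evaluation_words` and `to_ironic` are hypothetical names introduced here, and a fixed word list stands in for whatever word-location and collocation-selection procedure the paper actually uses. The sketch handles only negative evaluation words.

```python
import re

# Hypothetical toy lexicon (an assumption, not taken from the paper):
# evaluation word -> (original polarity, opposite-polarity substitute).
EVALUATION_LEXICON = {
    "terrible": ("negative", "wonderful"),
    "boring": ("negative", "thrilling"),
    "slow": ("negative", "speedy"),
}

# Hypothetical ironic expansions, keyed by the polarity of the original text.
IRONIC_EXPANSIONS = {
    "negative": "Exactly what I was hoping for.",
}


def locate_evaluation_words(sentence):
    """Step 1: return (word, start, end) for each lexicon word in the sentence."""
    hits = []
    for match in re.finditer(r"[A-Za-z']+", sentence):
        if match.group().lower() in EVALUATION_LEXICON:
            hits.append((match.group(), match.start(), match.end()))
    return hits


def to_ironic(sentence):
    """Steps 2 and 3: flip the evaluation words, then append an ironic expansion."""
    hits = locate_evaluation_words(sentence)
    if not hits:
        # Nothing to rewrite: leave the sentence unchanged.
        return sentence
    polarity = EVALUATION_LEXICON[hits[0][0].lower()][0]
    result = sentence
    # Replace from right to left so earlier character offsets stay valid.
    for word, start, end in reversed(hits):
        substitute = EVALUATION_LEXICON[word.lower()][1]
        result = result[:start] + substitute + result[end:]
    return result + " " + IRONIC_EXPANSIONS[polarity]


if __name__ == "__main__":
    print(to_ironic("The service was terrible."))
```

In this toy setting the literal wording of the output becomes positive, while the appended expansion is meant to cue the original negative intent for a human reader. Whether a given sentiment classifier is actually misled would have to be measured separately.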