This paper introduces a novel adversarial attack method targeting text classification models, termed the Modified Word Saliency-based Adversarial Attack (MWSAA). The technique builds on the concept of word saliency to strategically perturb input texts, aiming to mislead classification models while preserving semantic coherence. By refining the traditional adversarial attack approach, MWSAA significantly improves the ability of adversarial examples to evade detection by classification systems. The method first identifies salient words in the input text through a saliency estimation process that prioritizes the words most influential to the model's decision. These salient words are then modified, with semantic similarity metrics guiding each modification so that the altered text remains coherent and retains its original meaning. Empirical evaluations on diverse text classification datasets demonstrate that the proposed method generates adversarial examples that successfully deceive state-of-the-art classification models. Comparative analyses with existing adversarial attack techniques further indicate that the proposed approach outperforms them in both attack success rate and preservation of text coherence.
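The two-stage procedure described in the abstract (saliency estimation, then similarity-constrained modification of the salient words) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify how saliency is estimated or how substitutions are chosen, so the sketch assumes leave-one-out saliency and greedy substitution. The classifier (`WEIGHTS`, `positive_prob`), the candidate table `SYNONYMS` with its similarity scores, and the threshold `min_similarity` are hypothetical stand-ins chosen only for illustration.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy classifier: a bag-of-words logistic model with made-up weights.
WEIGHTS: Dict[str, float] = {
    "great": 2.0, "enjoyable": 0.5, "fine": -0.3, "watchable": -0.4, "awful": -2.0,
}

# Hypothetical substitution candidates with made-up semantic similarity scores.
SYNONYMS: Dict[str, List[Tuple[str, float]]] = {
    "great": [("fine", 0.8), ("awful", 0.1)],
    "enjoyable": [("watchable", 0.75)],
}

def positive_prob(tokens: List[str]) -> float:
    """Probability of the positive class under the toy model."""
    score = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))

def label_prob(tokens: List[str], label: int) -> float:
    p = positive_prob(tokens)
    return p if label == 1 else 1.0 - p

def predict(tokens: List[str]) -> int:
    return 1 if positive_prob(tokens) >= 0.5 else 0

def word_saliency(tokens: List[str], label: int) -> List[float]:
    """Assumed leave-one-out saliency: drop in P(label) when a word is removed."""
    base = label_prob(tokens, label)
    return [base - label_prob(tokens[:i] + tokens[i + 1:], label)
            for i in range(len(tokens))]

def attack(tokens: List[str], min_similarity: float = 0.7) -> List[str]:
    """Greedily replace the most salient words with sufficiently similar candidates."""
    original = predict(tokens)
    adv = list(tokens)
    saliency = word_saliency(tokens, original)
    order = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    for i in order:
        if saliency[i] <= 0:
            break  # remaining words do not support the original label
        # Semantic constraint: keep only candidates above the similarity threshold.
        candidates = [w for w, sim in SYNONYMS.get(adv[i], []) if sim >= min_similarity]
        if not candidates:
            continue
        best = min(candidates,
                   key=lambda w: label_prob(adv[:i] + [w] + adv[i + 1:], original))
        if label_prob(adv[:i] + [best] + adv[i + 1:], original) < label_prob(adv, original):
            adv[i] = best
        if predict(adv) != original:
            break  # the predicted label has flipped
    return adv

if __name__ == "__main__":
    text = "the movie was great and enjoyable".split()
    adversarial = attack(text)
    print(" ".join(adversarial), predict(text), predict(adversarial))
```

In this toy setting, the similarity threshold excludes the low-similarity candidate "awful" even though it would flip the label in a single step. This mirrors the trade-off the abstract describes between attack success and preservation of the original meaning.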