The increasing misuse of AI-generated texts (AIGT) has motivated the rapid development of AIGT detection methods. However, these detectors remain fragile against adversarial evasion. Existing attack strategies often rely on white-box assumptions or demand prohibitively high computational and interaction costs, rendering them ineffective in practical black-box scenarios. In this paper, we propose Multi-stage Alignment for Style Humanization (MASH), a novel style-transfer-based framework for evading black-box detectors. MASH sequentially employs style-injection supervised fine-tuning, direct preference optimization, and inference-time refinement to shape the distribution of AI-generated texts to resemble that of human-written texts. Experiments across 6 datasets and 5 detectors demonstrate the superior performance of MASH over 11 baseline evaders. Specifically, MASH achieves an average Attack Success Rate (ASR) of 92%, surpassing the strongest baselines by an average of 24%, while maintaining superior linguistic quality.