The increasing misuse of AI-generated text (AIGT) has motivated the rapid development of AIGT detection methods. However, these detectors remain fragile against adversarial evasion. Existing attack strategies often rely on white-box assumptions or demand prohibitively high computational and interaction costs, which makes them impractical in realistic black-box scenarios. In this paper, we propose Multi-stage Alignment for Style Humanization (MASH), a novel style-transfer framework for evading black-box detectors. MASH sequentially applies style-injection supervised fine-tuning, direct preference optimization, and inference-time refinement to shift the distribution of AI-generated text toward that of human-written text. Experiments across 6 datasets and 5 detectors show that MASH outperforms 11 baseline evaders. Specifically, MASH achieves an average Attack Success Rate (ASR) of 92%, surpassing the strongest baselines by an average of 24%, while maintaining superior linguistic quality.
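To make the headline metric concrete, the following is a minimal sketch of how an Attack Success Rate could be computed against a black-box detector. It assumes one common definition, which the abstract does not spell out: among AI-generated texts that the detector initially flags, ASR is the fraction whose rewritten versions are no longer flagged. The names `attack_success_rate` and `toy_detector`, the 0.5 threshold, and the keyword-based scoring are all hypothetical and serve only as illustration; they are not the detectors or the evaluation code used in the paper.

```python
from typing import Callable, Sequence


def attack_success_rate(
    originals: Sequence[str],
    rewrites: Sequence[str],
    detector: Callable[[str], float],
    threshold: float = 0.5,
) -> float:
    """Fraction of initially flagged texts whose rewrites evade the detector.

    `detector` is treated as a black box that returns P(text is AI-generated).
    Assumed definition: only originals scored at or above `threshold` count
    toward the denominator.
    """
    if len(originals) != len(rewrites):
        raise ValueError("originals and rewrites must have the same length")

    # Keep only the pairs whose original text the detector flags as AI.
    flagged = [
        (orig, rew)
        for orig, rew in zip(originals, rewrites)
        if detector(orig) >= threshold
    ]
    if not flagged:
        return 0.0

    # An attack succeeds when the rewrite falls below the threshold.
    evaded = sum(1 for _, rew in flagged if detector(rew) < threshold)
    return evaded / len(flagged)


def toy_detector(text: str) -> float:
    """Hypothetical stand-in detector: flags any text containing 'delve'."""
    return 1.0 if "delve" in text.lower() else 0.0


originals = [
    "We delve into the topic.",
    "Let us delve deeper.",
    "Plain human note.",
]
rewrites = [
    "We dig into the topic.",
    "Let us delve further.",
    "Plain human note.",
]

asr = attack_success_rate(originals, rewrites, toy_detector)
print(asr)  # 0.5
```

Under this assumed definition, the third text never enters the denominator because the detector does not flag it in the first place, so the score reflects only cases where evasion was actually needed.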