AI-generated content (AIGC) detectors are increasingly deployed in high-stakes settings such as academic integrity screening, yet their reliability rests on a fundamental paradox: because language models are trained on human-written corpora, the statistical boundary between AI and human writing will inevitably dissolve as models improve. Commercial incentives have further distorted this landscape: detection services and "de-AIification" tools often operate within the same supply chain, replacing evaluation of content quality with judgment of content origin. We present StyleShield, the first flow-matching framework for conditional text style transfer. It operates directly in continuous token embedding space, using a DiT backbone with zero-initialized cross-attention adapters conditioned on frozen Qwen-7B representations. At inference, we adapt the SDEdit paradigm from image synthesis to text embeddings, so that a single parameter gamma provides smooth, continuous control over the trade-off between detector evasion and content preservation. On a multi-domain Chinese benchmark, StyleShield achieves a 94.6% evasion rate against the detector used during training and at least 99% against three unseen detectors, while maintaining a semantic similarity of 0.928. We further introduce RateAudit, a document-level scheduling algorithm showing that detection-rate verdicts can be driven to arbitrary values, which directly calls into question the reliability of score-based evaluation.
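To make the role of gamma concrete, here is a minimal sketch of SDEdit-style editing with a flow-matching ODE. It rests on stated assumptions, not on details from the paper: the linear interpolation path x_t = (1 - t)·noise + t·data (t = 0 is noise, t = 1 is data), plain Euler integration, and a generic `velocity(x, t)` callable standing in for the trained, conditioned DiT. The names `sdedit_flow` and `point_mass_velocity`, the default step count, and the toy velocity field are hypothetical illustrations, not StyleShield's implementation. With gamma = 0 the source embeddings are returned unchanged; with gamma = 1 generation starts from pure noise; intermediate values keep a fraction of the source in the starting point.

```python
from typing import Callable, Optional

import numpy as np

# Hypothetical signature: velocity(x, t) -> dx/dt. In a real system this would be
# a trained conditional flow-matching network; here it is any callable.
Velocity = Callable[[np.ndarray, float], np.ndarray]


def sdedit_flow(
    x_src: np.ndarray,
    velocity: Velocity,
    gamma: float,
    n_steps: int = 50,
    rng: Optional[np.random.Generator] = None,
) -> np.ndarray:
    """SDEdit-style editing with a flow-matching ODE (illustrative sketch).

    Assumed convention: x_t = (1 - t) * noise + t * data. gamma in [0, 1] sets
    how far the source embeddings are pushed back toward noise before being
    re-integrated to t = 1.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    if gamma == 0.0:
        return x_src.copy()  # no noise injected: the source is returned unchanged
    rng = rng if rng is not None else np.random.default_rng()
    t0 = 1.0 - gamma
    noise = rng.standard_normal(x_src.shape)
    # Jump onto the interpolation path at time t0 (gamma parts noise).
    x = (1.0 - t0) * noise + t0 * x_src
    # Integrate dx/dt = v(x, t) from t0 to 1 with explicit Euler steps.
    dt = gamma / n_steps
    for i in range(n_steps):
        t = t0 + i * dt
        x = x + dt * velocity(x, t)
    return x


def point_mass_velocity(mu: np.ndarray) -> Velocity:
    """Toy field (hypothetical): exact marginal velocity when the target is the point mu.

    Under the assumed path, v(x, t) = (mu - x) / (1 - t). The Euler loop above
    only evaluates t < 1, so the division is safe.
    """
    def v(x: np.ndarray, t: float) -> np.ndarray:
        return (mu - x) / (1.0 - t)

    return v
```

Because this toy field pulls every input exactly onto a single target point, any gamma > 0 ends at `mu`. With a trained conditional velocity network, a smaller gamma starts closer to the source and leaves more of its structure intact, which is the intuition behind using gamma to trade evasion against preservation.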