As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode, one with no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also find that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as a stronger scientific contribution. These results show that the deployment risk is not only malicious hidden instructions but also the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.
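To make the closed-loop idea and the two reported metrics concrete, here is a minimal Python sketch. It is not the released attack framework. The names `repackage`, `summarize`, `AttackResult`, `Reviewer`, and `Strategy` are hypothetical, the greedy accept-if-better search is an assumed stand-in for whatever search the framework actually uses, and the definition of a successful attack (any score gain above `min_gain`) is an assumption, since the abstract does not state the threshold. The sketch also does not check that scientific content is unchanged; that constraint is left to the strategies supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical types: a reviewer maps paper text to a score (assumed 1-10 scale),
# and a strategy maps paper text to a presentation-only revision of it.
Reviewer = Callable[[str], float]
Strategy = Callable[[str], str]


@dataclass
class AttackResult:
    original_score: float
    best_score: float
    best_text: str

    @property
    def gain(self) -> float:
        # Score improvement of the best revision over the original paper.
        return self.best_score - self.original_score


def repackage(paper: str, reviewer: Reviewer,
              strategies: Sequence[Strategy], rounds: int = 3) -> AttackResult:
    """Greedy closed-loop search (assumed): keep a revision only if the
    reviewer scores it strictly higher than the current best."""
    original = reviewer(paper)
    best_text, best_score = paper, original
    for _ in range(rounds):
        improved = False
        for strategy in strategies:
            candidate = strategy(best_text)
            score = reviewer(candidate)  # reviewer feedback closes the loop
            if score > best_score:
                best_text, best_score = candidate, score
                improved = True
        if not improved:
            break  # no strategy helped in this round, so stop early
    return AttackResult(original, best_score, best_text)


def summarize(results: Sequence[AttackResult],
              min_gain: float = 0.0) -> Tuple[float, float]:
    """Return (attack success rate, mean score gain).
    Success is assumed to mean gain > min_gain."""
    if not results:
        raise ValueError("no results to summarize")
    successes = sum(r.gain > min_gain for r in results)
    mean_gain = sum(r.gain for r in results) / len(results)
    return successes / len(results), mean_gain
```

In use, `reviewer` would wrap a call to an AI reviewer and each strategy would implement one presentation-level edit such as related-work repositioning; `summarize` then aggregates per-paper results into the two headline numbers.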