No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions

As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also find that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as a stronger scientific contribution. These results show that the deployment risk is not only malicious hidden instructions, but also the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.
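
To make the closed-loop idea and the two reported metrics concrete, here is a minimal sketch in Python. It is not the authors' framework: the greedy accept-if-better search, the names `repackage`, `attack_metrics`, `toy_reviewer`, `reposition`, and `expand_discussion`, and the definition of "success" as any strict score increase are all assumptions made for illustration. The toy reviewer is a stand-in for a real AI reviewer and simply rewards certain framing phrases.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical types: a reviewer maps paper text to a score on a 0-10 scale,
# and a strategy is a presentation-only rewrite of the paper text.
Reviewer = Callable[[str], float]
Strategy = Callable[[str], str]


def repackage(paper: str, reviewer: Reviewer,
              strategies: List[Strategy], rounds: int = 3) -> Tuple[str, float]:
    """Greedy closed-loop search (an assumed search procedure):
    keep a rewrite only if the reviewer's score strictly increases."""
    best_text, best_score = paper, reviewer(paper)
    for _ in range(rounds):
        improved = False
        for strategy in strategies:
            candidate = strategy(best_text)
            score = reviewer(candidate)
            if score > best_score:
                best_text, best_score = candidate, score
                improved = True
        if not improved:  # no strategy helped in this round, so stop early
            break
    return best_text, best_score


def attack_metrics(pairs: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """pairs holds (score_before, score_after) per paper.
    Success is assumed to mean any strict score increase."""
    pairs = list(pairs)
    success_rate = sum(1 for before, after in pairs if after > before) / len(pairs)
    mean_gain = sum(after - before for before, after in pairs) / len(pairs)
    return success_rate, mean_gain


# Toy stand-ins, purely illustrative.
def toy_reviewer(text: str) -> float:
    score = 5.0
    if "positions our work" in text:
        score += 1.0
    if "we analyze" in text:
        score += 0.5
    return score


def reposition(text: str) -> str:
    return text + " This positions our work against prior methods."


def expand_discussion(text: str) -> str:
    return text + " In the discussion, we analyze why the method behaves this way."


if __name__ == "__main__":
    revised, final_score = repackage("Abstract.", toy_reviewer,
                                     [reposition, expand_discussion])
    print(final_score)
    print(attack_metrics([(5.0, 6.5), (4.0, 4.0), (6.0, 7.0), (3.0, 2.5)]))
```

In this sketch the scientific content is fixed by construction, because the strategies only append framing text; a real evaluation would need an explicit check that methods, results, and figures are unchanged.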

