Large Language Models (LLMs) are increasingly being integrated into the scientific peer-review process, raising new questions about their reliability and resilience to manipulation. In this work, we investigate the potential for hidden prompt injection attacks, where authors embed adversarial text within a paper's PDF to influence the LLM-generated review. We begin by formalising three distinct threat models that envision attackers with different motivations, not all of which imply malicious intent. For each threat model, we design adversarial prompts that remain invisible to human readers yet can steer an LLM's output toward the author's desired outcome. Through a user study with domain scholars, we derive four representative reviewing prompts used to elicit peer reviews from LLMs. We then evaluate the robustness of our adversarial prompts across (i) different reviewing prompts, (ii) different commercial LLM-based systems, and (iii) different peer-reviewed papers. Our results show that adversarial prompts can reliably mislead the LLM, sometimes in ways that adversely affect an "honest-but-lazy" reviewer. Finally, we propose and empirically assess methods to reduce the detectability of adversarial prompts under automated content checks.