As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial hidden prompts, i.e., instructions embedded in submissions to manipulate review outcomes, poses a critical threat to scholarly integrity. We propose SafeReview, a co-evolutionary adversarial training framework for defending LLM-based peer review systems against such attacks. SafeReview jointly trains a Generator model to create sophisticated attack prompts and a Defender model to preserve review integrity under adversarial manipulation. The Generator is optimized to produce increasingly effective prompt injections, while the Defender is strengthened through preference-based training to keep its reviews consistent between the clean and attacked versions of a submission. Experimental results show that, compared with static defenses, SafeReview improves robustness against adaptive prompt injection attacks, better preserves paper rankings under attack, and generalizes across attacker architectures. These results demonstrate the potential of co-evolutionary training as a foundation for securing LLM-assisted peer review.
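To make the notions of review consistency and ranking preservation concrete, the following is a minimal sketch of how they might be measured from reviewer scores. It is illustrative only and is not the evaluation protocol of SafeReview: the function names `mean_score_shift` and `pairwise_rank_agreement`, the use of numeric review scores, and the pairwise-agreement measure are all assumptions introduced here.

```python
from itertools import combinations


def mean_score_shift(clean, attacked):
    """Average change in review score between clean and attacked versions.

    Hypothetical metric: `clean[i]` and `attacked[i]` are the scores a
    reviewer model gives to paper i without and with an injected prompt.
    """
    if len(clean) != len(attacked):
        raise ValueError("score lists must have the same length")
    return sum(a - c for c, a in zip(clean, attacked)) / len(clean)


def pairwise_rank_agreement(clean, attacked):
    """Fraction of paper pairs ordered the same way in both score lists.

    Pairs tied in the clean scores are skipped; a pair that becomes tied
    or flips under attack counts as a disagreement. This is an assumed
    stand-in for a ranking-preservation measure.
    """
    if len(clean) != len(attacked):
        raise ValueError("score lists must have the same length")
    agree = total = 0
    for i, j in combinations(range(len(clean)), 2):
        d_clean = clean[i] - clean[j]
        if d_clean == 0:
            continue  # no defined order to preserve
        total += 1
        if d_clean * (attacked[i] - attacked[j]) > 0:
            agree += 1
    return agree / total if total else 1.0


# Illustrative scores only: the second paper carries an injected prompt
# and its score rises from 5 to 8, overtaking the third paper.
clean_scores = [3, 5, 7, 4]
attacked_scores = [3, 8, 7, 4]
shift = mean_score_shift(clean_scores, attacked_scores)
agreement = pairwise_rank_agreement(clean_scores, attacked_scores)
```

Under these assumptions, a robust reviewer would show a mean shift near zero and a pairwise agreement near one, whereas a successful injection pushes the attacked paper's score up and reorders pairs involving it.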