As Large Language Models (LLMs) are increasingly integrated into academic peer review, their vulnerability to adversarial prompts (instructions embedded in submissions to manipulate review outcomes) emerges as a critical threat to scholarly integrity. To counter this, we propose a novel adversarial framework in which a Generator model, trained to create sophisticated attack prompts, is jointly optimized with a Defender model tasked with detecting them. The system is trained with a loss function inspired by Information Retrieval Generative Adversarial Networks (IRGAN), which drives the two models to co-evolve: the Defender must develop robust detection capabilities against continuously improving attack strategies. The resulting framework demonstrates significantly greater resilience to novel and evolving threats than static defenses, establishing a foundation for securing the integrity of peer review.