Editors of academic journals and program chairs of conferences require peer reviewers to write their own reviews. However, there is growing concern about the rise of lazy reviewing practices, where reviewers use large language models (LLMs) to generate reviews instead of writing them independently. Existing tools for detecting LLM-generated content are not designed to differentiate between fully LLM-generated reviews and those merely polished by an LLM. In this work, we employ a straightforward approach to identify LLM-generated reviews - doing an indirect prompt injection via the paper PDF to ask the LLM to embed a watermark. Our focus is on presenting watermarking schemes and statistical tests that maintain a bounded family-wise error rate, when a venue evaluates multiple reviews, with a higher power as compared to standard methods like Bonferroni correction. These guarantees hold without relying on any assumptions about human-written reviews. We also consider various methods for prompt injection including font embedding and jailbreaking. We evaluate the effectiveness and various tradeoffs of these methods, including different reviewer defenses. We find a high success rate in the embedding of our watermarks in LLM-generated reviews across models. We also find that our approach is resilient to common reviewer defenses, and that the bounds on error rates in our statistical tests hold in practice while having the power to flag LLM-generated reviews, while Bonferroni correction is infeasible.
翻译:学术期刊编辑和会议程序委员会主席要求同行评审员独立撰写评审意见。然而,人们日益担忧"懒惰评审"现象的兴起,即评审员使用大型语言模型生成评审意见,而非独立撰写。现有的LLM生成内容检测工具并非设计用于区分完全由LLM生成的评审与仅经LLM润色的评审。在本研究中,我们采用一种直接的方法来识别LLM生成的评审——通过论文PDF文件进行间接提示注入,要求LLM嵌入水印。我们的重点是提出水印方案和统计检验方法,这些方法在学术机构评估多份评审时能维持有界的族错误率,并且相较于Bonferroni校正等标准方法具有更高的检验效能。这些保证的成立无需依赖任何关于人工撰写评审的假设。我们还考虑了多种提示注入方法,包括字体嵌入和越狱技术。我们评估了这些方法的有效性及各种权衡因素,包括不同的评审员防御策略。我们发现,在不同模型中,我们的水印在LLM生成评审中的嵌入成功率很高。同时,我们的方法对常见的评审员防御策略具有鲁棒性,并且我们的统计检验在实际应用中能维持误差率界限,同时具备标记LLM生成评审的能力,而Bonferroni校正在此场景下并不可行。