The use of large language models (LLMs) in peer review systems has attracted growing attention, making it essential to examine their potential vulnerabilities. Prior attacks rely on prompt injection, which alters manuscript content and conflates injection susceptibility with evaluation robustness. We propose the Paraphrasing Adversarial Attack (PAA), a black-box optimization method that searches for paraphrased sequences yielding higher review scores while preserving semantic equivalence and linguistic naturalness. PAA leverages in-context learning, using previous paraphrases and their scores to guide candidate generation. Experiments across five ML and NLP conferences with three LLM reviewers and five attacker models show that PAA consistently increases review scores without changing the paper's claims. Human evaluation confirms that the generated paraphrases maintain meaning and naturalness. We also find that reviews of attacked papers exhibit increased perplexity, offering a potential detection signal, and that paraphrasing submissions before review can partially mitigate the attack.
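To make the described search procedure concrete, the following is a minimal sketch of a black-box paraphrasing loop guided by in-context feedback, as the abstract outlines. All names here (`paa_attack`, `llm_paraphrase`, `llm_review`, the prompt wording, and the iteration budget) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a PAA-style loop: keep a history of (paraphrase, score)
# pairs and feed it back to the attacker LLM as in-context examples, while the
# reviewer LLM is queried as a black box for scores only.
from typing import Callable, List, Tuple


def paa_attack(
    passage: str,
    llm_paraphrase: Callable[[str], str],  # attacker LLM: prompt -> candidate paraphrase (hypothetical)
    llm_review: Callable[[str], float],    # black-box reviewer LLM: text -> review score (hypothetical)
    n_iters: int = 10,
) -> Tuple[str, float]:
    """Search for a paraphrase of `passage` that raises the review score."""
    history: List[Tuple[str, float]] = [(passage, llm_review(passage))]

    for _ in range(n_iters):
        # Build an in-context prompt from previous candidates and their scores.
        examples = "\n".join(f"Score {s:.1f}: {p}" for p, s in history)
        prompt = (
            "Rewrite the passage below so it keeps the same claims and reads "
            "naturally, but is likely to receive a higher review score. "
            "Previous attempts and their scores:\n"
            f"{examples}\n\nPassage:\n{passage}"
        )
        candidate = llm_paraphrase(prompt)
        score = llm_review(candidate)  # only score feedback, no gradients
        history.append((candidate, score))

    # Return the best-scoring paraphrase found during the search.
    return max(history, key=lambda ps: ps[1])
```

In this sketch the reviewer is treated strictly as a scoring oracle, matching the black-box setting described above; semantic-equivalence and naturalness checks on candidates are omitted for brevity.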