Driven by surging submission volumes, scientific peer review has catalyzed two parallel trends: individual over-reliance on LLMs and institutional AI-powered assessment systems. This study investigates the robustness of "LLM-as-a-Judge" systems to adversarial PDF manipulation via invisible text injections and layout aware encoding attacks. We specifically target the distinct incentive of flipping "Reject" decisions to "Accept," a vulnerability that fundamentally compromises scientific integrity. To measure this, we introduce the Weighted Adversarial Vulnerability Score (WAVS), a novel metric that quantifies susceptibility by weighting score inflation against the severity of decision shifts relative to ground truth. We adapt 15 domain-specific attack strategies, ranging from semantic persuasion to cognitive obfuscation, and evaluate them across 13 diverse language models (including GPT-5 and DeepSeek) using a curated dataset of 200 official and real-world accepted and rejected submissions (e.g., ICLR OpenReview). Our results demonstrate that obfuscation techniques like "Maximum Mark Magyk" and "Symbolic Masking & Context Redirection" successfully manipulate scores, achieving decision flip rates of up to 86.26% in open-source models, while exposing distinct "reasoning traps" in proprietary systems. We release our complete dataset and injection framework to facilitate further research on the topic (https://anonymous.4open.sciencer/llm-jailbreak-FC9E/).
翻译:在投稿量激增的驱动下,科学同行评审催生了两个并行趋势:个体对大型语言模型(LLM)的过度依赖和机构采用AI驱动的评估系统。本研究调查了“LLM即评审员”系统在面对通过隐形文本注入和布局感知编码攻击进行的对抗性PDF篡改时的鲁棒性。我们特别关注将“拒绝”决定翻转为“接受”这一独特动机,该漏洞从根本上损害了科学诚信。为量化此问题,我们引入了加权对抗脆弱性评分(WAVS),这是一种新颖的度量标准,通过根据决策偏移相对于真实情况的严重程度对分数膨胀进行加权,来量化系统的易受攻击性。我们调整了15种领域特定的攻击策略(从语义说服到认知混淆),并使用包含200份官方及真实世界已接收与拒稿提交(例如ICLR OpenReview)的精选数据集,在13种不同的语言模型(包括GPT-5和DeepSeek)上进行了评估。我们的结果表明,诸如“最大标记魔法”和“符号掩蔽与上下文重定向”等混淆技术能够成功操纵评分,在开源模型中实现了高达86.26%的决策翻转率,同时揭示了专有系统中独特的“推理陷阱”。我们发布了完整的数据集和注入框架,以促进该主题的进一步研究(https://anonymous.4open.sciencer/llm-jailbreak-FC9E/)。