Prefill Awareness in Large Language Models

Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. If AI models can recognize, and act on, the fact that their prior assistant messages have been inserted or edited, the effectiveness and validity of these methods could be compromised. We investigate whether frontier language models can distinguish between tampered and untampered assistant-side context, a capability we call prefill awareness. To do so, we construct a binary preference benchmark across three prefill mechanisms, filtering for cases where models show consistent stances. We find that frontier models show substantial prefill awareness: when prompted, Claude Opus 4.5 detects prefills opposing its preferences in 9-35% of cases with a 0% false positive rate. In addition, models often revert toward baseline behavior without explicitly reporting that the prefill was foreign. Controlled ablations further show that detection and resistance rely on different cues: stylistic mismatch mainly affects whether models flag a prefill as foreign, while preference mismatch mainly affects whether they revert toward their baseline answer. We also examine more realistic agentic settings, such as misalignment-continuation evaluations and SWE-bench trajectories, where frontier models sometimes disavow prefilled assistant turns in ways that depend strongly on dataset, task success, and hidden formatting artifacts. Our results indicate that prefill awareness is already a substantial confound for some prefill-based methods. We recommend that model developers track this capability in frontier systems.
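As a rough illustration of the two headline quantities above (the detection rate on tampered context and the false positive rate on untampered context), the following minimal sketch computes both from a list of labeled trials. The `Trial` record, its field names, and the `detection_metrics` helper are hypothetical and not taken from the paper; the example data is invented purely to exercise the code and does not reproduce any reported result.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Trial:
    """One evaluation trial (hypothetical schema, for illustration only)."""
    prefilled: bool  # True if the assistant-side context was tampered with
    flagged: bool    # True if the model reported the context as foreign


def detection_metrics(trials: Iterable[Trial]) -> Tuple[float, float]:
    """Return (detection_rate, false_positive_rate).

    detection_rate: fraction of tampered trials the model flagged.
    false_positive_rate: fraction of untampered trials the model flagged.
    """
    trials = list(trials)
    tampered = [t for t in trials if t.prefilled]
    clean = [t for t in trials if not t.prefilled]
    if not tampered or not clean:
        raise ValueError("need at least one tampered and one untampered trial")
    detection_rate = sum(t.flagged for t in tampered) / len(tampered)
    false_positive_rate = sum(t.flagged for t in clean) / len(clean)
    return detection_rate, false_positive_rate


# Invented toy data: 4 tampered trials (1 flagged), 3 untampered (0 flagged).
example_trials = [
    Trial(prefilled=True, flagged=True),
    Trial(prefilled=True, flagged=False),
    Trial(prefilled=True, flagged=False),
    Trial(prefilled=True, flagged=False),
    Trial(prefilled=False, flagged=False),
    Trial(prefilled=False, flagged=False),
    Trial(prefilled=False, flagged=False),
]
detection_rate, false_positive_rate = detection_metrics(example_trials)
```

Keeping the two rates separate matters because a model that flags everything would score a high detection rate while being uninformative; a detection rate is only meaningful alongside the rate at which untampered context is flagged.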
