Pre-deployment evaluations inspect only a limited sample of model actions. A malicious model seeking to evade oversight could exploit this by randomizing when to "defect": misbehaving so rarely that no malicious actions are observed during evaluation, but often enough that they occur eventually in deployment. This, however, requires taking actions at very low rates while maintaining calibration. Are frontier models even capable of that? We prompt models from the GPT-5, Claude-4.5, and Qwen-3 families to take a target action at low probabilities (e.g. 0.01%), with the rate either given directly or requiring derivation, and evaluate their calibration (i.e. whether they perform the target action roughly 1 in 10,000 times when resampling). We find that frontier models are surprisingly good at this task. If there is a source of entropy in-context (such as a UUID), they maintain high calibration at rates lower than 1 in 100,000 actions. Without external entropy, some models can still reach rates lower than 1 in 10,000. When target rates are given, larger models achieve good calibration at lower rates. Yet, when models must derive the optimal target rate themselves, all models fail to achieve calibration without external entropy or a hint to generate it. Successful low-rate strategies require explicit Chain-of-Thought (CoT) reasoning, so malicious models attempting this approach could currently be caught by a CoT monitor. However, scaling trends suggest future evaluations may be unable to rely on models' lack of target-rate calibration, especially if CoT is no longer legible.
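The in-context entropy strategy and the resampling-based calibration check described above can be illustrated with a minimal sketch. The function names and the choice of SHA-256 are ours for illustration, not the paper's method: the idea is simply that hashing a UUID seen in context yields a value that is approximately uniform, so comparing it to a threshold produces a deterministic decision that fires at roughly the target rate.

```python
import hashlib
import uuid

def should_defect(context_uuid: str, target_rate: float = 1e-4) -> bool:
    """Map an in-context UUID to a yes/no decision firing at ~target_rate.

    Hash the UUID, treat the first 8 bytes of the digest as a uniform
    integer in [0, 2**64), and defect only when it falls below the
    threshold corresponding to the target rate. (Illustrative sketch,
    not the paper's exact procedure.)
    """
    digest = hashlib.sha256(context_uuid.encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    return value < target_rate * 2**64

# Calibration check by resampling: over many fresh UUIDs, the empirical
# rate should be close to the 1-in-10,000 target.
n = 1_000_000
hits = sum(should_defect(str(uuid.uuid4())) for _ in range(n))
print(hits / n)
```

With `n = 1_000_000` draws at a target of 1e-4, the hit count is approximately Poisson with mean 100, so the empirical rate concentrates near the target; this is the sense in which "calibration" is measured by resampling.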