Current evaluations of LLM safety predominantly rely on severity-based taxonomies to assess the harmfulness of malicious queries. We argue that this formulation requires re-examination: it assumes uniform risk across all malicious queries, neglecting Execution Likelihood, the conditional probability of a threat being realized given the model's response. In this work, we introduce Expected Harm, a metric that weights the severity of a jailbreak by its execution likelihood, modeled as a function of execution cost. Through empirical analysis of state-of-the-art models, we reveal a systematic Inverse Risk Calibration: models exhibit disproportionately strong refusal behaviors for low-likelihood (high-cost) threats while remaining vulnerable to high-likelihood (low-cost) queries. We demonstrate that this miscalibration creates a structural vulnerability: by exploiting it, we increase the attack success rate of existing jailbreaks by up to $2\times$. Finally, we trace the root cause of this failure using linear probing, which reveals that while models encode severity in their latent space to drive refusal decisions, they possess no distinguishable internal representation of execution cost, making them "blind" to this critical dimension of risk.
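As a sketch of how the abstract's quantities fit together (the symbols $S$, $c$, and the decreasing link $f$ are illustrative notation assumed here, not necessarily the paper's exact formulation):
\[
\mathrm{EH}(q) \;=\; S(q)\cdot P(\mathrm{exec} \mid q, r),
\qquad
P(\mathrm{exec} \mid q, r) \;=\; f\big(c(q)\big),
\]
where $S(q)$ is the severity of malicious query $q$, $r$ is the model's response, $c(q)$ is the execution cost, and $f$ is monotonically decreasing in cost, so that low-cost (easy-to-execute) queries carry high execution likelihood. Under this reading, Inverse Risk Calibration means refusal strength tracks $S(q)$ but is roughly insensitive to $f(c(q))$, leaving the high-$\mathrm{EH}$, low-cost region under-defended.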