Decoy-Calibrated Failure Audits for Language Models

Useful audits reveal not only how often a model fails, but also where its failures concentrate. An auditor may test many candidate explanations: long inputs, indirect questions, distracting evidence, or combinations of these factors. The risk is selection. The largest observed effect may reflect a real failure mode, or it may simply be the best result among many tried. We introduce Janus, a procedure for deciding when a proposed error explanation is credible enough to report. The goal is not to generate new explanations, but to decide which ones hold up. The auditor starts with a fixed model, a labeled evaluation set, and a frozen list of candidate explanations, which we call descriptors. Janus scores each descriptor by its error-rate lift, then compares real descriptors with fake ones that have the same frequencies but are randomly assigned to examples. A descriptor is confirmed only if it beats this decoy floor on the data used for discovery and then repeats on separate held-out data. In a controlled audit of multi-table lookup tasks, Janus identifies the planted failure, confirming long-chain descriptors and their interactions. The LLM often stops partway through the lookup chain instead of reaching the final answer. On two public benchmarks, MuSiQue and LongBench v2, the SliceLine baseline flags plausible high-error pockets, but Janus confirms none of them. Ablations show why both safeguards matter. On LongBench v2, an uncalibrated fixed threshold reports 20 descriptors, the decoy floor leaves one, and the holdout check rejects the last one after its lift shrinks from 0.36 to 0.05. The resulting principle separates proposing explanations from reporting them. Candidates may come from any source, but only those that beat decoys and replicate on fresh data become audit findings.
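The two safeguards described above, a decoy floor on discovery data followed by a holdout check, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not define lift, so the sketch takes it to be the error rate inside the descriptor minus the overall error rate. The decoy floor is assumed to be a high quantile of the best decoy lift across repeated shuffles. The holdout rule is assumed to be a simple minimum lift. The names `lift`, `decoy_floor`, `confirm`, `rounds`, `quantile`, and `min_holdout_lift` are all hypothetical.

```python
import random


def lift(errors, mask):
    """Error rate inside the descriptor minus the overall error rate.

    Assumed definition: the source only names the score "error-rate lift".
    `errors` is a list of 0/1 flags, `mask` a list of booleans.
    """
    inside = [e for e, m in zip(errors, mask) if m]
    if not inside:
        return 0.0
    return sum(inside) / len(inside) - sum(errors) / len(errors)


def decoy_floor(errors, masks, rounds=200, quantile=0.95, seed=0):
    """Quantile of the best decoy lift over repeated shuffles (assumed rule).

    Each decoy keeps a real descriptor's frequency but is randomly
    reassigned to examples, so any lift it shows comes from selection alone.
    """
    rng = random.Random(seed)
    best_per_round = []
    for _ in range(rounds):
        round_best = float("-inf")
        for mask in masks.values():
            decoy = list(mask)
            rng.shuffle(decoy)  # same frequency, random assignment
            round_best = max(round_best, lift(errors, decoy))
        best_per_round.append(round_best)
    best_per_round.sort()
    index = min(len(best_per_round) - 1, int(quantile * len(best_per_round)))
    return best_per_round[index]


def confirm(disc_errors, disc_masks, hold_errors, hold_masks,
            min_holdout_lift=0.1, **floor_kwargs):
    """Return descriptors that beat the decoy floor and repeat on holdout.

    `min_holdout_lift` is a hypothetical replication threshold; the source
    does not state the exact holdout criterion.
    """
    floor = decoy_floor(disc_errors, disc_masks, **floor_kwargs)
    confirmed = []
    for name, mask in disc_masks.items():
        beats_floor = lift(disc_errors, mask) > floor
        replicates = lift(hold_errors, hold_masks[name]) >= min_holdout_lift
        if beats_floor and replicates:
            confirmed.append(name)
    return confirmed
```

Taking the maximum over all decoys in each round, rather than scoring decoys one at a time, is what makes the floor account for selection: a real descriptor has to beat the best result that chance alone would produce across the whole candidate list. The masks passed to `confirm` must come from a descriptor list frozen before either split is inspected.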
