Detecting when a language model is wrong without ground-truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model's own uncertainty, such as token entropy or confidence scores, but these signals fail on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator: a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model's generated answer, cross-model disagreement measures how surprised or uncertain a second, verifying model is when it reads that answer in a single forward pass. The verifying model does not need to generate anything, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model's surprise at the generating model's answer tokens, and Cross-Model Entropy (CME), which measures the verifying model's uncertainty at those token positions. Both CMP and CME outperform within-model uncertainty baselines on benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75, compared with 0.59 for a within-model entropy baseline. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.
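The two scores described above can be illustrated with a minimal sketch. The abstract does not give exact formulas, so the definitions here are assumptions: CMP is taken as the exponential of the verifying model's mean negative log-likelihood over the answer tokens, and CME as the mean Shannon entropy (in nats) of the verifying model's next-token distribution at those positions. The function names and the input layout (one logit vector per answer token, obtained from a single forward pass of the verifying model) are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_model_scores(
    verifier_logits: Sequence[Sequence[float]],
    answer_token_ids: Sequence[int],
) -> Tuple[float, float]:
    """Return (cmp, cme) for one generated answer.

    verifier_logits[t] is assumed to be the verifying model's logit vector
    for the position that predicts answer token t (hypothetical layout).
    answer_token_ids[t] is the token the generating model actually produced.
    """
    if not answer_token_ids or len(verifier_logits) != len(answer_token_ids):
        raise ValueError("need one logit vector per answer token")

    nll_sum = 0.0      # summed negative log-likelihood of the answer tokens
    entropy_sum = 0.0  # summed entropy of the verifier's distributions
    for logits, token_id in zip(verifier_logits, answer_token_ids):
        log_probs = log_softmax(logits)
        nll_sum -= log_probs[token_id]
        entropy_sum -= sum(math.exp(lp) * lp for lp in log_probs)

    n = len(answer_token_ids)
    cmp_score = math.exp(nll_sum / n)  # assumed definition of CMP
    cme_score = entropy_sum / n        # assumed definition of CME
    return cmp_score, cme_score


# Toy example with a 3-token vocabulary and a 2-token answer.
# The verifier strongly prefers token 0 at both positions.
toy_logits = [[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
agree_cmp, agree_cme = cross_model_scores(toy_logits, [0, 0])
disagree_cmp, disagree_cme = cross_model_scores(toy_logits, [1, 2])
print(f"verifier agrees:    CMP={agree_cmp:.3f}  CME={agree_cme:.3f}")
print(f"verifier disagrees: CMP={disagree_cmp:.3f}  CME={disagree_cme:.3f}")
```

In this toy case CMP rises when the answer tokens differ from what the verifying model expects, while CME is identical in both calls because it depends only on the verifying model's distributions, not on which tokens were generated. In practice the logits would come from running the verifying model once over the prompt and the generated answer; how scores are thresholded or ranked downstream is not specified in the abstract.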