Cross-Model Disagreement as a Label-Free Correctness Signal

Detecting when a language model is wrong without ground-truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model's own uncertainty, such as token entropy or confidence scores, but these signals fail on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator: a simple, training-free signal that can be added to existing production systems, routing pipelines, and deployment monitoring infrastructure without modifying them. Given a model's generated answer, cross-model disagreement measures how surprised or uncertain a second, verifying model is when it reads that answer in a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model's surprise at the generating model's answer tokens, and Cross-Model Entropy (CME), which measures the verifying model's uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines across benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75 against a within-model entropy baseline of 0.59. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.
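To make the two signals concrete, the following is a minimal sketch of how CMP and CME might be computed from a verifying model's output. It assumes the standard definitions (CMP as the exponential of the mean negative log-likelihood the verifier assigns to the generator's answer tokens, CME as the mean Shannon entropy of the verifier's next-token distribution at those positions); the abstract does not spell out the exact formulas, so these are assumptions. The function name `cross_model_scores` and its input layout are hypothetical, and the verifier's logits are assumed to have been obtained already from one forward pass over the prompt and answer.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_model_scores(
    verifier_logits: Sequence[Sequence[float]],
    answer_token_ids: Sequence[int],
) -> Tuple[float, float]:
    """Return (CMP, CME) under the assumed definitions.

    verifier_logits[t] holds the verifier's next-token logits at answer
    position t, conditioned on the prompt and the preceding answer tokens.
    answer_token_ids[t] is the token the generating model actually emitted.
    """
    if not answer_token_ids or len(verifier_logits) != len(answer_token_ids):
        raise ValueError("need one logit vector per answer token")

    nll_sum = 0.0      # summed negative log-likelihood of the answer tokens
    entropy_sum = 0.0  # summed entropy of the verifier's distributions
    for logits, token_id in zip(verifier_logits, answer_token_ids):
        log_probs = log_softmax(logits)
        nll_sum -= log_probs[token_id]
        entropy_sum -= sum(math.exp(lp) * lp for lp in log_probs)

    n = len(answer_token_ids)
    cmp_score = math.exp(nll_sum / n)  # perplexity: higher = more surprised
    cme_score = entropy_sum / n        # mean entropy in nats
    return cmp_score, cme_score
```

Under these assumptions, a verifier that is uniform over a four-token vocabulary yields a CMP of 4 and a CME of ln 4, whichever tokens the generator chose. Higher values of either score would be read as a sign that the answer is more likely to be wrong.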
