Epistemic Observability in Language Models

We find that models report their highest confidence precisely when they are fabricating. Across four model families (OLMo-3, Llama-3.1, Qwen3, Mistral), self-reported confidence is inversely correlated with accuracy, with AUC between 0.28 and 0.36 (0.5 corresponds to random guessing). We prove, under explicit formal assumptions, that this is not a capability gap but an observational one. Under text-only observation, where a supervisor sees only the model's output text, no monitoring system can reliably distinguish honest outputs from plausible fabrications. We prove two results. First, no policy that conditions only on the query can satisfy epistemic honesty across ambiguous world states. Second, no learning algorithm that optimizes reward from a text-only supervisor can converge to honest behavior when the supervisor's observations are identical for grounded and fabricated responses. Within our formal model, these impossibility results hold regardless of model scale or training procedure, including RLHF and instruction tuning. We construct a tensor interface that escapes the impossibility by exporting computational byproducts (per-token entropy and log-probability distributions) that are structurally coupled to correctness under standard training. Per-token entropy achieves a pooled AUC of 0.757 and outperforms all text baselines by 2.5--3.9 percentage points at every budget level tested (10\%, 20\%, 30\%). The entropy signal generalizes across architectures (Spearman $\rho = 0.762$). The core contribution is a cost surface: an empirical mapping from verification budget (the fraction of queries receiving expensive checks) to detection accuracy for each judge strategy, which system builders can use as a practical lookup when deciding how to allocate verification resources. The contribution is the map. The territory is the system you are building.
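As an illustration of the quantities named in the abstract, the following is a minimal sketch, not the paper's implementation. It computes per-token entropy from next-token logits, scores how well a response-level score separates erroneous from correct responses (AUC), and reports the fraction of errors caught when only a fixed budget of responses is verified. The function names, the use of mean entropy as the response-level score, and the "verify the highest-scoring responses first" rule are all assumptions made for this example.

```python
import numpy as np


def token_entropies(logits):
    """Shannon entropy (in nats) of the next-token distribution at each position.

    logits: array of shape (num_tokens, vocab_size).
    """
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for a stable softmax
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)


def response_score(logits):
    """Assumed aggregation: mean per-token entropy over the response."""
    return float(token_entropies(logits).mean())


def auc(scores, is_error):
    """Probability that a random erroneous response outscores a random
    correct one, with ties counted as one half."""
    scores = np.asarray(scores, dtype=float)
    is_error = np.asarray(is_error, dtype=bool)
    pos, neg = scores[is_error], scores[~is_error]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))


def errors_caught(scores, is_error, budget):
    """Fraction of all errors found when only the top `budget` fraction of
    responses (ranked by score, highest first) is sent for verification."""
    scores = np.asarray(scores, dtype=float)
    is_error = np.asarray(is_error, dtype=bool)
    k = max(1, round(budget * len(scores)))
    flagged = np.argsort(-scores, kind="stable")[:k]
    return float(is_error[flagged].sum() / is_error.sum())


if __name__ == "__main__":
    # Synthetic data only: these numbers say nothing about any real model.
    rng = np.random.default_rng(0)
    is_error = rng.random(200) < 0.3
    # Hypothetical setup: erroneous responses get flatter (higher-entropy) logits.
    scores = [
        response_score(rng.normal(scale=1.0 if err else 3.0, size=(12, 50)))
        for err in is_error
    ]
    print("AUC:", auc(scores, is_error))
    for budget in (0.10, 0.20, 0.30):
        print(budget, errors_caught(scores, is_error, budget))
```

Sweeping `budget` for several scoring rules and tabulating `errors_caught` gives one row per judge strategy, which is the general shape of the budget-to-detection mapping the abstract calls a cost surface. An AUC below 0.5, like the 0.28 to 0.36 reported for self-reported confidence, means the score ranks errors lower than correct responses more often than not.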
