Modern neural networks often produce miscalibrated confidence scores and struggle to detect out-of-distribution (OOD) inputs, yet most existing methods post-process model outputs without testing their internal consistency. We introduce the Bag-of-Coins (BoC) probe, a non-parametric diagnostic of logit coherence that compares the softmax confidence $\hat p$ with an aggregate $\bar q$ of pairwise Luce-style dominance probabilities, yielding a deterministic coherence score and a p-value-based structural score. Across ViT, ResNet, and RoBERTa evaluated on ID/OOD test sets, the coherence gap $\Delta=\bar q-\hat p$ separates ID from OOD inputs clearly for ViT (ID ${\sim}0.1$--$0.2$, OOD ${\sim}0.5$--$0.6$) but overlaps substantially for ResNet and RoBERTa (both ${\sim}0$), indicating that uncertainty geometry is architecture-dependent. As a practical method, BoC improves calibration only when the base model is poorly calibrated (ViT: ECE $0.024$ vs.\ $0.180$) and still underperforms standard calibrators (ECE ${\sim}0.005$); for OOD detection it fails across architectures (AUROC $0.020$--$0.253$, compared with $0.75$--$0.99$ for standard scores). We therefore position BoC as a research diagnostic for interrogating how architectures encode uncertainty in logit geometry, rather than as a production calibration or OOD detection method.
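To make the coherence gap concrete, the following is a minimal sketch of how $\Delta=\bar q-\hat p$ could be computed from a single logit vector. It rests on assumptions that the abstract does not state: the pairwise Luce-style dominance probability of the top class $c$ over class $k$ is taken to be $e^{z_c}/(e^{z_c}+e^{z_k})=\sigma(z_c-z_k)$, and $\bar q$ is taken to be the unweighted mean of these probabilities over all $k\neq c$. The function names are hypothetical, and the actual BoC aggregate and the p-value-based structural score may be defined differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def coherence_gap(logits):
    """Return (p_hat, q_bar, delta) for one logit vector.

    ASSUMPTION: q_bar is the plain mean of the pairwise Luce
    probabilities sigma(z_top - z_k) over all k != top. The source
    only says "an aggregate", so this choice is illustrative.
    """
    if len(logits) < 2:
        raise ValueError("need at least two classes")
    top = max(range(len(logits)), key=lambda i: logits[i])
    p_hat = softmax(logits)[top]  # softmax confidence of the predicted class
    pairwise = [
        sigmoid(logits[top] - z) for k, z in enumerate(logits) if k != top
    ]
    q_bar = sum(pairwise) / len(pairwise)  # assumed aggregate (mean)
    return p_hat, q_bar, q_bar - p_hat


if __name__ == "__main__":
    for name, z in [("peaked", [8.0, 0.5, 0.0, -1.0]),
                    ("flat", [1.0, 0.9, 0.8, 0.7])]:
        p_hat, q_bar, delta = coherence_gap(z)
        print(f"{name}: p_hat={p_hat:.3f} q_bar={q_bar:.3f} delta={delta:.3f}")
```

Under these assumptions each pairwise probability is at least as large as the softmax confidence, because its denominator drops the remaining classes, so $\Delta\ge 0$, with equality in the two-class case. A gap near zero then corresponds to a sharply peaked logit vector, and a large gap to one where several classes compete with the top class.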