Predictable Compression Failures: Order Sensitivity and Information Budgeting for Evidence-Grounded Binary Adjudication

Transformers used for evidence-grounded question answering with binary adjudication (e.g., support/refute or yes/no) can be highly sensitive to the order in which exchangeable evidence is presented, producing dispersion across permutations and unreliable attempted answers (``hallucinations'' under a Bernoulli predicate). We treat evidence order as a nuisance variable and show that next-token training minimizes expected conditional description length over orderings. This objective can be close to Bayes-optimal in expectation while deviating under any fixed ordering. We quantify this expectation--realization gap via a Quantified Martingale Violation (QMV) bound that predicts $\mathcal{O}(\log n)$ growth in permutation dispersion under harmonic positional sensitivity. We then derive the Expectation-level Decompression Law (EDFL), relating expected information budget to achievable reliability for Bernoulli predicates, and use it to define \emph{Bits-to-Trust} (B2T), \emph{Risk-of-Hallucination} (RoH), and the \emph{Information Sufficiency Ratio} (ISR), together with a fixed ISR-gating rule for answer/abstain decisions under permutation mixtures. On 3,059 grounded items from a five-benchmark evidence-grounded QA suite (FEVER, HotpotQA, NQ-Open, PopQA, and Controls), we observe logarithmic dispersion and Jensen gains from uniform permutation mixtures. In a pre-specified held-out audit (528 items), an ISR $= 1$ gate attains 0.0--0.7\% hallucination with 20.6--27.9\% abstention (95\% confidence intervals).
