Transformers used for evidence-grounded question answering with binary adjudication (e.g., support/refute or yes/no) can be highly sensitive to the order in which exchangeable evidence is presented, producing dispersion across permutations and unreliable attempted answers (``hallucinations'' under a Bernoulli predicate). We treat evidence order as a nuisance variable and show that next-token training minimizes expected conditional description length over orderings. This objective can be close to Bayes-optimal in expectation while deviating under any fixed ordering. We quantify this expectation--realization gap via a Quantified Martingale Violation (QMV) bound that predicts $\mathcal{O}(\log n)$ growth in permutation dispersion under harmonic positional sensitivity. We then derive the Expectation-level Decompression Law (EDFL), relating expected information budget to achievable reliability for Bernoulli predicates, and use it to define \emph{Bits-to-Trust} (B2T), \emph{Risk-of-Hallucination} (RoH), and the \emph{Information Sufficiency Ratio} (ISR), together with a fixed ISR-gating rule for answer/abstain decisions under permutation mixtures. On 3,059 grounded items from a five-benchmark evidence-grounded QA suite (FEVER, HotpotQA, NQ-Open, PopQA, and Controls), we observe logarithmic dispersion and Jensen gains from uniform permutation mixtures. In a pre-specified held-out audit (528 items), an ISR $= 1$ gate attains 0.0--0.7\% hallucination with 20.6--27.9\% abstention (95\% confidence intervals).
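
As a brief illustration of the two quantitative claims above, the following is a sketch in notation assumed here rather than taken from the paper: $q_\pi(y)$ is the model's probability of the correct label $y$ under ordering $\pi$ of the $n$ evidence items, and $\bar{q}(y)$ is the uniform mixture over all $n!$ orderings. Convexity of $-\log$ (Jensen's inequality) gives
\[
-\log \bar{q}(y) \;=\; -\log\Bigl(\frac{1}{n!}\sum_{\pi} q_\pi(y)\Bigr) \;\le\; \frac{1}{n!}\sum_{\pi}\bigl(-\log q_\pi(y)\bigr),
\]
with equality only when $q_\pi(y)$ is identical across orderings. The mixture's log loss therefore never exceeds the average per-ordering log loss, and the difference is the Jensen gain. Likewise, if ``harmonic positional sensitivity'' is read as a per-position sensitivity proportional to $1/i$ (an assumed reading, for illustration only), the accumulated effect over $n$ positions is the harmonic number
\[
H_n \;=\; \sum_{i=1}^{n} \frac{1}{i} \;=\; \ln n + \gamma + \mathcal{O}(1/n),
\]
where $\gamma$ is the Euler--Mascheroni constant, which is consistent with the stated $\mathcal{O}(\log n)$ growth in dispersion.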