Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization

Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) without any supervision, by measuring the geometric distance between cross-attention distributions and a reference distribution. We extend this analysis to all six decoder layers of the Fairseq DE-EN model ($N=3{,}414$) and show that Wass-to-Unif and Wass-to-Data are complementary detectors, each specialised for different hallucination types; that detection is concentrated in layers L1--L4, with L5 anti-predictive for subtler types; and that, from the first decoding step onward, hallucinated translations lack the exploratory attention phase present in correct translations. We further evaluate whether the geometric signal transfers to faithfulness detection in abstractive summarization: our unsupervised OT detector on AggreFact ($N=1{,}116$) achieves $57.2\%$/$57.6\%$ balanced accuracy on CNN/XSum, above chance but substantially below the supervised MiniCheck-Flan-T5-L ($69.9\%$/$74.3\%$). The gap has a principled explanation: unlike NMT hallucinations, unfaithful summaries can attend correctly to source tokens while misrepresenting their content, a failure mode that is invisible to concentration-based OT metrics by construction. Structural experiments on T5-base confirm consistent decoder organisation across depth, with Layer~3 showing peak concentration and Layer~12 being most critical for generation quality. Together, the results establish that OT on cross-attention is a reliable detector when the failure mode is source disengagement and a principled interpretability tool regardless of task, but is fundamentally limited when faithfulness failures occur downstream of attention.
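To make the idea of a distance between an attention distribution and a uniform reference concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes the input is a vector of non-negative attention mass over source tokens, and it assumes a ground cost equal to the normalised distance between source positions; the function name `wass_to_unif_sketch` and that cost are illustrative choices, and the actual Wass-to-Unif detector may use a different cost.

```python
from itertools import accumulate


def wass_to_unif_sketch(attention_mass):
    """1-D Wasserstein-1 distance between attention mass and a uniform reference.

    Assumption (not taken from the paper): the ground cost between source
    positions i and j is |i - j| / (n - 1), so the result lies in [0, 1].
    """
    n = len(attention_mass)
    if n < 2:
        return 0.0  # a single position is trivially uniform
    total = sum(attention_mass)
    if total <= 0:
        raise ValueError("attention mass must have a positive sum")

    # Normalise to a probability distribution over source positions.
    p = [a / total for a in attention_mass]

    # In one dimension, W1 equals the sum of absolute CDF differences
    # across the gaps between neighbouring positions.
    cdf_p = list(accumulate(p))
    cdf_u = [(i + 1) / n for i in range(n)]
    gap_sum = sum(abs(a - b) for a, b in zip(cdf_p[:-1], cdf_u[:-1]))
    return gap_sum / (n - 1)


if __name__ == "__main__":
    spread = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly over the source
    peaked = [1.0, 0.0, 0.0, 0.0]       # all attention on a single source token
    print(wass_to_unif_sketch(spread), wass_to_unif_sketch(peaked))
```

Under these assumptions, evenly spread attention scores zero and attention concentrated on a single token scores higher, which is the sense in which such a metric is concentration-based: it sees where attention mass falls, not whether the attended content is represented correctly.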
