Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation

Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language, without the intermediate gloss supervision that serves as an alignment signal. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when the video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.
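
To make the idea of combining the two signals concrete, the following is a minimal sketch, not the paper's actual formulation. The abstract does not give the exact definitions, so everything here is an assumption made for illustration: the function names, the use of cosine distance for feature-based sensitivity, the clamped probability difference for the counterfactual signal, the mixing weight `alpha`, and the mean used for sentence-level aggregation.

```python
import math
from typing import Sequence


def feature_sensitivity(h_clean: Sequence[float], h_masked: Sequence[float]) -> float:
    """Assumed sensitivity: cosine distance between a token's decoder state
    computed with the clean video and with the video masked, clamped to [0, 1].
    A value near 0 means the state barely moved when the video was removed."""
    dot = sum(a * b for a, b in zip(h_clean, h_masked))
    norm = math.sqrt(sum(a * a for a in h_clean)) * math.sqrt(sum(b * b for b in h_masked))
    if norm == 0.0:
        return 0.0
    return min(1.0, max(0.0, 1.0 - dot / norm))


def counterfactual_gap(p_clean: float, p_altered: float) -> float:
    """Assumed counterfactual signal: how much probability the emitted token
    loses when the video is altered. Negative differences are clamped to 0."""
    return max(0.0, p_clean - p_altered)


def token_reliability(sensitivity: float, gap: float, alpha: float = 0.5) -> float:
    """Hypothetical combination: a convex mix of the two signals."""
    return alpha * sensitivity + (1.0 - alpha) * gap


def sentence_reliability(token_scores: Sequence[float]) -> float:
    """Hypothetical aggregation: the mean of the token-level scores."""
    if not token_scores:
        return 0.0
    return sum(token_scores) / len(token_scores)


# Toy inputs (made up): one token whose state and probability depend on the
# video, and one token that is unaffected by masking or altering the video.
grounded = token_reliability(
    feature_sensitivity([1.0, 0.0], [0.0, 1.0]),
    counterfactual_gap(0.9, 0.1),
)
guessed = token_reliability(
    feature_sensitivity([1.0, 0.0], [1.0, 0.0]),
    counterfactual_gap(0.8, 0.8),
)
sentence_score = sentence_reliability([grounded, guessed])
```

In this toy setup the visually dependent token receives a higher score than the unaffected one, which is the behavior the abstract describes for grounded versus guessed tokens. A real implementation would take the hidden states and token probabilities from two forward passes of the SLT decoder.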
