We design probes, trained on the internal representations of a transformer language model, that predict its hallucinatory behavior on in-context generation tasks. To facilitate this detection, we create a span-annotated dataset of organic and synthetic hallucinations over several tasks. We find that probes trained on the force-decoded states of synthetic hallucinations are generally ecologically invalid for organic hallucination detection. Furthermore, hidden state information about hallucination appears to be task- and distribution-dependent. The saliency of intrinsic and extrinsic hallucinations varies across layers, hidden state types, and tasks; notably, extrinsic hallucinations tend to be more salient in a transformer's internal representations. Our probes outperform multiple contemporary baselines, showing that probing is a feasible and efficient alternative for language model hallucination evaluation when model states are available.
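To make the idea of a probe concrete, the following is a minimal sketch of a linear probe: a logistic-regression classifier that maps one token's hidden state to a hallucination label. It is not the architecture used in this work, which the abstract does not specify. The hidden states here are synthetic Gaussian vectors standing in for real model states, and all names (`make_synthetic_states`, `train_linear_probe`, `probe_accuracy`), the dimensionality, and the hyperparameters are illustrative assumptions.

```python
import math
import random


def make_synthetic_states(n, dim, shift, rng):
    """Return (states, labels) as stand-ins for per-token hidden states.

    Label 1 marks a token inside a hallucinated span. Real states would be
    read from a model's layers; here they are Gaussian vectors whose mean is
    shifted for the hallucinated class (an assumption for illustration).
    """
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        states.append([rng.gauss(shift * y, 1.0) for _ in range(dim)])
        labels.append(y)
    return states, labels


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(states, labels, epochs=20, lr=0.1):
    """Fit logistic-regression weights with plain stochastic gradient descent."""
    w = [0.0] * len(states[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(states, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def probe_accuracy(w, b, states, labels):
    """Fraction of tokens whose predicted label matches the annotation."""
    correct = 0
    for x, y in zip(states, labels):
        logit = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((1 if logit > 0 else 0) == y)
    return correct / len(labels)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
train_x, train_y = make_synthetic_states(400, 8, 1.0, rng)
test_x, test_y = make_synthetic_states(200, 8, 1.0, rng)

w, b = train_linear_probe(train_x, train_y)
accuracy = probe_accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In a real setting, the synthetic vectors would be replaced by hidden states taken from a chosen layer of the model, and the labels by span annotations. Training on one source of hallucinations (for example synthetic) and evaluating on another (organic) is the kind of comparison the abstract describes.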