Whisper, a widely adopted automatic speech recognition (ASR) model, is known to hallucinate: it generates coherent transcriptions for non-speech audio that are entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated using Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and sparse autoencoder (SAE) latents. We show that both spaces encode linearly separable hallucination-related information, with discriminative power concentrated in a sparse subset of features and increasing toward deeper encoder layers. We propose two steering strategies: activation-space steering and SAE latent-space steering. On the full non-speech test set, SAE-based steering reduces the hallucination rate from 72.63% to 14.11% for Whisper small and from 86.88% to 27.33% for Whisper large-v3, with only a small increase in word error rate (WER) on speech data, approaching the performance of fine-tuning-based methods.
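To make the idea of activation-space steering concrete, the following is a minimal sketch on synthetic vectors, not on real Whisper activations. The abstract does not state how the steering direction is obtained; this sketch assumes a difference-of-means direction between hallucination and non-hallucination examples, which is one common choice. All function names, the toy dimensionality, and the `alpha` strength parameter are hypothetical.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steering_direction(halluc_acts, clean_acts):
    """Unit vector from the clean mean toward the hallucination mean.

    Assumption: difference-of-means, not necessarily the paper's method.
    """
    mu_h = mean_vector(halluc_acts)
    mu_c = mean_vector(clean_acts)
    diff = [h - c for h, c in zip(mu_h, mu_c)]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff], mu_c


def steer(act, direction, target_mean, alpha=1.0):
    """Shift `act` along `direction` toward the clean mean's projection.

    With alpha=1.0 the steered vector has exactly the same projection
    onto `direction` as `target_mean`; other components are unchanged.
    """
    shift = alpha * (dot(act, direction) - dot(target_mean, direction))
    return [a - shift * d for a, d in zip(act, direction)]


# Toy data: the two classes differ only in a sparse subset of features.
rng = random.Random(0)
dim = 8
offset = [3.0 if i < 2 else 0.0 for i in range(dim)]
clean = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
halluc = [[rng.gauss(0, 1) + o for o in offset] for _ in range(200)]

direction, mu_clean = steering_direction(halluc, clean)

# A linear detector: threshold the projection midway between class means.
threshold = 0.5 * (dot(mean_vector(halluc), direction) + dot(mu_clean, direction))
flagged_before = sum(dot(a, direction) > threshold for a in halluc)

steered = [steer(a, direction, mu_clean) for a in halluc]
flagged_after = sum(dot(a, direction) > threshold for a in steered)

print(f"flagged before steering: {flagged_before} / {len(halluc)}")
print(f"flagged after steering:  {flagged_after} / {len(steered)}")
```

In a real setting the vectors would come from a chosen encoder layer, and the steered activations would be written back into the forward pass before decoding. SAE latent-space steering would apply the same kind of shift to selected SAE latents before decoding them back to activation space; that step is not shown here.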