Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs. The code is available at https://github.com/Teddy-XiongGZ/RAGLens.