We propose an auditing method to identify whether a large language model (LLM) encodes patterns such as hallucinations in its internal states, which may propagate to downstream tasks. The method is a weakly supervised technique that uses subset scanning to detect anomalous patterns in the activations of pre-trained LLMs. Importantly, it does not require a priori knowledge of the type of pattern; instead, it relies only on a reference dataset that is free of anomalies at test time. Our approach also identifies the pivotal nodes responsible for encoding these patterns, which may offer crucial insights for fine-tuning specific sub-networks for bias mitigation. We introduce two new scanning methods that handle LLM activations for anomalous sentences that may deviate from the expected distribution in either direction. Our results confirm prior findings that BERT has limited internal capacity for encoding hallucinations, while OPT appears capable of encoding hallucination information internally. Notably, our scanning approach, without prior exposure to false statements, performs comparably to a fully supervised out-of-distribution classifier.
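To make the idea of subset scanning over activations concrete, the following is a minimal sketch and not the authors' implementation. It assumes activations are already extracted as NumPy arrays, converts each node's activation into a two-sided empirical p-value against the anomaly-free reference set (so deviations in either direction count), and then searches for the subset of nodes that maximizes a scoring function. The Berk-Jones statistic used here is an assumption borrowed from the general nonparametric scan statistics literature; the abstract does not name the scoring function, and the function names `empirical_pvalues`, `berk_jones`, and `scan_nodes` are hypothetical.

```python
import numpy as np


def empirical_pvalues(reference, test):
    """Two-sided empirical p-value per node (assumed formulation).

    reference: (n_ref, n_nodes) activations from anomaly-free sentences
    test:      (n_nodes,) activations for the sentence under audit
    """
    n_ref = reference.shape[0]
    # Fraction of reference activations at least as extreme on each side,
    # with +1 smoothing so that no p-value is exactly zero.
    right = (np.sum(reference >= test, axis=0) + 1) / (n_ref + 1)
    left = (np.sum(reference <= test, axis=0) + 1) / (n_ref + 1)
    # Doubling the smaller tail lets a deviation in either direction count.
    return np.minimum(1.0, 2.0 * np.minimum(left, right))


def berk_jones(n_alpha, n, alpha):
    """Berk-Jones score: n * KL(observed fraction || alpha), zero if no excess."""
    frac = n_alpha / n
    if frac <= alpha:
        return 0.0
    if frac >= 1.0:
        return n * np.log(1.0 / alpha)
    return n * (frac * np.log(frac / alpha)
                + (1.0 - frac) * np.log((1.0 - frac) / (1.0 - alpha)))


def scan_nodes(pvalues, alpha_max=0.5):
    """Return (score, node indices) of the highest-scoring node subset.

    For one sentence, the best subset at a threshold alpha is every node
    whose p-value is <= alpha, so only the distinct p-values need checking.
    """
    best_score, best_alpha = 0.0, None
    for alpha in np.unique(pvalues):  # np.unique returns sorted values
        if alpha > alpha_max:
            break
        k = int(np.sum(pvalues <= alpha))
        score = berk_jones(k, k, alpha)
        if score > best_score:
            best_score, best_alpha = score, alpha
    if best_alpha is None:
        return 0.0, np.array([], dtype=int)
    return best_score, np.flatnonzero(pvalues <= best_alpha)
```

In this sketch, a sentence is flagged when its score is high relative to scores obtained on held-out clean sentences, and the returned indices play the role of the pivotal nodes mentioned above. The threshold `alpha_max` is a tunable assumption, not a value taken from the paper.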