Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question has been read but before any tokens are generated, and train linear probes to predict whether the model's forthcoming answer will be correct. Across three open-source model families ranging from 7 to 70 billion parameters, projections onto this "in-advance correctness direction", trained on generic trivia questions, predict success both in distribution and on diverse out-of-distribution knowledge datasets, indicating a signal deeper than dataset-specific spurious features, and outperforming black-box baselines and verbalised confidence. Predictive power saturates in intermediate layers and, notably, generalisation falters on questions requiring mathematical reasoning. Moreover, when a model responds "I don't know", the decision to do so correlates strongly with the probe score, indicating that the same direction also captures confidence. By complementing previous results on truthfulness and other behaviours obtained with probes and sparse auto-encoders, our work contributes key findings toward elucidating LLM internals.
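To make the probing setup concrete, the sketch below shows one plausible way to extract last-prompt-token activations from an intermediate layer and fit a linear probe on correctness labels. This is an illustrative reconstruction, not the paper's released code: the model name, the layer index `LAYER`, and the labelled `dataset` of (question, was_correct) pairs are all assumptions introduced here for demonstration.

```python
# Minimal sketch of an in-advance correctness probe, assuming a
# HuggingFace causal LM and a pre-graded set of (question, label) pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any open-source causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

LAYER = 16  # hypothetical intermediate layer, where predictive power saturates

def last_token_activation(question: str) -> torch.Tensor:
    """Hidden state at the final prompt token, before any tokens are generated."""
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER selects
    # the output of transformer block LAYER.
    return out.hidden_states[LAYER][0, -1].float().cpu()

# `dataset` is assumed: an iterable of (question, was_correct) pairs, where
# was_correct is a 0/1 label obtained by grading the model's own answers.
X = torch.stack([last_token_activation(q) for q, _ in dataset]).numpy()
y = [int(c) for _, c in dataset]

# The linear probe; its weight vector plays the role of the
# "in-advance correctness direction" described in the abstract.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("in-distribution accuracy:", probe.score(X, y))
```

Under this reading, out-of-distribution generalisation would be tested by scoring the same fitted `probe` on activations from held-out knowledge datasets rather than refitting it.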