Inspecting the information encoded in the hidden representations of large language models (LLMs) can help explain model behavior and verify alignment with human values. Because LLMs can generate human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer a wide range of questions about an LLM's computation. We show that prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as instances of this framework. Moreover, several of their shortcomings, such as failure to inspect early layers or lack of expressivity, can be mitigated by Patchscopes. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities, such as using a more capable model to explain the representations of a smaller model, and unlocks new applications, such as self-correction in multi-hop reasoning.
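
To make the two ingredients named above concrete, the following is a minimal sketch on a toy model, not the authors' implementation. It assumes a hypothetical stand-in for an LLM (random weights, a causal running mean in place of attention) and illustrates (a) projecting a hidden representation into the vocabulary space and (b) intervening on the computation by patching a hidden representation from one prompt into the forward pass of another. All names (`run`, `vocab_projection`, `patchscope`, and the toy weights `E`, `LAYERS`, `U`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, N_LAYERS = 10, 8, 4

# Hypothetical toy "model": embedding table, per-layer weights, unembedding matrix.
E = rng.normal(size=(VOCAB, D_MODEL))
LAYERS = [rng.normal(scale=0.3, size=(D_MODEL, D_MODEL)) for _ in range(N_LAYERS)]
U = rng.normal(size=(D_MODEL, VOCAB))


def run(tokens, patch=None):
    """Run the toy model and return (hidden states per layer, final logits).

    patch = (layer, position, vector) overwrites the hidden state at
    `position` right after `layer`, before the computation continues.
    """
    h = E[np.asarray(tokens)]                      # (seq, d_model)
    counts = np.arange(1, len(tokens) + 1)[:, None]
    states = []
    for i, W in enumerate(LAYERS):
        context = np.cumsum(h, axis=0) / counts    # causal mean, stands in for attention
        h = h + np.tanh(context @ W)               # residual update
        if patch is not None and patch[0] == i:
            h[patch[1]] = patch[2]                 # the intervention
        states.append(h)
    return states, h @ U


def vocab_projection(state):
    """Project a single hidden state directly into vocabulary space."""
    return state @ U


def patchscope(source_tokens, src_layer, src_pos, target_tokens, tgt_layer, tgt_pos):
    """Patch one source hidden state into a target run; return next-token logits."""
    src_states, _ = run(source_tokens)
    vector = src_states[src_layer][src_pos]
    _, logits = run(target_tokens, patch=(tgt_layer, tgt_pos, vector))
    return logits[-1]


source_prompt = [1, 2, 3]
target_prompt = [4, 5, 6]
patched_logits = patchscope(source_prompt, 1, 2, target_prompt, 1, 2)
```

In this sketch, `vocab_projection` reads a hidden state out directly, while `patchscope` lets the remaining layers of the target run process the patched state before reading it out. With a real LLM, the target prompt would be chosen so that the continuation describes the patched representation in natural language.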