How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model development. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond to inquiries about a given passage. Capable of interpreting open-world concepts in hidden embeddings, SelfIE reveals an LLM's internal reasoning in cases such as making ethical decisions, internalizing prompt injection, and recalling harmful knowledge. SelfIE's text descriptions of hidden embeddings also open up new avenues for controlling LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while requiring gradient computation for only an individual layer. We also extend RLHF to hidden embeddings and propose Reinforcement Control, which erases harmful knowledge in LLMs without supervision targets.