Large Language Models (LLMs) have achieved remarkable success, and instruction tuning is a critical step in aligning them with user intentions. In this work, we investigate how instruction tuning adjusts pre-trained models, focusing on intrinsic changes. Specifically, we first develop several local and global explanation methods, including a gradient-based method for input-output attribution and techniques for interpreting patterns and concepts in self-attention and feed-forward layers. We then study the impact of instruction tuning by comparing the explanations derived from the pre-trained and instruction-tuned models. This approach provides an internal perspective on the model shifts at a human-comprehensible level. Our findings reveal three significant impacts of instruction tuning: 1) It empowers LLMs to recognize the instruction parts of user prompts and promotes response generation that is consistently conditioned on the instructions. 2) It encourages the self-attention heads to capture more word-word relationships involving instruction verbs. 3) It encourages the feed-forward networks to rotate their pre-trained knowledge toward user-oriented tasks. These insights contribute to a more comprehensive understanding of instruction tuning and lay the groundwork for future work aimed at explaining and optimizing LLMs for various applications. Our code and data are publicly available at https://github.com/JacksonWuxs/Interpret_Instruction_Tuning_LLMs.
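To make the idea of gradient-based input-output attribution concrete, the following is a minimal sketch rather than the method used in this work. It assumes a toy differentiable scoring function (`toy_score`, hypothetical) in place of an LLM's logit for one output token, estimates gradients by central finite differences instead of backpropagation, and uses the common gradient-times-input rule. The embeddings, the `readout` weights, and all function names are illustrative assumptions; the exact formulation in the released code may differ.

```python
import math


def toy_score(embeddings, readout):
    """Hypothetical stand-in for a model's logit of one output token.

    Sums the token embeddings, applies tanh, and projects with `readout`.
    """
    dim = len(readout)
    hidden = [sum(emb[j] for emb in embeddings) for j in range(dim)]
    return sum(readout[j] * math.tanh(hidden[j]) for j in range(dim))


def grad_times_input(score_fn, embeddings, eps=1e-5):
    """Attribute a scalar score to each input token via gradient x input.

    Gradients are estimated with central finite differences, so this works
    for any scalar-valued `score_fn` but is only practical for tiny inputs.
    """
    attributions = []
    for i, emb in enumerate(embeddings):
        total = 0.0
        for j, value in enumerate(emb):
            plus = [list(e) for e in embeddings]
            minus = [list(e) for e in embeddings]
            plus[i][j] = value + eps
            minus[i][j] = value - eps
            grad = (score_fn(plus) - score_fn(minus)) / (2 * eps)
            total += grad * value  # gradient x input, summed over dimensions
        attributions.append(total)
    return attributions


# Illustrative inputs only: three "tokens" with made-up 2-d embeddings.
tokens = ["Translate", "this", "sentence"]
embeddings = [[0.9, -0.2], [0.1, 0.05], [0.3, 0.4]]
readout = [1.0, 0.5]

scores = grad_times_input(lambda e: toy_score(e, readout), embeddings)
for token, score in zip(tokens, scores):
    print(f"{token}: {score:.4f}")
```

With a real model, the same comparison would be run on the pre-trained and the instruction-tuned checkpoints, and the two sets of per-token attributions would then be contrasted.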