Mechanistic Interpretability (MI) seeks to explain how neural networks implement their capabilities, but the scale of Large Language Models (LLMs) has limited prior MI work in Machine Translation (MT) to word-level analyses. We study sentence-level MT from a mechanistic perspective by analyzing attention heads to understand how LLMs internally encode and distribute translation functions. We decompose MT into two subtasks: producing text in the target language (i.e., target-language identification) and preserving the input sentence's meaning (i.e., sentence equivalence). Across three families of open-source models and 20 translation directions, we find that distinct, sparse sets of attention heads specialize in each subtask. Based on this insight, we construct subtask-specific steering vectors and show that modifying just 1% of the relevant heads enables instruction-free MT performance comparable to that of instruction-based prompting, while ablating these heads selectively disrupts their corresponding translation functions.
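To make the head-level intervention concrete, the following is a minimal sketch of steering a single attention head: a fixed vector is added to that head's output slice before the output projection. This is an illustrative toy (numpy, toy dimensions, random weights), not the paper's implementation; the function and parameter names (`attention`, `steer`) are assumptions.

```python
import numpy as np

# Toy sketch of head-level steering (illustrative, not the paper's code).
rng = np.random.default_rng(0)
n_heads, d_head = 4, 8
d_model = n_heads * d_head  # concatenated head dimension

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo, steer=None):
    """Single multi-head attention layer.

    steer = (head_idx, vec): add a steering vector `vec` to head
    `head_idx`'s output slice before the output projection.
    """
    T, _ = x.shape
    # project and split into heads: (n_heads, T, d_head)
    q = (x @ Wq).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(T, n_heads, d_head).transpose(1, 0, 2)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    # concatenate heads back to (T, d_model)
    z = (att @ v).transpose(1, 0, 2).reshape(T, d_model)
    if steer is not None:
        h, vec = steer
        z[:, h * d_head:(h + 1) * d_head] += vec  # intervene on one head only
    return z @ Wo

Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
x = rng.standard_normal((5, d_model))          # 5 token positions
base = attention(x, Wq, Wk, Wv, Wo)
steered = attention(x, Wq, Wk, Wv, Wo, steer=(2, rng.standard_normal(d_head)))
```

Ablating a head can be sketched in the same way by zeroing its slice of `z` instead of adding a vector.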