Large Language Models (LLMs) frequently exhibit strong translation abilities, even without task-specific fine-tuning. However, the internal mechanisms governing this innate capability remain largely opaque. To demystify this process, we leverage Sparse Autoencoders (SAEs) and introduce a novel framework for identifying task-specific features. Our method first recalls features that are frequently co-activated on translation inputs and then filters them for functional coherence using a PCA-based consistency metric. This framework successfully isolates a small set of **translation initiation** features. Causal interventions demonstrate that amplifying these features steers the model toward correct translation, while ablating them induces hallucinations and off-task outputs, confirming that they represent a core component of the model's innate translation competency. Moving from analysis to application, we leverage this mechanistic insight to propose a new data selection strategy for efficient fine-tuning. Specifically, we prioritize training on **mechanistically hard** samples: those that fail to naturally activate the translation initiation features. Experiments show that this approach significantly improves data efficiency and suppresses hallucinations. Furthermore, we find these mechanisms are transferable to larger models of the same family. Our work not only decodes a core component of the translation mechanism in LLMs but also provides a blueprint for using internal model mechanisms to create more robust and efficient models. The code is available at https://github.com/flamewei123/AAAI26-translation-Initiation-Features.
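The recall-then-filter pipeline described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: all names (`recall_candidate_features`, `pca_consistency`, the activation-rate and consistency thresholds) and the exact form of the PCA-based metric are assumptions made for illustration.

```python
import numpy as np


def recall_candidate_features(acts, activation_rate=0.9):
    """Recall stage: keep SAE features that fire on most translation inputs.

    acts: (n_samples, n_features) SAE feature activations on translation prompts.
    """
    rates = (acts > 0).mean(axis=0)  # per-feature activation frequency
    return np.where(rates >= activation_rate)[0]


def pca_consistency(vectors, k=1):
    """Fraction of variance captured by the top-k principal components.

    A high value suggests the feature's activating contexts lie near a
    low-dimensional subspace, i.e. it behaves in a functionally coherent way.
    """
    centered = vectors - vectors.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values
    var = s ** 2
    return var[:k].sum() / var.sum()


def select_translation_features(acts, hidden_states, rate=0.9, consistency=0.5):
    """Filter stage: keep recalled features whose activating samples are coherent.

    hidden_states: (n_samples, d_model) model hidden states, one per sample.
    """
    selected = []
    for f in recall_candidate_features(acts, rate):
        mask = acts[:, f] > 0
        if mask.sum() < 2:  # need at least two points for PCA
            continue
        if pca_consistency(hidden_states[mask]) >= consistency:
            selected.append(int(f))
    return selected
```

The same machinery suggests the data-selection strategy: samples on which `recall_candidate_features`-style activations are absent for the identified features would be flagged as mechanistically hard and prioritized for fine-tuning.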