Large language models (LLMs) trained with canonical tokenization exhibit surprising robustness to non-canonical inputs such as character-level tokenization, yet the mechanisms underlying this robustness remain unclear. We study this phenomenon through mechanistic interpretability and identify a core process we term word recovery. We first introduce a decoding-based method to detect word recovery, showing that hidden states reconstruct canonical word-level token identities from character-level inputs. We then provide causal evidence by removing the corresponding subspace from hidden states, which consistently degrades downstream task performance. Finally, we conduct a fine-grained attention analysis and show that in-group attention among characters belonging to the same canonical token is critical for word recovery: masking such attention in early layers substantially reduces both recovery scores and task performance. Together, our findings provide a mechanistic explanation for tokenization robustness and identify word recovery as a key mechanism enabling LLMs to process character-level inputs.