Language models (LMs) have been reported to implicitly encode character-level information, even though such information is not explicitly provided during training. However, the mechanisms underlying this phenomenon remain largely unexplored. To reveal these mechanisms, we analyze how models acquire character-level knowledge by comparing LMs trained under controlled settings, in which the pre-training dataset or tokenizer is specified, with those trained under standard settings. We categorize the contributing factors into those arising from tokenization and those independent of it. Our analysis reveals that merge rules and orthographic constraints constitute the primary factors arising from tokenization, whereas the semantic associations of substrings and syntactic information serve as key factors independent of tokenization.