LLMs process text as sequences of tokens that roughly correspond to words, with less common words represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words or concepts they make up. For example, Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantically meaningful units like "north" or "east." Similarly, the overall meanings of named entities like "Neil Young" and multi-word expressions like "break a leg" cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last-token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, in which information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to "read out" the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.
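The layer-wise "read-out" idea can be illustrated with a minimal sketch on synthetic vectors. The `erasure_score` below, one minus the cosine similarity between a token position's layer-0 embedding and its layer-ℓ hidden state, is a hypothetical stand-in for the paper's actual metric, and the synthetic hidden states are not real model activations; this only shows the shape of a cross-layer comparison, not the method itself.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def erasure_score(hidden, layer):
    """Hypothetical score: drop in similarity between a token position's
    representation at `layer` and its layer-0 embedding. A large drop
    would suggest surface-level token identity has been 'erased'."""
    return 1.0 - cosine(hidden[0], hidden[layer])

rng = np.random.default_rng(0)
d, n_layers = 16, 8

# Synthetic per-layer representations for one token position: start from
# an embedding and add progressively more noise, loosely mimicking the
# rapid forgetting of token-level information described in the abstract.
emb = rng.normal(size=d)
layers = [emb]
for _ in range(1, n_layers):
    layers.append(layers[-1] + rng.normal(scale=0.5, size=d))
hidden = np.stack(layers)  # shape: (n_layers, d)

scores = [erasure_score(hidden, l) for l in range(n_layers)]
# scores[0] is ~0 by construction; deeper layers drift away from the
# layer-0 embedding, so later scores grow.
```

In the real setting, `hidden` would be the stacked residual-stream states for the last token of a candidate multi-token sequence, and sequences whose scores spike in early layers would be flagged as implicit vocabulary items.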