Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance, and that the emergence of human-meaningful linguistic units among tokens motivates linguistically-informed interventions in existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) semantic primitives and as (2) vehicles for conveying salient distributional patterns from human language to the model. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating suboptimal semantic building blocks and obscuring the model's access to the necessary distributional patterns, we describe how tokenization pretraining can be a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being meaningfully insulated from the main system intelligence.
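As a minimal illustration of the kind of tokenizer probing described above (a sketch, not code from the paper), the snippet below segments the same string with a Hugging Face BPE tokenizer and a tiktoken encoding, exposing the subword units each vocabulary produces. The model name "roberta-large" and the encoding "cl100k_base" are illustrative choices, not necessarily the paper's exact setup.

```python
# Sketch: compare BPE segmentations from Hugging Face and tiktoken.
# Assumes `transformers` and `tiktoken` are installed.
from transformers import AutoTokenizer
import tiktoken

text = "Tokenization is an oft-overlooked appetizer."

# Hugging Face BPE tokenizer (RoBERTa uses a byte-level BPE vocabulary).
hf_tok = AutoTokenizer.from_pretrained("roberta-large")
print(hf_tok.tokenize(text))  # subword pieces, e.g. ['Token', 'ization', ...]

# tiktoken BPE encoding; decode each id back to its string form.
enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode([t]) for t in enc.encode(text)])
```

Comparing the two segmentations side by side makes concrete how linguistically-agnostic merge rules split the same words into different, sometimes non-morphemic, pieces.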
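The abstract also mentions tracing exemplar token vectors through the layers of RoBERTa (large). The sketch below shows one way such layer-wise information could be extracted with the `transformers` API; the probe sentence, the choice of token index, and the cosine-similarity measure are illustrative assumptions, not the paper's stated methodology.

```python
# Sketch: follow one token's hidden-state vector across RoBERTa-large layers.
# Assumes PyTorch and `transformers` are installed.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large", output_hidden_states=True)
model.eval()

inputs = tok("Tokenization shapes what the model can see.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states: tuple of (num_layers + 1) tensors, each (batch, seq, hidden);
# index 0 is the embedding layer, indices 1..24 are transformer layers.
tok_idx = 1  # first content token after <s>; illustrative choice
vecs = [h[0, tok_idx] for h in out.hidden_states]
for layer, (a, b) in enumerate(zip(vecs, vecs[1:])):
    sim = torch.cosine_similarity(a, b, dim=0).item()
    print(f"layer {layer} -> {layer + 1}: cosine similarity {sim:.3f}")
```

Printing layer-to-layer similarities of this kind gives a rough picture of how much a token's representation is transformed as it moves through the network.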