Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model's cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is sufficient for reasonably human-like language performance, and that the emergence of human-meaningful linguistic units among tokens motivates linguistically-informed interventions in existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) semantic primitives and as (2) vehicles for conveying salient distributional patterns from human language to the model. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating suboptimal semantic building blocks and obscuring the model's access to the necessary distributional patterns, we describe how tokenization pretraining can be a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm's objective function impacts the LLM's cognition, despite being meaningfully insulated from the main system intelligence.
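As a minimal illustration of the kind of tokenizer probing described above (a sketch, not code from the paper), the snippet below segments the same string with a Hugging Face BPE tokenizer and a tiktoken encoding, exposing the subword units each vocabulary produces. The model name "roberta-large" and the encoding "cl100k_base" are illustrative choices, not necessarily the paper's exact setup.

```python
# Sketch: compare BPE segmentations from Hugging Face and tiktoken.
# Assumes `transformers` and `tiktoken` are installed.
from transformers import AutoTokenizer
import tiktoken

text = "Tokenization is an oft-overlooked appetizer."

# Hugging Face BPE tokenizer (RoBERTa uses a byte-level BPE vocabulary).
hf_tok = AutoTokenizer.from_pretrained("roberta-large")
print(hf_tok.tokenize(text))  # subword pieces, e.g. ['Token', 'ization', ...]

# tiktoken BPE encoding; decode each id back to its string form.
enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode([t]) for t in enc.encode(text)])
```

Comparing the two segmentations side by side makes concrete how linguistically-agnostic merge rules split the same words into different, sometimes non-morphemic, pieces.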
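The abstract also mentions tracing exemplar token vectors through the layers of RoBERTa (large). The sketch below shows one way such layer-wise information could be extracted with the `transformers` API; the probe sentence, the choice of token index, and the cosine-similarity measure are illustrative assumptions, not the paper's stated methodology.

```python
# Sketch: follow one token's hidden-state vector across RoBERTa-large layers.
# Assumes PyTorch and `transformers` are installed.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large", output_hidden_states=True)
model.eval()

inputs = tok("Tokenization shapes what the model can see.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states: tuple of (num_layers + 1) tensors, each (batch, seq, hidden);
# index 0 is the embedding layer, indices 1..24 are transformer layers.
tok_idx = 1  # first content token after <s>; illustrative choice
vecs = [h[0, tok_idx] for h in out.hidden_states]
for layer, (a, b) in enumerate(zip(vecs, vecs[1:])):
    sim = torch.cosine_similarity(a, b, dim=0).item()
    print(f"layer {layer} -> {layer + 1}: cosine similarity {sim:.3f}")
```

Printing layer-to-layer similarities of this kind gives a rough picture of how much a token's representation is transformed as it moves through the network.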