Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to "representational interference" in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer's compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
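To make the core idea concrete, below is a minimal sketch of how frozen, glyph-derived embeddings could be wired into an otherwise standard Transformer input pipeline. This is an illustration under stated assumptions, not the paper's actual implementation: the bitmap size, the `render_glyph` and `build_frozen_visual_embedding` helpers, the fixed random projection to the model width, and the toy ASCII vocabulary are all hypothetical choices made for the example.

```python
# Minimal sketch (not the paper's exact pipeline): build a frozen "visual" embedding
# table by rendering each token's Unicode glyph to a small bitmap, flattening it, and
# mapping it to the model width with a fixed (non-trainable) projection.
import torch
import torch.nn as nn
from PIL import Image, ImageDraw

def render_glyph(ch: str, size: int = 16) -> torch.Tensor:
    """Render one character to a size x size grayscale bitmap, flattened to a vector."""
    img = Image.new("L", (size, size), color=0)
    draw = ImageDraw.Draw(img)
    # Default bitmap font for simplicity; a real pipeline would use a font with full Unicode coverage.
    draw.text((0, 0), ch, fill=255)
    return torch.tensor(list(img.getdata()), dtype=torch.float32) / 255.0

def build_frozen_visual_embedding(vocab: list[str], d_model: int, size: int = 16) -> nn.Embedding:
    """Precompute glyph bitmaps for every token and freeze them as the embedding table."""
    table = torch.stack([render_glyph(ch, size) for ch in vocab])   # (V, size*size)
    # Fixed random projection to d_model: keeps the table non-trainable and non-semantic.
    proj = torch.randn(size * size, d_model) / size
    return nn.Embedding.from_pretrained(table @ proj, freeze=True)  # freeze=True: no gradient updates

# Usage: toy vocabulary (printable ASCII subset of Unicode) feeding a model of width 64.
vocab = [chr(cp) for cp in range(0x20, 0x7F)]
embedding = build_frozen_visual_embedding(vocab, d_model=64)
ids = torch.tensor([vocab.index(c) for c in "hello"])
x = embedding(ids)  # (5, 64) fixed, purely visual input vectors for the Transformer stack
```

In this sketch the only design constraint taken from the abstract is that the embedding vectors are precomputed from glyph appearance and never updated during training; everything downstream of `embedding` would be an ordinary trainable Transformer.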