We present evidence that language models can learn meaning despite being trained only to perform next-token prediction on text, specifically a corpus of programs. Each program is preceded by a specification in the form of (textual) input-output examples. Working with programs enables us to precisely define concepts relevant to meaning in language (e.g., correctness and semantics), making program synthesis well-suited as an intermediate testbed for characterizing the presence (or absence) of meaning in language models. We first train a Transformer model on the corpus of programs, then probe the trained model's hidden states as it completes a program given a specification. Although we provide no inductive bias toward learning the semantics of the language, we find that a linear probe is able to extract abstractions of both current and future program states from the model states. Moreover, there is a strong, statistically significant correlation between the accuracy of the probe and the model's ability to generate a program that implements the specification. To evaluate whether the semantics are represented in the model states rather than learned by the probe, we design a novel experimental procedure that intervenes on the semantics of the language while preserving the lexicon and syntax. We also demonstrate that the model learns to generate correct programs that are, on average, shorter than those in the training set, which is evidence that language model outputs may differ from the training distribution in semantically meaningful ways. In summary, this paper does not propose any new techniques for training language models, but develops an experimental framework for, and provides insights into, the acquisition and representation of (formal) meaning in language models.
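To make the notion of a linear probe concrete, the following is a minimal sketch and not the setup used in this work. It assumes synthetic vectors in place of real Transformer hidden states, a binary label in place of an abstract program state, and a perceptron update in place of whatever probe training procedure the experiments use. All function names and parameters (`make_synthetic_states`, `train_linear_probe`, `probe_accuracy`) are hypothetical and chosen only for illustration.

```python
import random


def make_synthetic_states(n, dim, seed=0):
    """Hypothetical stand-in for hidden states: a binary label is linearly
    encoded along one fixed random direction, with Gaussian noise added."""
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        state = [sign * d + rng.gauss(0, 0.5) for d in direction]
        data.append((state, label))
    return data


def predict(w, b, x):
    """Linear decision rule: one weight vector and one bias, no hidden layer."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_linear_probe(data, epochs=20, lr=0.1):
    """Fit the probe with perceptron updates (an assumption of this sketch)."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(w, b, x)
            if err != 0:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b


def probe_accuracy(w, b, data):
    """Fraction of states whose label the probe recovers."""
    return sum(1 for x, y in data if predict(w, b, x) == y) / len(data)


if __name__ == "__main__":
    data = make_synthetic_states(400, 16, seed=0)
    train, held_out = data[:300], data[300:]
    w, b = train_linear_probe(train)
    print("held-out probe accuracy:", probe_accuracy(w, b, held_out))
```

High held-out accuracy in a sketch like this shows only that the label is linearly decodable from the vectors. It does not by itself separate what the states encode from what the probe learns, which is the distinction the intervention on the language's semantics is designed to address.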