Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) suggest that they parse text when predicting the next word, while representing semantic notions independently of surface form. Yet, which data statistics make these feats possible, and how much data is required, remain largely unknown. Probabilistic context-free grammars (PCFGs) provide a tractable testbed for studying these questions. However, prior work has focused either on the post-hoc characterization of the parsing-like algorithms used by trained networks, or on the learnability of PCFGs with fixed syntax, where parsing is unnecessary. Here, we (i) introduce a tunable class of PCFGs in which both the degree of ambiguity and the correlation structure across scales can be controlled; (ii) provide a learning mechanism -- an inference algorithm inspired by the structure of deep convolutional networks -- that links learnability and sample complexity to specific language statistics; and (iii) validate our predictions empirically across deep convolutional and transformer-based architectures. Overall, we propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.
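To make the notion of a tunable degree of ambiguity concrete, here is a minimal, hypothetical sketch of a toy PCFG sampler in which a single `ambiguity` knob controls how often a surface word fails to identify its parent nonterminal. This is an illustration only, not the paper's actual construction; the names and parameters (`make_grammar`, `sample_sentence`, `ambiguity`) are invented for the example.

```python
import random

def make_grammar(ambiguity=0.5, seed=0):
    """Build a tiny two-level PCFG (a sketch, not the paper's grammar).
    `ambiguity` in [0, 1] is a hypothetical knob: with that probability,
    a nonterminal also reuses another nonterminal's terminal symbols,
    so the surface word no longer uniquely identifies its parent."""
    rng = random.Random(seed)
    nonterminals = ["A", "B", "C", "D"]
    # Each nonterminal owns a private pool of terminal symbols.
    own_terminals = {nt: [f"{nt.lower()}{i}" for i in range(3)] for nt in nonterminals}
    rules = {}
    for nt in nonterminals:
        pool = list(own_terminals[nt])
        if rng.random() < ambiguity:
            other = rng.choice([x for x in nonterminals if x != nt])
            pool += own_terminals[other]  # shared terminals -> local ambiguity
        rules[nt] = pool
    # Root rule: S expands into an (a, b) pair of nonterminals, so the two
    # positions in a sentence are correlated through the root choice.
    rules["S"] = [(a, b) for a in nonterminals for b in nonterminals]
    return rules

def sample_sentence(rules, rng):
    """Sample one two-word sentence by expanding S and then each child."""
    a, b = rng.choice(rules["S"])
    return [rng.choice(rules[a]), rng.choice(rules[b])]

if __name__ == "__main__":
    rng = random.Random(1)
    grammar = make_grammar(ambiguity=0.8)
    for _ in range(5):
        print(" ".join(sample_sentence(grammar, rng)))
```

In this toy setting, raising `ambiguity` makes more surface words compatible with several nonterminals, so recovering the hidden symbols must rely on correlations with the other position in the sentence; this is the kind of effect, at a single scale, that the abstract's framework generalizes to correlations across scales.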