Pre-trained language models have been shown to encode linguistic structures, such as dependency and constituency parse trees, in their embeddings, even though they are trained with unsupervised objectives like masked language modeling. Doubts have been raised about whether these models actually perform parsing or only some computation weakly correlated with it. We study two questions: (a) Is it possible to explicitly describe transformers with realistic embedding dimension, number of heads, etc. that are capable of parsing, or even approximate parsing? (b) Why do pre-trained models capture parsing structure? This paper takes a step toward answering these questions in the context of generative modeling with PCFGs. We show that masked language models of moderate size, like BERT or RoBERTa, can approximately execute the Inside-Outside algorithm for the English PCFG [Marcus et al., 1993]. We also show that the Inside-Outside algorithm is optimal for the masked language modeling loss on PCFG-generated data. We further give a construction of transformers with $50$ layers, $15$ attention heads, and $1275$-dimensional embeddings on average, such that its embeddings can be used for constituency parsing with $>70\%$ F1 score on the PTB dataset. Finally, we conduct probing experiments on models pre-trained on PCFG-generated data and show that these models allow recovery not only of an approximate parse tree but also of the marginal span probabilities computed by the Inside-Outside algorithm, which suggests an implicit bias of masked language modeling toward this algorithm.
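For readers unfamiliar with the Inside-Outside algorithm referred to above, the following is a minimal sketch of how it yields marginal span probabilities for a PCFG in Chomsky normal form. It is the textbook dynamic program, not the transformer construction from this paper; the toy grammar, the function name `inside_outside`, and the dictionary-based rule encoding are all illustrative assumptions.

```python
from collections import defaultdict

def inside_outside(words, lexical, binary, root="S"):
    """Textbook Inside-Outside for a PCFG in Chomsky normal form.

    lexical: {(A, word): prob} for rules A -> word   (hypothetical encoding)
    binary:  {(A, B, C): prob} for rules A -> B C    (hypothetical encoding)
    Returns the sentence probability Z and {(i, j): marginal probability
    that words[i:j] is a constituent}.
    """
    n = len(words)
    inside = defaultdict(float)   # (i, j, A) -> P(A derives words[i:j])
    outside = defaultdict(float)  # (i, j, A) -> P(context around A over [i, j))

    # Inside pass, bottom-up: width-1 spans come from lexical rules.
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                inside[i, i + 1, A] += p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for (A, B, C), p in binary.items():
                for k in range(i + 1, j):
                    inside[i, j, A] += p * inside[i, k, B] * inside[k, j, C]

    # Outside pass, top-down: each parent span pushes mass to its children.
    outside[0, n, root] = 1.0
    for width in range(n, 1, -1):
        for i in range(n - width + 1):
            j = i + width
            for (A, B, C), p in binary.items():
                out = outside[i, j, A]
                if out == 0.0:
                    continue
                for k in range(i + 1, j):
                    outside[i, k, B] += out * p * inside[k, j, C]
                    outside[k, j, C] += out * p * inside[i, k, B]

    z = inside[0, n, root]
    if z == 0.0:
        raise ValueError("sentence has zero probability under the grammar")

    # Marginal span probability: sum over nonterminals of inside * outside / Z.
    span_marginals = defaultdict(float)
    for (i, j, A), p_in in list(inside.items()):
        span_marginals[i, j] += p_in * outside[i, j, A] / z
    return z, dict(span_marginals)

# Toy grammar (an assumption for illustration): S -> S S [0.3] | 'a' [0.7]
toy_lexical = {("S", "a"): 0.7}
toy_binary = {("S", "S", "S"): 0.3}
z, span_marginals = inside_outside(["a", "a", "a"], toy_lexical, toy_binary)
```

Under this toy grammar the sentence `a a a` has exactly two equally likely parses, so the spans `(0, 2)` and `(1, 3)` each receive marginal probability 0.5, while the full span and every single-word span receive probability 1.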