Many current NLP systems are built from language models trained to optimize unsupervised objectives on large amounts of raw text. Under what conditions might such a procedure acquire meaning? Our systematic experiments with synthetic data reveal that, for languages in which all expressions have context-independent denotations (i.e., languages with strong transparency), both autoregressive and masked language models successfully learn to emulate semantic relations between expressions. However, when denotations are made context-dependent while the language is otherwise left unmodified, this ability degrades. Turning to natural language, our experiments with a specific phenomenon, referential opacity, add to the growing body of evidence that current language models do not represent natural language semantics well. We show that this failure relates to the context-dependent nature of natural language form-meaning mappings.