This paper introduces Filtered Corpus Training, a method that trains language models (LMs) on corpora from which certain linguistic constructions have been filtered out, and uses it to measure the ability of LMs to generalize linguistically on the basis of indirect evidence. We apply the method to both LSTM and Transformer LMs of roughly comparable size, developing filtered corpora that target a wide range of linguistic phenomena. Our results show that while Transformers are better qua LMs (as measured by perplexity), both models perform equally and surprisingly well on linguistic generalization measures, suggesting that they are capable of generalizing from indirect evidence.
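For concreteness, here is a minimal sketch of the filtering step in Python. The paper's actual filters are construction-specific and more sophisticated; this illustration assumes sentence-level filtering with a toy regex heuristic for relative clauses, which stands in for whatever detector a given phenomenon requires.

```python
import re

# Toy detector for a target construction (here, a crude heuristic for
# relative clauses: a relativizer followed by a word). The paper's real
# filters target specific linguistic phenomena; this regex is purely
# illustrative and would over- and under-match on real text.
TARGET = re.compile(r"\b(?:who|that|which)\s+\w+", re.IGNORECASE)

def filter_corpus(sentences):
    """Return the corpus with sentences containing the target construction removed."""
    return [s for s in sentences if not TARGET.search(s)]

corpus = [
    "The cat sat on the mat.",
    "The senator who met the donors smiled.",  # removed: contains the construction
    "Dogs bark.",
]

filtered = filter_corpus(corpus)
print(filtered)  # ['The cat sat on the mat.', 'Dogs bark.']

# An LM trained on `filtered` never sees the target construction directly,
# so any success on it at test time must come from indirect evidence.
```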