Neural network language models can serve as computational hypotheses about how humans process language. We compared the model-human consistency of diverse language models using a novel experimental approach: controversial sentence pairs. For each controversial sentence pair, two language models disagree about which sentence is more likely to occur in natural text. Considering nine language models (including n-gram models, recurrent neural networks, and transformers), we created hundreds of such controversial sentence pairs by either selecting sentences from a corpus or synthetically optimizing sentence pairs to be highly controversial. Human subjects then judged, for each pair, which of the two sentences is more likely. Controversial sentence pairs proved highly effective at revealing model failures and identifying the models that aligned most closely with human judgments. The most human-consistent model tested was GPT-2, although the experiments also revealed significant shortcomings in its alignment with human perception.
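The idea of selecting controversial pairs from a corpus can be illustrated with a minimal sketch. It assumes each model has already assigned a log-probability to every candidate sentence. The scoring rule (the smaller of the two models' opposing log-probability margins), the function names, and the toy numbers are all illustrative assumptions, not the procedure or data reported in the study.

```python
from itertools import combinations


def controversiality(lp_a, lp_b, s1, s2):
    """Hypothetical score: positive only if model A prefers s1 AND model B prefers s2.

    lp_a and lp_b map each sentence to its log-probability under model A and model B.
    """
    return min(lp_a[s1] - lp_a[s2], lp_b[s2] - lp_b[s1])


def find_controversial_pairs(lp_a, lp_b, top_k=1):
    """Return the top_k (score, A_preferred, B_preferred) triples, most controversial first."""
    scored = []
    for s1, s2 in combinations(lp_a, 2):
        # Check both orientations, since either model may favor either sentence.
        for x, y in ((s1, s2), (s2, s1)):
            score = controversiality(lp_a, lp_b, x, y)
            if score > 0:
                scored.append((score, x, y))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]


def human_consistency(lp, trials):
    """Fraction of trials where the model's preferred sentence matches the human choice.

    Each trial is (sentence_1, sentence_2, human_choice), with human_choice
    equal to one of the two sentences.
    """
    hits = 0
    for s1, s2, chosen in trials:
        model_choice = s1 if lp[s1] > lp[s2] else s2
        hits += model_choice == chosen
    return hits / len(trials)


# Made-up log-probabilities, for illustration only.
lp_model_a = {"sent_1": -10.0, "sent_2": -14.0, "sent_3": -12.0}
lp_model_b = {"sent_1": -15.0, "sent_2": -11.0, "sent_3": -12.5}

pairs = find_controversial_pairs(lp_model_a, lp_model_b, top_k=2)

# Made-up human choices for the selected pairs: here humans side with model A once.
trials = [("sent_1", "sent_2", "sent_1"), ("sent_1", "sent_3", "sent_3")]
consistency_a = human_consistency(lp_model_a, trials)
consistency_b = human_consistency(lp_model_b, trials)
```

Because the two models make opposite predictions on every selected pair, each human judgment necessarily counts for one model and against the other, which is what makes such pairs informative for comparing models.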