Over the last five years, research in Natural Language Processing (NLP) has focused heavily on developing larger Pretrained Language Models (PLMs) and on introducing benchmarks such as SuperGLUE and SQuAD to measure their abilities in language understanding, reasoning, and reading comprehension. PLMs have achieved impressive results on these benchmarks, in some cases even surpassing human performance. This has led to claims of superhuman capabilities and to the provocative idea that certain tasks have been solved. In this position paper, we take a critical look at these claims and ask whether PLMs truly have superhuman abilities and what current benchmarks are really evaluating. We show that these benchmarks have serious limitations that affect the comparison between humans and PLMs, and we provide recommendations for fairer and more transparent benchmarks.