Advances in computational methods and the availability of big data have recently translated into breakthroughs in AI applications. With successes in bottom-up challenges partially overshadowing shortcomings, the 'human-like' performance of Large Language Models (LLMs) has raised the question of how algorithms achieve linguistic performance. Given systematic shortcomings in generalization across many AI systems, in this work we ask whether linguistic performance in LLMs is indeed guided by knowledge of language. To this end, we prompt GPT-3 with a grammaticality judgement task and comprehension questions on less frequent constructions that are thus unlikely to form part of LLMs' training data. These include grammatical 'illusions', semantic anomalies, complex nested hierarchies and self-embeddings. GPT-3 failed on every prompt but one, often offering answers that show a critical lack of understanding even of high-frequency words used in these less frequent grammatical constructions. The present work sheds light on the boundaries of the alleged human-like linguistic competence of AI and argues that, far from being human-like, the next-word prediction abilities of LLMs may face robustness issues when pushed beyond their training data.