Large language models (LLMs) have shown strong performance on a wide variety of natural language processing tasks, ranging from text comprehension to common-sense reasoning. However, the mechanisms responsible for this success remain unknown, and it is unclear whether LLMs can achieve human-like cognitive capabilities or whether these models are still fundamentally limited. Abstract reasoning is a fundamental task for cognition, consisting of finding and applying a general pattern from few examples. Evaluating deep neural architectures on this task could give insight into their potential limitations regarding reasoning and their broad generalisation abilities, yet this area is currently under-explored. In this paper, we perform extensive evaluations of state-of-the-art LLMs on abstract reasoning tasks, showing that they achieve very limited performance compared with their performance on other natural language tasks, and we investigate the reasons for this difference. We apply techniques that have been shown to improve performance on other NLP tasks and show that in most cases their impact on abstract reasoning performance is limited. In the course of this work, we have created a new benchmark for evaluating language models on abstract reasoning tasks.