Large Language Models (LLMs) have shown strong performance on a wide variety of natural language processing tasks, ranging from text comprehension to common-sense reasoning. However, the mechanisms responsible for this success remain opaque, and it is unclear whether LLMs can achieve human-like cognitive capabilities or whether they remain fundamentally limited. Abstract reasoning is a fundamental task for cognition, consisting of finding and applying a general pattern from few examples. Evaluating deep neural architectures on this task could give insight into their potential limitations regarding reasoning and broad generalisation, yet this area is currently under-explored. In this paper, we introduce a new benchmark for evaluating language models beyond memorisation on abstract reasoning tasks. We perform extensive evaluations of state-of-the-art LLMs, showing that they currently achieve very limited performance compared with other natural language tasks, even when applying techniques that have been shown to improve performance on other NLP tasks. We argue that guiding LLM generation to follow causal paths could help improve the generalisation and reasoning abilities of LLMs.