Large language models (LLMs) have recently shown impressive performance on tasks involving reasoning, prompting a lively debate on whether these models possess reasoning capabilities similar to those of humans. Despite these successes, however, the depth of LLMs' reasoning abilities remains uncertain. This uncertainty partly stems from the predominant focus on task performance, measured through shallow accuracy metrics, rather than a thorough investigation of the models' reasoning behavior. This paper seeks to address this gap by providing a comprehensive review of studies that go beyond task accuracy, offering deeper insights into the models' reasoning processes. Furthermore, we survey prevalent methodologies for evaluating the reasoning behavior of LLMs, emphasizing current trends and efforts towards more nuanced reasoning analyses. Our review suggests that LLMs tend to rely on surface-level patterns and correlations in their training data rather than on sophisticated reasoning abilities. Additionally, we identify the need for further research that delineates the key differences between human reasoning and LLM-based reasoning. Through this survey, we aim to shed light on the complex reasoning processes within LLMs.