Recent advancements in Large Language Models (LLMs) are increasingly focused on "reasoning" ability, a concept with many overlapping definitions in the LLM discourse. We take a more structured approach, distinguishing meta-level reasoning (the process of reasoning about the intermediate steps required to solve a task) from object-level reasoning (the low-level execution of those steps). We design a novel question-answering task based on the values of geopolitical indicators for a range of countries and years. Answering a question requires decomposition into intermediate steps, retrieval of data, and mathematical operations over that data. We analyse the meta-level reasoning ability of LLMs by examining their selection of appropriate tools for answering questions. To bring greater depth to the analysis of LLMs beyond final-answer accuracy, our task specifies 'essential actions' against which we compare the tool-call output of LLMs to infer the strength of their reasoning. We find that LLMs demonstrate good meta-level reasoning on our task, yet are flawed in some aspects of task understanding. We find that n-shot prompting has little effect on accuracy and that encountered error messages do not often degrade performance, and we provide additional evidence for the poor numeracy of LLMs. Finally, we discuss the generalisation of our findings to other task domains and their limitations.