ActionReasoningBench: Reasoning about Actions with and without Ramification Constraints

Reasoning about Actions and Change (RAC) has historically played a pivotal role in solving foundational AI problems, such as the frame problem, and has driven advances in AI fields such as non-monotonic and commonsense reasoning. RAC remains crucial for AI systems that operate in dynamic environments, engage in interactive scenarios, or rely on commonsense reasoning. Despite the substantial advances made by Large Language Models (LLMs) in various AI domains, their performance on RAC remains underexplored. To address this gap, we introduce a new diagnostic benchmark, ActionReasoningBench, which encompasses 8 domains and includes questions for up to 19 action sequences. The benchmark rigorously evaluates LLMs across six key RAC dimensions: Fluent Tracking, State Tracking, Action Executability, Effects of Actions, Numerical RAC, and Composite Questions. LLMs achieve average accuracies of 73.55%, 65.63%, 58.73%, and 62.38%, respectively, on the first four dimensions, which are frequently discussed in the RAC literature. On the last two dimensions, which introduce complex and novel reasoning questions, average performance drops to 33.16% and 51.19%, respectively, reflecting a 17.9% performance decline. We also introduce new ramification constraints to capture the indirect effects of actions, providing deeper insights into RAC challenges. Our evaluation of state-of-the-art LLMs, including both open-source and commercial models, reveals challenges across all RAC dimensions, particularly in handling ramifications, with GPT-4o failing to solve any question and o1-preview achieving a score of only 18.4%.
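To make the terminology concrete, the following is a minimal sketch of what action executability, state tracking, and a ramification constraint can look like in a toy blocks-world setting. It is an illustration only, not the benchmark's actual domain encoding or question format: the fluent tuples, the `Action` class, and the helpers `with_ramifications`, `is_executable`, `apply`, `move`, and `run_plan` are all hypothetical names chosen for this example. Here `clear(x)` is treated as a derived fluent, so it changes as an indirect effect of moving blocks rather than being listed in any action's effects.

```python
from dataclasses import dataclass

BLOCKS = ("a", "b", "c")  # hypothetical toy domain, not taken from the benchmark


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # fluents that must hold before execution
    add_effects: frozenset    # direct effects made true
    del_effects: frozenset    # direct effects made false


def with_ramifications(direct):
    """Recompute the derived 'clear' fluents from the direct 'on' fluents.

    Assumed ramification constraint: a block is clear iff nothing is on it.
    """
    base = {f for f in direct if f[0] != "clear"}
    covered = {f[2] for f in base if f[0] == "on"}
    derived = {("clear", b) for b in BLOCKS if b not in covered}
    return frozenset(base | derived)


def is_executable(state, action):
    """Action executability: every precondition holds in the current state."""
    return action.preconditions <= state


def apply(state, action):
    """Effects of actions: apply direct effects, then close under ramifications."""
    if not is_executable(state, action):
        raise ValueError(f"{action.name} is not executable")
    return with_ramifications((state - action.del_effects) | action.add_effects)


def move(block, src, dst):
    """Build an action moving `block` from `src` onto `dst` (either may be 'table')."""
    pre = {("on", block, src), ("clear", block)}
    if dst != "table":
        pre.add(("clear", dst))
    return Action(
        f"move({block},{src},{dst})",
        frozenset(pre),
        frozenset({("on", block, dst)}),
        frozenset({("on", block, src)}),
    )


def run_plan(state, plan):
    """State tracking: return the last reached state and the index of the
    first non-executable action (None if the whole plan is executable)."""
    for i, action in enumerate(plan):
        if not is_executable(state, action):
            return state, i
        state = apply(state, action)
    return state, None


# Initially: a sits on b; b and c sit on the table.
initial = with_ramifications(
    {("on", "a", "b"), ("on", "b", "table"), ("on", "c", "table")}
)
plan = [move("a", "b", "c"), move("c", "table", "b")]
final, failed_at = run_plan(initial, plan)
print(sorted(final))
print(failed_at)
```

In this sketch the first move never mentions `clear` in its effects, yet afterwards `b` is clear and `c` is not; those are the indirect effects a ramification constraint is meant to capture. The second move then fails the executability check because `c` is no longer clear.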


