ActionReasoningBench: Reasoning about Actions with and without Ramification Constraints

Reasoning about Actions and Change (RAC) has historically played a pivotal role in solving foundational AI problems, such as the frame problem, and has driven advances in areas such as non-monotonic and commonsense reasoning. RAC remains crucial for AI systems that operate in dynamic environments, engage in interactive scenarios, or rely on commonsense reasoning. Despite substantial advances made by Large Language Models (LLMs) in various AI domains, their performance in RAC remains underexplored. To address this gap, we introduce a new diagnostic benchmark, ActionReasoningBench, which encompasses 8 domains and includes questions for up to 19 action sequences. This benchmark rigorously evaluates LLMs across six key RAC dimensions: Fluent Tracking, State Tracking, Action Executability, Effects of Actions, Numerical RAC, and Composite Questions. LLMs achieve average accuracies of 73.55%, 65.63%, 58.73%, and 62.38% on the first four dimensions, which are frequently discussed in the RAC literature. On the last two dimensions, however, which introduce complex and novel reasoning questions, average LLM performance drops to 33.16% and 51.19%, respectively, reflecting a 17.9% performance decline. We also introduce new ramification constraints to capture the indirect effects of actions, providing deeper insights into RAC challenges. Our evaluation of state-of-the-art LLMs, including both open-source and commercial models, reveals challenges across all RAC dimensions, particularly in handling ramifications, with GPT-4o failing to solve any question and o1-preview achieving a score of only 18.4%.
