ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking

Classical robotic systems typically rely on custom planners designed for constrained environments. While effective in those settings, such systems generalize poorly, which limits the scalability of embodied AI and general-purpose robots. Recent data-driven Vision-Language-Action (VLA) approaches aim to learn policies from large-scale simulation and real-world data. However, the continuous action space of the physical world far exceeds the representational capacity of linguistic tokens, so it remains unclear whether scaling data alone can yield general robotic intelligence. To address this gap, we propose ActionReasoning, an LLM-driven framework that performs explicit action reasoning to produce physics-consistent, prior-guided decisions for robotic manipulation. ActionReasoning leverages the physical priors and real-world knowledge already encoded in Large Language Models (LLMs) and structures them within a multi-agent architecture. We instantiate the framework on a tractable case study, brick stacking, in which the environment states are assumed to have been measured accurately. These states are serialized and passed to the multi-agent LLM framework, which generates physics-aware action plans. Experiments show that the proposed framework achieves stable brick placement while shifting effort from low-level, domain-specific coding to high-level tool invocation and prompting, which highlights its potential for broader generalization. By integrating physical reasoning with LLMs, this work offers a promising approach to bridging perception and execution in robotic manipulation.
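
To make the idea of serialized environment states and tool invocation more concrete, the following is a minimal sketch and not the paper's implementation. The `Brick` fields, the JSON layout, and the `is_supported` check are all hypothetical choices made for illustration: the abstract does not specify the state format or the tools the agents call. The check is also deliberately simple: it only tests whether the upper brick rests on the lower one with its center inside the lower brick's footprint.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Brick:
    """Hypothetical brick state: center position and dimensions in metres."""
    name: str
    x: float
    y: float
    z: float
    length: float  # extent along x
    width: float   # extent along y
    height: float  # extent along z


def serialize_scene(bricks):
    """Serialize measured brick states to a JSON string for an LLM prompt."""
    return json.dumps({"bricks": [asdict(b) for b in bricks]}, sort_keys=True)


def is_supported(top, bottom, tol=1e-6):
    """Assumed stability tool: True if `top` rests on `bottom` and the center
    of `top` lies within the footprint of `bottom`."""
    # Bottom face of the upper brick must touch the top face of the lower one.
    resting = abs((top.z - top.height / 2) - (bottom.z + bottom.height / 2)) <= tol
    # Center of the upper brick must project inside the lower brick's footprint.
    inside_x = abs(top.x - bottom.x) <= bottom.length / 2
    inside_y = abs(top.y - bottom.y) <= bottom.width / 2
    return resting and inside_x and inside_y


if __name__ == "__main__":
    base = Brick("base", 0.0, 0.0, 0.03, 0.20, 0.10, 0.06)
    candidate = Brick("candidate", 0.05, 0.0, 0.09, 0.20, 0.10, 0.06)
    print(serialize_scene([base, candidate]))
    print(is_supported(candidate, base))
```

In a setup like this, the JSON string would be embedded in an agent's prompt, and a function such as `is_supported` would be exposed as a callable tool so that a proposed placement can be checked before execution.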
