视觉环境交互规划在具身复杂问答中的应用 (Visual Environment-Interactive Planning for Embodied Complex-Question Answering)

This study focuses on Embodied Complex-Question Answering task, which means the embodied robot need to understand human questions with intricate structures and abstract semantics. The core of this task lies in making appropriate plans based on the perception of the visual environment. Existing methods often generate plans in a once-for-all manner, i.e., one-step planning. Such approach rely on large models, without sufficient understanding of the environment. Considering multi-step planning, the framework for formulating plans in a sequential manner is proposed in this paper. To ensure the ability of our framework to tackle complex questions, we create a structured semantic space, where hierarchical visual perception and chain expression of the question essence can achieve iterative interaction. This space makes sequential task planning possible. Within the framework, we first parse human natural language based on a visual hierarchical scene graph, which can clarify the intention of the question. Then, we incorporate external rules to make a plan for current step, weakening the reliance on large models. Every plan is generated based on feedback from visual perception, with multiple rounds of interaction until an answer is obtained. This approach enables continuous feedback and adjustment, allowing the robot to optimize its action strategy. To test our framework, we contribute a new dataset with more complex questions. Experimental results demonstrate that our approach performs excellently and stably on complex tasks. And also, the feasibility of our approach in real-world scenarios has been established, indicating its practical applicability.

翻译：本研究聚焦于具身复杂问答任务，即具身机器人需要理解具有复杂结构和抽象语义的人类问题。该任务的核心在于基于视觉环境感知制定适当的规划。现有方法通常采用一次性规划方式，即单步规划。此类方法依赖大模型，对环境理解不足。考虑到多步规划，本文提出了一种顺序生成规划的框架。为确保框架处理复杂问题的能力，我们构建了一个结构化语义空间，其中层次化视觉感知与问题本质的链式表达可实现迭代交互。该空间使顺序任务规划成为可能。在框架内，我们首先基于视觉层次场景图解析人类自然语言，以明确问题意图。随后，我们引入外部规则制定当前步骤的规划，从而减弱对大模型的依赖。每个规划均基于视觉感知反馈生成，通过多轮交互直至获得答案。该方法支持持续反馈与调整，使机器人能够优化其行动策略。为验证框架性能，我们构建了一个包含更复杂问题的新数据集。实验结果表明，我们的方法在复杂任务上表现优异且稳定。同时，该方法在真实场景中的可行性已得到验证，证明了其实际应用价值。