Visual planning simulates how humans make decisions to achieve desired goals by searching for visual causal transitions between an initial visual state and a final visual goal state. It has become increasingly important in egocentric vision because it can guide agents through daily tasks in complex environments. In this paper, we propose an interpretable and generalizable visual planning framework consisting of i) a novel Substitution-based Concept Learner (SCL) that abstracts visual inputs into disentangled concept representations, ii) symbol abstraction and reasoning that performs task planning via the self-learned symbols, and iii) a Visual Causal Transition model (ViCT) that grounds visual causal transitions to semantically similar real-world actions. Given an initial state, we perform goal-conditioned visual planning with a symbolic reasoning method driven by the learned representations and causal transitions to reach the goal state. To verify the effectiveness of the proposed model, we collect a large-scale visual planning dataset based on AI2-THOR, dubbed CCTP. Extensive experiments on this challenging dataset demonstrate the superior performance of our method in visual task planning. Empirically, we show that our framework generalizes to unseen task trajectories and unseen object categories.
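To make the idea of goal-conditioned planning over symbolic states concrete, the following is a minimal sketch and not the paper's implementation. It assumes, purely for illustration, that a state is a set of (object, attribute, value) symbols and that each causal transition is a hand-written pair of preconditions and effects; in the proposed framework these would come from the learned concept representations and the learned transition model. Plain breadth-first search stands in for the symbolic reasoning method, and the names `apply_transition`, `plan`, and the cup example are all hypothetical.

```python
from collections import deque

# Hypothetical symbolic state: a frozenset of (object, attribute, value) triples.
# Hypothetical transition: a (preconditions, effects) pair of frozensets.


def apply_transition(state, transition):
    """Return the successor state, or None if the preconditions do not hold."""
    preconditions, effects = transition
    if not preconditions <= state:
        return None
    # An effect overwrites any existing value of the same (object, attribute).
    overwritten = {(obj, attr) for obj, attr, _ in effects}
    kept = {t for t in state if (t[0], t[1]) not in overwritten}
    return frozenset(kept | effects)


def plan(initial, goal, transitions):
    """Breadth-first search for a shortest action sequence reaching the goal."""
    frontier = deque([(initial, [])])
    seen = {initial}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:  # every goal symbol is satisfied
            return path
        for name, transition in transitions.items():
            successor = apply_transition(state, transition)
            if successor is not None and successor not in seen:
                seen.add(successor)
                frontier.append((successor, path + [name]))
    return None  # goal unreachable with the given transitions


# Toy example (assumed, not taken from the CCTP dataset).
initial = frozenset({("cup", "location", "table"), ("cup", "filled", "no")})
goal = frozenset({("cup", "filled", "yes")})
transitions = {
    "move_to_sink": (
        frozenset({("cup", "location", "table")}),
        frozenset({("cup", "location", "sink")}),
    ),
    "fill": (
        frozenset({("cup", "location", "sink"), ("cup", "filled", "no")}),
        frozenset({("cup", "filled", "yes")}),
    ),
}

print(plan(initial, goal, transitions))
```

The search returns the sequence of transition names leading from the initial state to a state that contains every goal symbol, or `None` when no such sequence exists. Because planning operates on symbols rather than pixels, each step of the resulting plan can be inspected directly, which is the sense in which a symbolic planner of this kind is interpretable.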