Visual planning simulates how humans make decisions to achieve desired goals by searching for visual causal transitions between an initial visual state and a final visual goal state. It has become increasingly important in egocentric vision, owing to its advantages in guiding agents to perform daily tasks in complex environments. In this paper, we propose an interpretable and generalizable visual planning framework consisting of i) a novel Substitution-based Concept Learner (SCL) that abstracts visual inputs into disentangled concept representations, ii) symbol abstraction and reasoning that performs task planning via self-learned symbols, and iii) a Visual Causal Transition model (ViCT) that grounds visual causal transitions to semantically similar real-world actions. Given an initial state, we perform goal-conditioned visual planning with a symbolic reasoning method, fueled by the learned representations and causal transitions, to reach the goal state. To verify the effectiveness of the proposed model, we collect a large-scale visual planning dataset based on AI2-THOR, dubbed CCTP. Extensive experiments on this challenging dataset demonstrate the superior performance of our method in visual task planning. Empirically, we show that our framework generalizes to unseen task trajectories, unseen object categories, and real-world data. Further details of this work are provided at https://fqyqc.github.io/ConTranPlan/.