Knowledge-Guided Manipulation Using Multi-Task Reinforcement Learning

This paper introduces Knowledge Graph based Massively Multi-task Model-based Policy Optimization (KG-M3PO), a framework for multi-task robotic manipulation in partially observable settings that unifies Perception, Knowledge, and Policy. The method augments egocentric vision with an online 3D scene graph that grounds open-vocabulary detections into a metric, relational representation. A dynamic-relation mechanism updates spatial, containment, and affordance edges at every step, and a graph neural encoder is trained end-to-end through the RL objective so that relational features are shaped directly by control performance. Multiple observation modalities (visual, proprioceptive, linguistic, and graph-based) are encoded into a shared latent space, upon which the RL agent operates to drive the control loop. The policy conditions on lightweight graph queries alongside visual and proprioceptive inputs, yielding a compact, semantically informed state for decision making. Experiments on a suite of manipulation tasks with occlusions, distractors, and layout shifts demonstrate consistent gains over strong baselines: the knowledge-conditioned agent achieves higher success rates, improved sample efficiency, and stronger generalization to novel objects and unseen scene configurations. These results support the premise that structured, continuously maintained world knowledge is a powerful inductive bias for scalable, generalizable manipulation: when the knowledge module participates in the RL computation graph, relational representations align with control, enabling robust long-horizon behavior under partial observability.
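To make the dynamic-relation idea concrete, the following is a minimal sketch of how spatial and containment edges could be recomputed at every step from metric 3D bounding boxes. It is an illustration under stated assumptions, not the KG-M3PO implementation: the `Node` class, the functions `inside`, `on_top_of`, `update_edges`, and `query`, the axis-aligned box representation, and the 2 cm contact tolerance are all hypothetical. Affordance edges and the learned graph encoder are omitted.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Node:
    """A detected object grounded as an axis-aligned 3D box (metres). Hypothetical."""
    name: str
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner


def inside(a, b):
    """Containment: every extent of box a lies within box b."""
    return all(bl <= al and ah <= bh
               for al, ah, bl, bh in zip(a.lo, a.hi, b.lo, b.hi))


def on_top_of(a, b, tol=0.02):
    """Spatial: a overlaps b in the x-y plane and a's bottom touches b's top."""
    overlap_xy = all(a.lo[i] < b.hi[i] and b.lo[i] < a.hi[i] for i in (0, 1))
    return overlap_xy and abs(a.lo[2] - b.hi[2]) <= tol


def update_edges(nodes):
    """Recompute all relation edges from the current node geometry (one step)."""
    edges = set()
    for a, b in permutations(nodes, 2):
        if inside(a, b):
            edges.add((a.name, "inside", b.name))
        elif on_top_of(a, b):
            edges.add((a.name, "on_top_of", b.name))
    return edges


def query(edges, relation, target):
    """Lightweight graph query: which objects stand in `relation` to `target`?"""
    return sorted(s for s, r, t in edges if r == relation and t == target)


# Step t: the apple sits in the bowl, and the bowl rests on the table.
table = Node("table", (0.0, 0.0, 0.0), (1.0, 1.0, 0.75))
bowl = Node("bowl", (0.4, 0.4, 0.75), (0.6, 0.6, 0.85))
apple = Node("apple", (0.45, 0.45, 0.78), (0.55, 0.55, 0.84))
edges_t = update_edges([table, bowl, apple])

# Step t+1: the apple has been moved onto the table; edges are rebuilt.
apple_moved = Node("apple", (0.1, 0.1, 0.75), (0.2, 0.2, 0.81))
edges_t1 = update_edges([table, bowl, apple_moved])
```

In this sketch the edge set is rebuilt from scratch at each step, which keeps the graph consistent with the latest detections at quadratic cost in the number of objects. A policy could then condition on the result of calls such as `query(edges_t1, "on_top_of", "table")` alongside its visual and proprioceptive inputs.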
