Reinforcement Learning (RL) and Imitation Learning (IL) have made great progress in robotic control in recent years. However, the performance of these methods degrades noticeably on new tasks that must be completed through new combinations of actions. RL methods rely heavily on reward functions that do not generalize well to new tasks, while IL methods are limited by expert demonstrations that do not cover new tasks. In contrast, humans can easily complete such tasks using fragmented knowledge learned from task-agnostic experience. Inspired by this observation, this paper proposes a task-agnostic learning method (TAL for short) that learns fragmented knowledge from task-agnostic data in order to accomplish new tasks. TAL consists of four stages. First, task-agnostic exploration is performed to collect data from interactions with the environment, and the collected data is organized as a knowledge graph. Compared with the sequential structure used previously, the knowledge graph representation is more compact and better suited to environment exploration. Second, an action feature extractor is proposed and trained on the collected knowledge graph data to learn task-agnostic fragmented knowledge. Third, a candidate action generator is designed, which applies the action feature extractor to a new task to generate multiple candidate action sets. Finally, an action proposal module is designed to produce probabilities for the actions in a new task according to the environmental information. These probabilities are then used to select the actions to be executed from the multiple candidate action sets, forming the plan. Experiments on a virtual indoor scene show that the proposed method outperforms the state-of-the-art offline RL method CQL (Conservative Q-Learning) by 35.28% and the IL method BC (Behavior Cloning) by 22.22%.
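As a rough illustration of the first stage, the sketch below merges task-agnostic exploration trajectories into a graph and then composes fragments from separate trajectories into a plan for an unseen task. It is not the paper's implementation: the state and action names, the helper functions `build_graph`, `count_edges`, and `compose_plan`, and the assumption of deterministic transitions are all hypothetical, and a plain breadth-first search stands in for the learned action feature extractor, candidate action generator, and action proposal.

```python
from collections import defaultdict, deque


def build_graph(trajectories):
    """Merge exploration trajectories into a graph: state -> {action: next_state}.

    Assumes deterministic transitions (an assumption of this sketch).
    """
    graph = defaultdict(dict)
    for trajectory in trajectories:
        for state, action, next_state in trajectory:
            # A transition seen in several trajectories is stored as one edge.
            graph[state][action] = next_state
    return graph


def count_edges(graph):
    """Number of distinct (state, action) edges stored in the graph."""
    return sum(len(edges) for edges in graph.values())


def compose_plan(graph, start, goal):
    """Breadth-first search for an action sequence from start to goal.

    This is a stand-in for TAL's learned stages, not the paper's method.
    Returns a list of actions, or None if the goal is unreachable.
    """
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for action, next_state in graph.get(state, {}).items():
            if next_state not in visited:
                visited.add(next_state)
                queue.append((next_state, actions + [action]))
    return None


# Hypothetical task-agnostic exploration data: (state, action, next_state).
trajectories = [
    [("at_door", "walk_to_table", "at_table"),
     ("at_table", "pick_cup", "holding_cup")],
    [("at_door", "walk_to_table", "at_table"),
     ("at_table", "walk_to_sink", "at_sink")],
    [("holding_cup", "walk_to_sink", "cup_at_sink")],
]

graph = build_graph(trajectories)
total_transitions = sum(len(t) for t in trajectories)

# No single trajectory carries the cup to the sink, but the graph
# lets fragments from different trajectories be chained together.
plan = compose_plan(graph, "at_door", "cup_at_sink")
print(total_transitions, count_edges(graph), plan)
```

In this toy data, five recorded transitions collapse into four graph edges because the shared `walk_to_table` step is stored once, which hints at why a graph can be more compact than a set of sequences. The plan for the unseen task is assembled from pieces of three different trajectories.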