Transfer Learning (TL) is a powerful tool that enables robots to transfer learned policies across different environments, tasks, or embodiments. To further facilitate this process, efforts have been made to combine it with Learning from Demonstrations (LfD) for more flexible and efficient policy transfer. However, these approaches are almost exclusively limited to offline demonstrations collected before policy transfer starts, which may suffer from the covariate shift issue intrinsic to LfD and harm the performance of policy transfer. Meanwhile, extensive work in the learning-from-scratch setting has shown that online demonstrations can effectively alleviate covariate shift and lead to better policy performance with improved sample efficiency. This work combines these insights to introduce online demonstrations into the policy transfer setting. We present Policy Transfer with Online Demonstrations, an active LfD algorithm for policy transfer that optimizes the timing and content of queries for online episodic expert demonstrations under a limited demonstration budget. We evaluate our method in eight robotic scenarios involving policy transfer across diverse environment characteristics, task objectives, and robotic embodiments, with the aim of transferring a trained policy from a source task to a related but different target task. The results show that our method significantly outperforms all baselines, namely two canonical LfD methods with offline demonstrations and one active LfD method with online demonstrations, in terms of average success rate and sample efficiency. Additionally, we conduct preliminary sim-to-real tests of the transferred policy on three transfer scenarios in a real-world environment, demonstrating its effectiveness on a real robot manipulator.
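The budgeted active-query idea mentioned above can be illustrated with a minimal sketch. Note that this is not the paper's actual algorithm: the class name, the uncertainty-threshold trigger, and all parameters are hypothetical, standing in for whatever criterion the method uses to decide when a query for an online expert demonstration is worthwhile.

```python
class ActiveDemoQuerier:
    """Illustrative sketch of a budgeted active-query rule (hypothetical):
    request an online expert demonstration only while budget remains and
    some uncertainty estimate of the transferred policy is high."""

    def __init__(self, budget: int, threshold: float = 0.5):
        self.budget = budget        # maximum number of expert episodes allowed
        self.used = 0               # expert episodes consumed so far
        self.threshold = threshold  # uncertainty level that triggers a query

    def should_query(self, uncertainty: float) -> bool:
        # Never exceed the demonstration budget.
        if self.used >= self.budget:
            return False
        # Query the expert only when the policy looks uncertain enough.
        if uncertainty > self.threshold:
            self.used += 1
            return True
        return False
```

In a training loop, `should_query` would be evaluated at episode boundaries: a `True` result means the learner collects one expert episode (the "timing" decision), while the choice of which state to query from would govern the "content" of the demonstration.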