Leveraging Optimal Transport for Enhanced Offline Reinforcement Learning in Surgical Robotic Environments

Reinforcement Learning (RL) methods are traditionally studied in an active learning setting, where agents interact directly with their environments, observe the outcomes of their actions, and learn through trial and error. However, allowing partially trained agents to interact with real physical systems poses significant challenges, including high costs, safety risks, and the need for constant supervision. Offline RL addresses these cost and safety concerns by learning from existing datasets, which reduces the need for resource-intensive real-time interaction. A substantial challenge remains, however: these datasets must be annotated with rewards. In this paper, we introduce Optimal Transport Reward (OTR) labelling, an algorithm that assigns rewards to offline trajectories using a small number of high-quality expert demonstrations. The core principle of OTR is to use Optimal Transport (OT) to compute an optimal alignment between an unlabeled trajectory from the dataset and an expert demonstration. This alignment yields a similarity measure that can be interpreted as a reward signal. An offline RL algorithm can then use these reward signals to learn a policy. This approach removes the need for handcrafted rewards and makes it possible to use large unlabeled datasets for policy learning. Using SurRoL, a simulation platform tailored for surgical robot learning, we generate datasets and use them to train policies with the OTR algorithm. By demonstrating the efficacy of OTR in a different domain, surgical robotics, we emphasize its versatility and its potential to expedite RL deployment across a wide range of fields.
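The reward-labelling step described above can be illustrated with a minimal sketch. It assumes details the abstract does not state: a cosine cost between states, uniform weights over time steps, and an entropy-regularised OT problem solved with Sinkhorn iterations. The function names (`cosine_cost`, `sinkhorn_plan`, `ot_rewards`) and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine_cost(traj, expert):
    """Pairwise cosine distance between two sequences of state vectors."""
    a = traj / (np.linalg.norm(traj, axis=1, keepdims=True) + 1e-8)
    b = expert / (np.linalg.norm(expert, axis=1, keepdims=True) + 1e-8)
    return 1.0 - a @ b.T

def sinkhorn_plan(cost, epsilon=0.1, n_iters=200):
    """Approximate OT plan between uniform marginals (assumed) via Sinkhorn."""
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)   # weight of each step of the unlabeled trajectory
    nu = np.full(m, 1.0 / m)   # weight of each step of the expert demonstration
    kernel = np.exp(-cost / epsilon)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = mu / (kernel @ v)
        v = nu / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

def ot_rewards(traj, expert, epsilon=0.1):
    """Per-step reward: negative transport cost attributed to each step."""
    cost = cosine_cost(traj, expert)
    plan = sinkhorn_plan(cost, epsilon)
    return -(plan * cost).sum(axis=1)

# Toy data (hypothetical): 10 steps of 4-dimensional states with positive entries.
rng = np.random.default_rng(0)
expert = rng.uniform(0.5, 1.5, size=(10, 4))
rewards_same = ot_rewards(expert, expert)      # trajectory identical to the expert
rewards_flipped = ot_rewards(-expert, expert)  # states pointing the opposite way
print(rewards_same.sum(), rewards_flipped.sum())
```

In this toy setting the trajectory that matches the expert receives a higher total reward than the one whose states point in the opposite direction. The resulting per-step rewards could then be attached to the transitions of each trajectory before handing the dataset to an offline RL algorithm.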
