In offline reinforcement learning (RL), an RL agent learns to solve a task using only a fixed dataset of previously collected data. While offline RL has been successful in learning real-world robot control policies, it typically requires large amounts of expert-quality data to learn effective policies that generalize to out-of-distribution states. Unfortunately, such data is often difficult and expensive to acquire in real-world tasks. Several recent works have leveraged data augmentation (DA) to inexpensively generate additional data, but most DA works apply augmentations in a random fashion and ultimately produce highly suboptimal augmented experience. In this work, we propose Guided Data Augmentation (GuDA), a human-guided DA framework that generates expert-quality augmented data. The key insight behind GuDA is that while it may be difficult to demonstrate the sequence of actions required to produce expert data, a user can often easily characterize when an augmented trajectory segment represents progress toward task completion. Thus, a user can restrict the space of possible augmentations to automatically reject suboptimal augmented data. To extract a policy from GuDA, we use off-the-shelf offline reinforcement learning and behavior cloning algorithms. We evaluate GuDA on a physical robot soccer task as well as simulated D4RL navigation tasks, a simulated autonomous driving task, and a simulated soccer task. Empirically, GuDA enables learning given a small initial dataset of potentially suboptimal experience and outperforms a random DA strategy as well as a model-based DA strategy.
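The guided-augmentation idea above can be illustrated as rejection sampling over candidate augmentations. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes 2D position states, rotation and translation as the augmentation functions, and a user-supplied progress check that compares distance to a goal. All names (`augment_segment`, `makes_progress`, `guided_augment`) are hypothetical and chosen for illustration.

```python
import math
import random


def rotate(p, theta):
    """Rotate a 2D point p about the origin by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])


def augment_segment(segment, theta, offset):
    """Rotate, then translate, every (state, next_state) pair in a segment.

    Assumption: states are 2D positions. Actions are omitted here; in a real
    task they would need a consistent transformation as well.
    """
    def move(p):
        x, y = rotate(p, theta)
        return (x + offset[0], y + offset[1])

    return [(move(s), move(s_next)) for s, s_next in segment]


def makes_progress(segment, goal):
    """Hypothetical user-supplied rule: the segment must end closer to the
    goal than it starts."""
    start, end = segment[0][0], segment[-1][1]
    return math.dist(end, goal) < math.dist(start, goal)


def guided_augment(segment, goal, n_samples, rng):
    """Sample random augmentations and keep only those the rule accepts."""
    accepted = []
    for _ in range(n_samples):
        theta = rng.uniform(-math.pi, math.pi)
        offset = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        candidate = augment_segment(segment, theta, offset)
        if makes_progress(candidate, goal):
            accepted.append(candidate)
    return accepted


# Usage: one short segment moving along +x, goal far to the right.
segment = [((0.0, 0.0), (0.1, 0.0)), ((0.1, 0.0), (0.2, 0.0))]
goal = (5.0, 0.0)
augmented = guided_augment(segment, goal, n_samples=200, rng=random.Random(0))
```

A purely random DA strategy would keep all 200 candidates, including those that move away from the goal. In this sketch, the progress rule is the only place where human guidance enters.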