Reinforcement Learning (RL) has recently achieved remarkable success in robotic control. However, most RL methods operate in simulated environments where privileged knowledge (e.g., dynamics, surroundings, terrains) is readily available. Conversely, in real-world scenarios, robot agents usually rely solely on local states (e.g., proprioceptive feedback of robot joints) to select actions, leading to a significant sim-to-real gap. Existing methods address this gap by either gradually reducing the reliance on privileged knowledge or performing a two-stage policy imitation. However, we argue that these methods are limited in their ability to fully leverage the privileged knowledge, resulting in suboptimal performance. In this paper, we propose a novel single-stage privileged knowledge distillation method called the Historical Information Bottleneck (HIB) to narrow the sim-to-real gap. In particular, HIB learns a privileged knowledge representation from historical trajectories by capturing the underlying changeable dynamic information. Theoretical analysis shows that the learned privileged knowledge representation helps reduce the value discrepancy between the oracle and learned policies. Empirical experiments on both simulated and real-world tasks demonstrate that HIB yields improved generalizability compared to previous methods.