Sim-to-real transfer trains reinforcement learning (RL) agents in simulated environments and then deploys them in the real world. It is widely used in practice because collecting samples in simulation is often cheaper, safer, and much faster than collecting them in the real world. Despite the empirical success of sim-to-real transfer, its theoretical foundations are much less understood. In this paper, we study sim-to-real transfer in continuous domains with partial observations, where both the simulated and the real-world environments are modeled as linear quadratic Gaussian (LQG) systems. We show that a popular robust adversarial training algorithm can learn, from the simulated environment, a policy that is competitive with the optimal policy in the real-world environment. To achieve our results, we design a new algorithm for infinite-horizon average-cost LQGs and establish a regret bound that depends on the intrinsic complexity of the model class. Our algorithm crucially relies on a novel history clipping scheme, which might be of independent interest.
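For readers unfamiliar with the setting, the block below sketches a common textbook formulation of a partially observed LQG system with an infinite-horizon average cost. The symbols ($A$, $B$, $C$, $W$, $Z$, $Q$, $R$, $J$) and the choice to place the quadratic cost on the observation $y_t$ are assumptions made for illustration; the paper's own notation and cost definition may differ.

```latex
% Assumed notation: hidden state x_t, control u_t, observation y_t.
\begin{align}
  x_{t+1} &= A x_t + B u_t + w_t, & w_t &\sim \mathcal{N}(0, W), \\
  y_t     &= C x_t + z_t,         & z_t &\sim \mathcal{N}(0, Z), \\
  J(\pi)  &= \limsup_{T \to \infty} \frac{1}{T}\,
             \mathbb{E}_{\pi}\!\left[ \sum_{t=1}^{T}
             \left( y_t^{\top} Q\, y_t + u_t^{\top} R\, u_t \right) \right].
\end{align}
```

Under this sketch, the agent never sees $x_t$ directly and must choose $u_t$ from past observations and controls, which is what makes the problem partially observed. The simulator and the real world would correspond to two parameter tuples $(A, B, C)$ drawn from the same model class.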