Reinforcement Learning (RL) has demonstrated remarkable and generalizable capability in legged locomotion through sim-to-real transfer. However, while adaptive methods such as domain randomization are expected to make policies more robust to diverse environments, such comprehensiveness can detract from the policy's performance in any specific environment, consistent with the No Free Lunch theorem, leading to a suboptimal solution once deployed in the real world. To address this issue, we propose a lifelong policy adaptation framework named LoopSR, which utilizes a transformer-based encoder to project real-world trajectories into a latent space and accordingly reconstructs the real-world environment back in simulation for further improvement. An autoencoder architecture and contrastive learning are adopted to better extract the characteristics of real-world dynamics. The simulation parameters for continual training are derived by combining the parameters predicted by the decoder with parameters retrieved from a simulation trajectory dataset. By leveraging continual training, LoopSR achieves superior data efficiency compared with strong baselines, requiring only a limited amount of data to attain strong performance in both sim-to-sim and sim-to-real experiments.
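The parameter-derivation step described above (blending decoder-predicted simulation parameters with parameters retrieved from a simulation trajectory dataset) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the nearest-neighbor retrieval in latent space and the fixed blending weight `alpha` are assumptions introduced here for clarity.

```python
import numpy as np

def retrieve_params(z_real, sim_latents, sim_params, k=3):
    """Retrieve simulation parameters whose trajectories embed closest to
    the real-world latent z_real (hypothetical k-NN retrieval scheme)."""
    dists = np.linalg.norm(sim_latents - z_real, axis=1)  # (N,) latent distances
    nearest = np.argsort(dists)[:k]                       # indices of k closest
    return sim_params[nearest].mean(axis=0)               # average their parameters

def combine_params(predicted, retrieved, alpha=0.5):
    """Blend decoder-predicted and retrieved parameters; alpha is an
    assumed weighting not specified in the abstract."""
    return alpha * predicted + (1.0 - alpha) * retrieved
```

With the combined parameters in hand, the simulator would be re-instantiated and the policy continually trained on the reconstructed environment.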