Learning control policies in simulation enables rapid, safe, and cost-effective development of advanced robotic capabilities. However, transferring these policies to the real world remains difficult due to the sim-to-real gap, where unmodeled dynamics and environmental disturbances can degrade policy performance. Existing approaches, such as domain randomization and Real2Sim2Real pipelines, can improve policy robustness, but either struggle under out-of-distribution conditions or require costly offline retraining. In this work, we approach these problems from a different perspective. Instead of relying on diverse training conditions before deployment, we focus on rapidly adapting the learned policy in the real world in an online fashion. To achieve this, we propose a novel online adaptive learning framework that unifies residual dynamics learning with real-time policy adaptation inside a differentiable simulation. Starting from a simple dynamics model, our framework refines the model continuously with real-world data to capture unmodeled effects and disturbances such as payload changes and wind. The refined dynamics model is embedded in a differentiable simulation framework, enabling gradient backpropagation through the dynamics and thus rapid, sample-efficient policy updates beyond the reach of classical RL methods like PPO. All components of our system are designed for rapid adaptation, enabling the policy to adjust to unseen disturbances within 5 seconds of training. We validate the approach on agile quadrotor control under various disturbances in both simulation and the real world. Our framework reduces hovering error by up to 81% compared to L1-MPC and 55% compared to DATT, while also demonstrating robustness in vision-based control without explicit state estimation.
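The core mechanism described above (a nominal dynamics model augmented with a learned residual, embedded in a differentiable simulation so that a tracking loss can be backpropagated through the dynamics into the policy parameters) can be illustrated with a minimal, hypothetical sketch. This is not the paper's implementation: the point-mass model, the linear residual, the linear policy, and all parameter names below are illustrative assumptions chosen only to show how first-order policy updates through the dynamics work.

```python
import jax
import jax.numpy as jnp

DT = 0.05  # integration step (illustrative)

def nominal_step(x, u):
    # Simple nominal model: unit point mass, x = [position, velocity]
    pos, vel = x
    return jnp.stack([pos + vel * DT, vel + u * DT])

def residual(params_res, x, u):
    # Learned linear correction standing in for unmodeled effects
    # (e.g. drag, payload changes); starts at zero and would be
    # refined online from real-world data.
    feats = jnp.stack([x[0], x[1], u])
    return params_res @ feats  # shape (2,)

def policy(params_pi, x):
    # Linear state-feedback policy (scalar action)
    return params_pi @ x

def rollout_loss(params_pi, params_res, x0, target, horizon=40):
    # Differentiable rollout: gradients flow through the refined
    # dynamics (nominal + residual) into the policy parameters.
    def step(x, _):
        u = policy(params_pi, x)
        x_next = nominal_step(x, u) + residual(params_res, x, u) * DT
        return x_next, jnp.sum((x_next - target) ** 2)
    _, costs = jax.lax.scan(step, x0, None, length=horizon)
    return jnp.mean(costs)

# First-order policy update: gradient of the rollout loss w.r.t. the
# policy parameters, obtained by backprop through the dynamics.
grad_fn = jax.jit(jax.grad(rollout_loss))

params_pi = jnp.array([-1.0, -1.0])   # initial PD-like gains
params_res = jnp.zeros((2, 3))        # residual initially zero
x0 = jnp.array([1.0, 0.0])
target = jnp.array([0.0, 0.0])

loss_before = rollout_loss(params_pi, params_res, x0, target)
for _ in range(20):
    params_pi = params_pi - 0.01 * grad_fn(params_pi, params_res, x0, target)
loss_after = rollout_loss(params_pi, params_res, x0, target)
```

The sample efficiency the abstract contrasts with PPO comes from exactly this structure: a zeroth-order RL method would estimate the policy gradient from many sampled rollouts, whereas here a handful of analytic gradient steps through the (continuously refined) dynamics model suffice.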