We present an online model-based reinforcement learning algorithm suitable for controlling complex robotic systems directly in the real world. Unlike prevailing sim-to-real pipelines, which rely on extensive offline simulation and model-free policy optimization, our method builds a dynamics model from real-time interaction data and uses that learned model to guide policy updates. This model-based scheme significantly reduces the number of samples needed to train control policies, making it practical to train directly on real-world rollout data. Training on real data in turn limits the influence of bias in simulated data and facilitates the search for high-performance control policies. Using online optimization analysis, we derive sublinear regret bounds under stochastic online optimization assumptions, which provide formal guarantees of performance improvement as more interaction data are collected. We evaluate the algorithm on a hydraulic excavator arm and a soft robot arm, where it is markedly more sample-efficient than model-free reinforcement learning methods and reaches comparable performance within hours. It also adapts robustly to shifting dynamics when the payload condition is randomized. Our approach paves the way toward efficient and reliable on-robot learning for a broad class of challenging control tasks.
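To make the loop described above concrete, the following is a minimal sketch of one way an online model-based update can be organized: act on the system, refit a dynamics model from the new transition, then take a policy step against the model's predicted cost. It is not the algorithm of this paper. The scalar linear plant, the regularized least-squares model, the linear feedback policy, and every name and constant (`run_online_mbrl`, `a_true`, `b_true`, `lr`, `r`) are assumptions chosen purely for illustration.

```python
import random


def run_online_mbrl(steps=500, seed=0, lr=0.1, r=0.1):
    """Toy online model-based RL loop on a scalar linear system (illustrative only)."""
    rng = random.Random(seed)

    # Hypothetical "real" plant x' = a_true*x + b_true*u + noise, unknown to the learner.
    a_true, b_true = 1.2, 0.5

    # Learned dynamics model x' ~ a_hat*x + b_hat*u, fitted by regularized least squares.
    a_hat, b_hat = 0.0, 0.0
    sxx, sxu, suu = 1e-3, 0.0, 1e-3  # Gram matrix entries (small ridge term on the diagonal)
    sxy, suy = 0.0, 0.0              # cross terms with the observed next state

    k = 0.0  # policy parameter: u = -k * x
    x = 1.0  # initial state

    for _ in range(steps):
        # 1) Act on the real system, with a little exploration noise.
        u = -k * x + 0.1 * rng.gauss(0.0, 1.0)
        x_next = a_true * x + b_true * u + 0.01 * rng.gauss(0.0, 1.0)

        # 2) Update the dynamics model from the new transition (2x2 normal equations).
        sxx += x * x
        sxu += x * u
        suu += u * u
        sxy += x * x_next
        suy += u * x_next
        det = sxx * suu - sxu * sxu
        a_hat = (sxy * suu - sxu * suy) / det
        b_hat = (sxx * suy - sxu * sxy) / det

        # 3) Policy step guided by the model: one gradient step on the predicted
        #    one-step cost per unit x^2, c(k) = (a_hat - b_hat*k)^2 + r*k^2.
        grad = -2.0 * b_hat * (a_hat - b_hat * k) + 2.0 * r * k
        k -= lr * grad

        x = x_next

    return a_hat, b_hat, k


if __name__ == "__main__":
    a_hat, b_hat, k = run_online_mbrl()
    print("learned model:", a_hat, b_hat, "policy gain:", k)
```

In this toy setting the model and the policy are updated after every real transition, so no simulator is involved. A real robotic system would need a far richer model class and policy, and the regret analysis mentioned above does not apply to this sketch.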