We introduce EfficientTDMPC, a sample-efficient model-based reinforcement learning method for continuous control built on the TD-MPC family of algorithms. Central to this family is a planner that searches for an action sequence that maximizes the estimated return. The return is estimated using a learned model and value networks, each of which can introduce error. EfficientTDMPC aims to reduce this error in two ways. First, it introduces an ensemble of dynamics models and averages the return estimates across those models and across different rollout depths. Second, it adds an optional uncertainty penalty to the planner objective, yielding a planner that avoids actions with uncertain return estimates. It also adds practical improvements that increase buffer data freshness and reduce compute. Lastly, we find that these contributions enable EfficientTDMPC to benefit more from a higher update-to-data (UTD) ratio, further improving sample efficiency. To the best of our knowledge, in the low-data regime of each benchmark, EfficientTDMPC achieves state-of-the-art (SOTA) sample efficiency on HumanoidBench-Hard and DMC hard, while matching SOTA on DMC easy.
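The two error-reduction ideas in the abstract (averaging return estimates over ensemble members and rollout depths, then optionally penalizing uncertainty) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `planner_score`, the array layout, the use of the standard deviation across all estimates as the uncertainty measure, and the `penalty` coefficient are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def planner_score(rewards, values, gamma=0.99, penalty=0.0):
    """Score one candidate action sequence (hypothetical helper).

    rewards: array of shape (M, H); predicted reward at each of H steps
             under each of M dynamics models.
    values:  array of shape (M, H); value estimate of the state reached
             after step h + 1 under model m, used to bootstrap.
    penalty: assumed coefficient of the uncertainty penalty (0 disables it).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    horizon = rewards.shape[1]

    # Discount factors gamma^0, ..., gamma^(H-1).
    discounts = gamma ** np.arange(horizon)

    # Discounted reward accumulated through step h, for every model.
    cumulative = np.cumsum(discounts * rewards, axis=1)

    # Return estimate at each rollout depth: accumulated reward plus the
    # discounted bootstrap value of the state reached at that depth.
    returns = cumulative + gamma * discounts * values

    # Average over models and depths; assumed uncertainty is the spread
    # (population standard deviation) of the same set of estimates.
    return returns.mean() - penalty * returns.std()
```

In a sampling-based planner, a score like this would be computed for each candidate action sequence and the highest-scoring candidates kept. With `penalty=0.0` the score reduces to the plain ensemble-and-depth average; a positive `penalty` lowers the score of candidates whose estimates disagree.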