Sample efficiency in Reinforcement Learning (RL) has traditionally been driven by algorithmic enhancements. In this work, we demonstrate that scaling can also lead to substantial improvements. We conduct a thorough investigation into the interplay of scaling model capacity and domain-specific RL enhancements. These empirical findings inform the design choices underlying our proposed BRO (Bigger, Regularized, Optimistic) algorithm. The key innovation behind BRO is that strong regularization allows for effective scaling of the critic networks, which, paired with optimistic exploration, leads to superior performance. BRO achieves state-of-the-art results, significantly outperforming the leading model-based and model-free algorithms across 40 complex tasks from the DeepMind Control, MetaWorld, and MyoSuite benchmarks. BRO is the first model-free algorithm to achieve near-optimal policies in the notoriously challenging Dog and Humanoid tasks.
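To make the "bigger, regularized" ingredient concrete, below is a minimal PyTorch sketch of a scaled critic regularized with layer normalization and residual connections. The class names, hidden width, and block count here are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual MLP block with LayerNorm regularization (illustrative sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.LayerNorm(dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
            nn.LayerNorm(dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip connection keeps gradients well-behaved as depth/width grow.
        return x + self.net(x)

class ScaledCritic(nn.Module):
    """A wide, regularized Q-network: Q(s, a) -> scalar (hypothetical sizes)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 1024, blocks: int = 2):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
        )
        self.blocks = nn.Sequential(*[ResidualBlock(hidden) for _ in range(blocks)])
        self.head = nn.Linear(hidden, 1)

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        x = self.stem(torch.cat([obs, act], dim=-1))
        return self.head(self.blocks(x))

# Usage on toy dimensions: a batch of 32 state-action pairs.
critic = ScaledCritic(obs_dim=24, act_dim=6)
q = critic(torch.randn(32, 24), torch.randn(32, 6))
print(q.shape)  # torch.Size([32, 1])
```

The design intuition this sketch captures is that normalization layers stabilize the larger critic's optimization, which is what allows capacity scaling to pay off rather than destabilize training.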