How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies with a fixed number of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute with the number of parameters. In this paper, we formalize compute-bounded policies and prove that policies that use more compute can solve problems, and generalize to longer-horizon tasks, that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that this architecture $(1)$ achieves stronger performance simply by using more compute, and $(2)$ generalizes better to longer-horizon test tasks than standard feedforward networks or deep residual networks with up to 5 times more parameters.
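To make the notion of "a variable amount of compute at a fixed parameter count" concrete, below is a minimal sketch, not the paper's actual architecture: it assumes a weight-tied residual block whose forward pass can be repeated an arbitrary number of times, so the compute budget (`num_iters`) can change at inference time while the parameters stay fixed. All names and dimensions here are illustrative.

```python
# Hypothetical sketch: a weight-tied recurrent block applied a variable number
# of times, so compute can grow at test time while parameter count stays fixed.
import torch
import torch.nn as nn


class VariableComputePolicy(nn.Module):
    """Policy whose forward pass repeats one shared block `num_iters` times."""

    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.encode = nn.Linear(obs_dim, hidden_dim)
        # One shared (weight-tied) residual block; reusing it adds no parameters.
        self.block = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs: torch.Tensor, num_iters: int = 4) -> torch.Tensor:
        h = torch.relu(self.encode(obs))
        for _ in range(num_iters):   # more iterations = more compute, same parameters
            h = h + self.block(h)    # residual update keeps repeated application stable
        return self.head(h)          # action logits


if __name__ == "__main__":
    policy = VariableComputePolicy(obs_dim=8, act_dim=4)
    obs = torch.randn(2, 8)
    # Same parameters, two different compute budgets at inference time.
    logits_small = policy(obs, num_iters=2)
    logits_large = policy(obs, num_iters=16)
    print(logits_small.shape, logits_large.shape)  # torch.Size([2, 4]) twice
```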