How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies with a fixed number of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute with the number of parameters. In this paper, we formalize compute-bounded policies and prove that policies that use more compute can solve problems, and generalize to longer-horizon tasks, that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that this architecture $(1)$ achieves stronger performance simply by using more compute, and $(2)$ generalizes better to longer-horizon test tasks than standard feedforward networks or deep residual networks that use up to 5 times more parameters.
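To illustrate the distinction the abstract draws between parameter count and compute, the sketch below shows one common way to decouple them: a weight-tied residual block applied for a variable number of iterations. This is only an assumed, minimal example, not the architecture proposed in the paper; all class and parameter names (e.g., `WeightTiedPolicy`, `num_iters`) are hypothetical.

```python
# Minimal sketch (assumption, not the paper's architecture): a weight-tied
# block whose iteration count controls compute while parameters stay fixed.
import torch
import torch.nn as nn

class WeightTiedPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.encode = nn.Linear(obs_dim, hidden_dim)
        # A single shared block: reusing it adds compute, not parameters.
        self.block = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs: torch.Tensor, num_iters: int = 4) -> torch.Tensor:
        h = torch.relu(self.encode(obs))
        for _ in range(num_iters):
            h = h + self.block(h)  # residual update with tied weights
        return self.head(h)

# The same fixed-parameter network evaluated under two compute budgets.
policy = WeightTiedPolicy(obs_dim=8, act_dim=4)
obs = torch.randn(1, 8)
logits_cheap = policy(obs, num_iters=2)
logits_costly = policy(obs, num_iters=16)
```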