How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies with a fixed number of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, which conflates the amount of compute with the number of parameters. In this paper, we formalize compute-bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that this architecture (1) achieves stronger performance simply by using more compute, and (2) generalizes better to longer-horizon test tasks than standard feedforward networks or deep residual networks using up to 5 times more parameters.
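To make the distinction between parameters and compute concrete, the following is a minimal sketch and not the architecture proposed in this paper. It assumes a weight-tied recurrent block applied `n_iters` times to a policy's hidden state; all names (`make_params`, `policy_logits`, `count_multiply_adds`) and the `tanh` update rule are hypothetical choices made for illustration only.

```python
import math
import random


def make_params(obs_dim, hidden_dim, n_actions, seed=0):
    """Create one fixed set of weights, shared by every iteration."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        scale = 1.0 / math.sqrt(cols)
        return [[rng.uniform(-scale, scale) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "W_in": matrix(hidden_dim, obs_dim),      # encodes the observation
        "W_rec": matrix(hidden_dim, hidden_dim),  # reused at every iteration
        "W_out": matrix(n_actions, hidden_dim),   # reads out action logits
    }


def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def policy_logits(params, obs, n_iters):
    """Apply the same recurrent block n_iters times, then read out logits.

    Assumed update rule (illustrative): h <- tanh(W_in @ obs + W_rec @ h).
    """
    inp = matvec(params["W_in"], obs)
    h = [0.0] * len(inp)
    for _ in range(n_iters):
        rec = matvec(params["W_rec"], h)
        h = [math.tanh(a + b) for a, b in zip(inp, rec)]
    return matvec(params["W_out"], h)


def count_params(params):
    """Number of weights; independent of n_iters."""
    return sum(len(W) * len(W[0]) for W in params.values())


def count_multiply_adds(params, n_iters):
    """Multiply-adds for one forward pass; grows linearly with n_iters."""
    def size(name):
        return len(params[name]) * len(params[name][0])

    return size("W_in") + n_iters * size("W_rec") + size("W_out")
```

With `params = make_params(4, 8, 2)`, `count_params(params)` is 112 regardless of how many iterations are run, while `count_multiply_adds(params, 1)` is 112 and `count_multiply_adds(params, 10)` is 688. In a static feedforward network, by contrast, adding compute typically means adding layers and therefore parameters.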