Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power-law-plus-constant scaling law. One challenge in extending these results to reinforcement learning is that the main performance objective of interest, mean episode return, need not vary smoothly. To overcome this, we introduce *intrinsic performance*, a monotonic function of the return, defined as the minimum compute required to achieve a given return across a family of models of different sizes. We find that, across a range of environments, intrinsic performance scales as a power law in model size and environment interactions. Consequently, as in generative modeling, the optimal model size scales as a power law in the training compute budget. Furthermore, we study how this relationship varies with the environment and with other properties of the training setup. In particular, using a toy MNIST-based environment, we show that varying the "horizon length" of the task mostly changes the coefficient but not the exponent of this relationship.
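To make the definition of intrinsic performance concrete, the following is a minimal sketch that computes it from learning curves. Everything in it is an assumption made for illustration: the function name `intrinsic_performance`, the representation of a learning curve as a list of `(compute, mean_return)` pairs, the use of the first raw crossing point with no smoothing or interpolation, and the numbers in `curves`, which are invented and are not results from the paper.

```python
import math


def intrinsic_performance(target_return, curves):
    """Return the minimum compute at which any model reaches target_return.

    `curves` maps a (hypothetical) model name to a list of
    (compute, mean_return) pairs sorted by increasing compute.
    Returns math.inf if no model in the family reaches the target.
    """
    best = math.inf
    for points in curves.values():
        for compute, mean_return in points:
            if mean_return >= target_return:
                # First point on this curve that reaches the target.
                best = min(best, compute)
                break
    return best


# Invented learning curves for two model sizes (illustration only).
curves = {
    "small": [(1.0, 2.0), (2.0, 5.0), (4.0, 6.0), (8.0, 6.5)],
    "large": [(2.0, 1.0), (4.0, 5.5), (8.0, 9.0), (16.0, 12.0)],
}

for target in (5.0, 6.0, 9.0):
    print(target, intrinsic_performance(target, curves))
```

In this toy data the small model is the cheapest way to reach low returns, while only the large model reaches a return of 9.0, so the minimum is taken over different models at different targets. Because a higher target can only be reached at the same or greater compute, the resulting quantity is a monotonic function of the return, as the definition requires.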