Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power-law-plus-constant scaling law. One challenge in extending these results to reinforcement learning is that the main performance objective of interest, mean episode return, need not vary smoothly. To overcome this, we introduce *intrinsic performance*, a monotonic function of the return defined as the minimum compute required to achieve the given return across a family of models of different sizes. We find that, across a range of environments, intrinsic performance scales as a power law in model size and environment interactions. Consequently, as in generative modeling, the optimal model size scales as a power law in the training compute budget. Furthermore, we study how this relationship varies with the environment and with other properties of the training setup. In particular, using a toy MNIST-based environment, we show that varying the "horizon length" of the task mostly changes the coefficient but not the exponent of this relationship.
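The definition of intrinsic performance can be made concrete with a small sketch. The Python below is illustrative only: the function names, the representation of a learning curve as `(compute, mean_return)` pairs, and the numbers in `curves` are all hypothetical and are not taken from the paper's experiments. It assumes each model size has a learning curve sorted by increasing compute, and it takes the intrinsic performance of a return to be the smallest compute at which any model in the family first reaches that return.

```python
import math

def min_compute_to_reach(curve, target_return):
    """Smallest compute at which one learning curve first reaches target_return.

    `curve` is a list of (compute, mean_return) pairs sorted by compute.
    Returns math.inf if this model never reaches the target.
    """
    for compute, mean_return in curve:
        if mean_return >= target_return:
            return compute
    return math.inf

def intrinsic_performance(curves, target_return):
    """Minimum compute needed to reach target_return across a family of models.

    `curves` maps a model size to that model's learning curve.
    """
    return min(min_compute_to_reach(c, target_return) for c in curves.values())

# Hypothetical learning curves (arbitrary compute units), for illustration only.
curves = {
    "small": [(1, 0.20), (2, 0.50), (4, 0.60), (8, 0.65)],
    "large": [(2, 0.10), (4, 0.55), (8, 0.80), (16, 0.90)],
}

print(intrinsic_performance(curves, 0.50))  # 2: the small model gets there first
print(intrinsic_performance(curves, 0.80))  # 8: only the large model reaches it
print(intrinsic_performance(curves, 0.95))  # inf: no model in the family reaches it
```

Because a higher target return can never be reached earlier than a lower one, the value computed here is non-decreasing in the return, which is the monotonicity the abstract relies on. With real data one would typically interpolate between measured points rather than take the first logged point at or above the target.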