Given a budget on total model size, one must decide whether to train a single, large neural network or to combine the predictions of many smaller networks. We study this trade-off for ensembles of random-feature ridge regression models. We prove that when a fixed number of trainable parameters are partitioned among $K$ independently trained models, $K=1$ achieves optimal performance, provided the ridge parameter is optimally tuned. We then derive scaling laws which describe how the test risk of an ensemble of regression models decays with its total size. We identify conditions on the kernel and task eigenstructure under which ensembles can achieve near-optimal scaling laws. Training ensembles of deep convolutional neural networks on CIFAR-10 and a transformer architecture on C4, we find that a single large network outperforms any ensemble of networks with the same total number of parameters, provided the weight decay and feature-learning strength are tuned to their optimal values.
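The parameter-budget trade-off studied here can be illustrated with a minimal sketch (not the paper's code): a fixed budget of random features is split among $K$ independently trained ridge regressors whose predictions are averaged. The ReLU feature map, the synthetic target, the budget of 512 features, and the ridge value $\lambda = 10^{-2}$ are all illustrative assumptions; in the paper's setting the ridge parameter would be tuned optimally for each $K$.

```python
# Sketch: single large random-feature ridge model vs. an ensemble of K smaller
# ones under a fixed total-feature budget. All sizes and the task are assumed.
import numpy as np

rng = np.random.default_rng(0)

def random_features(X, W):
    """ReLU random-feature map phi(x) = relu(W x)."""
    return np.maximum(X @ W.T, 0.0)

def fit_ridge(Phi, y, lam):
    """Closed-form ridge regression weights."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Synthetic regression task (assumed for illustration).
d, n_train, n_test = 20, 500, 2000
X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
beta = rng.standard_normal(d)
y_train = np.sin(X_train @ beta)
y_test = np.sin(X_test @ beta)

total_features = 512   # fixed total parameter budget
lam = 1e-2             # ridge parameter (fixed here; tuned per K in the paper's analysis)

for K in [1, 2, 4, 8]:
    n_feat = total_features // K
    preds = np.zeros(n_test)
    for _ in range(K):
        # Each ensemble member draws its own independent random features.
        W = rng.standard_normal((n_feat, d)) / np.sqrt(d)
        Phi_tr = random_features(X_train, W)
        Phi_te = random_features(X_test, W)
        w = fit_ridge(Phi_tr, y_train, lam)
        preds += Phi_te @ w / K   # average the ensemble members' predictions
    print(f"K={K:2d}  test risk = {np.mean((preds - y_test) ** 2):.4f}")
```

With the ridge parameter held fixed as above, the ranking of ensemble sizes can go either way; the paper's claim concerns the optimally tuned ridge, under which $K=1$ is optimal.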