Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors whose support already contains task-specific experts. We show that in small models such expert solutions occupy a negligible fraction of the volume of this distribution, so that finding them relies on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models the density of task experts increases dramatically, and diverse, task-improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post-training method that samples $N$ parameter perturbations at random, selects the top $K$, and ensembles their predictions via majority vote. Despite its simplicity, this approach is competitive with standard post-training methods such as PPO, GRPO, and ES for contemporary large-scale models.
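The sample, select, and vote procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy linear classifier `predict`, the Gaussian perturbation with scale `sigma`, the accuracy-based selection score, and the name `sample_select_vote` are all assumptions introduced here for illustration.

```python
import numpy as np

def predict(theta, X):
    """Hypothetical stand-in for a model: a binary linear classifier."""
    return (X @ theta > 0).astype(int)

def sample_select_vote(theta0, X_train, y_train, X_test,
                       n=64, k=5, sigma=0.1, seed=0):
    """Sample n perturbations of theta0, keep the top k, majority-vote."""
    rng = np.random.default_rng(seed)
    # Draw n random perturbations around the pretrained weights.
    # Isotropic Gaussian noise is an assumption, not taken from the source.
    candidates = theta0 + sigma * rng.standard_normal((n, theta0.shape[0]))
    # Score every candidate independently; each evaluation needs no
    # information from the others, so this step is fully parallel.
    scores = np.array([(predict(c, X_train) == y_train).mean()
                       for c in candidates])
    # Keep the k best-scoring candidates.
    top_k = candidates[np.argsort(scores)[-k:]]
    # Majority vote over the k predictions (an odd k avoids ties
    # for binary labels).
    votes = np.stack([predict(c, X_test) for c in top_k])
    return (2 * votes.sum(axis=0) > k).astype(int), scores
```

Because no candidate depends on another, the scoring loop could be distributed across workers with no communication until the final selection, which is what makes this kind of method fully parallel in contrast to iterative gradient-based updates.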