We design and evaluate a Bayesian optimization framework for resource-efficient pre-training of Transformer-based language models (TLMs). TLM pre-training is computationally expensive and leaves many design choices unresolved, such as the selection of pre-training hyperparameters. We propose a multi-armed bandit framework for the sequential selection of TLM pre-training hyperparameters, aimed at optimizing language model performance in a resource-efficient manner. We design a Thompson sampling algorithm that uses a Gaussian process as a surrogate reward model of the Masked Language Model (MLM) pre-training objective and minimizes that objective sequentially. Instead of MLM pre-training with fixed masking probabilities, the proposed Gaussian process-based Thompson sampling (GP-TS) accelerates pre-training by sequentially selecting masking hyperparameters that improve performance. We empirically demonstrate that GP-TS pre-trains language models efficiently, i.e., it achieves lower MLM loss in fewer epochs, across a variety of settings. In addition, GP-TS pre-trained TLMs attain competitive downstream performance while avoiding expensive hyperparameter grid search. GP-TS provides an interactive framework for efficient and optimized TLM pre-training that, by circumventing costly hyperparameter selection, enables substantial computational savings.
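To make the selection loop concrete, the following is a minimal sketch of Gaussian process-based Thompson sampling over a single masking probability. It is an illustration under stated assumptions, not the authors' implementation: the RBF kernel and its length scale, the candidate grid, the noise level, and the `toy_mlm_loss` function (a hypothetical stand-in for the MLM loss observed after one pre-training interaction) are all choices made here for the example.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.1, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_cand, noise=1e-2):
    """Posterior mean and covariance of a zero-mean GP at the candidates."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_oc = rbf_kernel(x_obs, x_cand)
    k_cc = rbf_kernel(x_cand, x_cand)
    mean = k_oc.T @ np.linalg.solve(k_oo, y_obs)
    cov = k_cc - k_oc.T @ np.linalg.solve(k_oo, k_oc)
    return mean, cov


def toy_mlm_loss(p, rng):
    """Hypothetical stand-in for the loss observed after one interaction.

    Assumption for illustration only: a noisy quadratic with its minimum
    at a masking probability of 0.2. A real run would train for a few
    steps with masking probability p and report the validation MLM loss.
    """
    return (p - 0.2) ** 2 + 0.01 * rng.standard_normal()


def gp_ts(n_rounds=30, seed=0):
    """Sequentially pick masking probabilities by Thompson sampling."""
    rng = np.random.default_rng(seed)
    candidates = np.linspace(0.05, 0.5, 46)  # assumed arm grid
    xs, ys = [], []
    for _ in range(n_rounds):
        if not xs:
            # No data yet: pick the first arm uniformly at random.
            idx = int(rng.integers(len(candidates)))
        else:
            x_obs, y_obs = np.array(xs), np.array(ys)
            offset = y_obs.mean()  # center targets for the zero-mean GP
            mean, cov = gp_posterior(x_obs, y_obs - offset, candidates)
            # Symmetrize and add jitter for numerical stability.
            cov = 0.5 * (cov + cov.T) + 1e-8 * np.eye(len(candidates))
            # Draw one plausible loss curve from the posterior ...
            sample = rng.multivariate_normal(
                mean + offset, cov, check_valid="ignore"
            )
            # ... and play the arm that minimizes the sampled loss.
            idx = int(np.argmin(sample))
        p = candidates[idx]
        xs.append(p)
        ys.append(toy_mlm_loss(p, rng))
    return np.array(xs), np.array(ys)


if __name__ == "__main__":
    chosen, losses = gp_ts()
    print("last five masking probabilities:", chosen[-5:])
```

Sampling a whole loss curve from the posterior, rather than acting on the posterior mean, is what lets the procedure keep exploring arms whose loss is still uncertain while favoring arms that have looked good so far.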