Large language models are typically trained densely: all parameters are updated with respect to all inputs. This requires synchronization of billions of parameters across thousands of GPUs. We introduce a simple but effective method to asynchronously train large, sparse language models on arbitrary text corpora. Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference. This approach generalizes embarrassingly parallel training by automatically discovering the domains for each expert, and eliminates nearly all the communication overhead of existing sparse language models. Our technique outperforms dense baselines on multiple corpora and few-shot tasks, and our analysis shows that specializing experts to meaningful clusters is key to these gains. Performance also improves with the number of experts and size of training data, suggesting this is a highly efficient and accessible approach to training large language models.
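The cluster, train, and ensemble pipeline described above can be illustrated with a deliberately tiny sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the bag-of-words document vectors, the plain k-means with fixed seed documents, the add-one-smoothed unigram "experts", and the distance-based softmax routing with a hypothetical `top_k` parameter. It only aims to show the shape of the idea: experts are trained independently on their own clusters, and only the few experts nearest to the context are mixed at inference.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words counts (toy stand-in for a real document embedding)."""
    return Counter(text.lower().split())

def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))

def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {k: c / len(vectors) for k, c in total.items()}

def kmeans(vectors, seeds, iters=10):
    """Plain k-means; centroids start at the documents indexed by `seeds`."""
    centroids = [dict(vectors[i]) for i in seeds]
    assign = []
    for _ in range(iters):
        assign = [min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        for j in range(len(centroids)):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = mean(members)
    return centroids, assign

def train_unigram(docs, vocab):
    """Toy 'expert': add-one-smoothed unigram model over a shared vocabulary."""
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def ensemble_probs(context, centroids, experts, top_k=1, temperature=1.0):
    """Mix only the top_k experts whose cluster centroids are nearest the context."""
    ctx = bow(context)
    scores = [-sq_dist(ctx, c) / temperature for c in centroids]
    keep = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:top_k]
    top = max(scores[j] for j in keep)
    weights = {j: math.exp(scores[j] - top) for j in keep}  # stable softmax
    z = sum(weights.values())
    return {w: sum(weights[j] / z * experts[j][w] for j in keep) for w in experts[0]}

# Hypothetical four-document corpus with two obvious topics.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]
vectors = [bow(d) for d in corpus]
centroids, assign = kmeans(vectors, seeds=[0, 2])

# Each expert sees only its own cluster, so the experts could be trained
# on separate machines with no communication between them.
vocab = sorted({w for d in corpus for w in d.lower().split()})
experts = [train_unigram([d for d, a in zip(corpus, assign) if a == j], vocab)
           for j in range(len(centroids))]

probs = ensemble_probs("the cat", centroids, experts, top_k=1)
```

With `top_k=1` the context is routed to a single expert, which is the sparsest setting; raising `top_k` trades extra inference cost for a smoother mixture. A real system would replace the unigram models with full language models and the bag-of-words vectors with a stronger document representation.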