Large pre-trained neural language models have brought immense progress to both NLP and software engineering. Models in OpenAI's GPT series now dwarf Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide range of NLP applications. These models are trained on massive corpora of heterogeneous data from web crawls, which enables them to learn general language patterns and semantic relationships. However, the largest models are expensive to train and deploy and are often closed-source, so we lack access to their data and design decisions. We argue that this trend towards large, general-purpose models should be complemented with single-purpose, more modestly sized pre-trained models. In this work, we take StackOverflow (SO) as a domain example in which large volumes of rich, aligned code and text data are available. We adopt standard practices for pre-training large language models, including a very large context size (2,048 tokens), batch size (0.5M tokens) and training set (27B tokens), coupled with a powerful toolkit (Megatron-LM), to train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $187 and $800, respectively. We compare the performance of our models with the previous SOTA model trained exclusively on SO data, general-purpose BERT models, and OpenAI's ChatGPT on four SO-specific downstream tasks: question quality prediction, closed question prediction, named entity recognition, and obsoletion prediction (a new task we introduce). Not only do our models consistently outperform all baselines, but the smaller model is often sufficient for strong results. Both models are released to the public. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
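To give a feel for the training scale quoted above, the following minimal sketch works through the arithmetic implied by the stated context size, batch size, and training set size. It is an illustration only: the abstract does not say how many passes over the data were made, and "0.5M" and "27B" are read here as exactly 500,000 and 27,000,000,000 tokens, which is an assumption (a power-of-two batch such as 524,288 tokens would give slightly different numbers). The function names are hypothetical.

```python
# Figures quoted in the abstract; the exact values are assumptions (see lead-in).
CONTEXT_TOKENS = 2_048           # context size in tokens
BATCH_TOKENS = 500_000           # "0.5M tokens", read as exactly 500,000
TRAIN_TOKENS = 27_000_000_000    # "27B tokens", read as exactly 27e9


def sequences_per_batch(batch_tokens: int, context_tokens: int) -> float:
    """Number of full-length sequences that fit into one batch."""
    return batch_tokens / context_tokens


def steps_per_pass(train_tokens: int, batch_tokens: int) -> int:
    """Optimizer steps needed to see every training token once."""
    return train_tokens // batch_tokens


if __name__ == "__main__":
    print(sequences_per_batch(BATCH_TOKENS, CONTEXT_TOKENS))  # about 244 sequences
    print(steps_per_pass(TRAIN_TOKENS, BATCH_TOKENS))         # 54000 steps
```

Under these assumptions, one pass over the training set corresponds to roughly 54,000 steps of about 244 full-length sequences each.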