Large pre-trained neural language models have brought immense progress to both NLP and software engineering. Models in OpenAI's GPT series now dwarf Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide range of NLP applications. These models are trained on massive corpora of heterogeneous data from web crawls, which enables them to learn general language patterns and semantic relationships. However, the largest models are both expensive to train and deploy, and are often closed-source, so we lack access to their data and design decisions. We argue that this trend towards large, general-purpose models should be complemented with single-purpose, more modestly sized pre-trained models. In this work, we take StackOverflow (SO) as a domain example in which large volumes of rich, aligned code and text data are available. We adopt standard practices for pre-training large language models, including a very large context size (2,048 tokens), batch size (0.5M tokens) and training set (27B tokens), coupled with a powerful toolkit (Megatron-LM), to train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $\$187$ and $\$800$, respectively. We compare the performance of our models with both the previous SOTA model trained exclusively on SO data and general-purpose BERT models, as well as OpenAI's ChatGPT, on four SO-specific downstream tasks: question quality prediction, closed question prediction, named entity recognition, and obsoletion prediction (a new task we introduce). Not only do our models consistently outperform all baselines, but the smaller model is often sufficient for strong results. Both models are released to the public. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
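As a minimal illustration of how the released checkpoints could be applied to one of the downstream tasks above, the sketch below fine-tunes a SOBert-style encoder for binary question quality prediction with the Hugging Face transformers library; the hub identifier and example post are placeholders, not the actual release names, and the exact loading path may differ from the published artifacts.

```python
# Hedged sketch: loading a released SOBert checkpoint for a classification task.
# "sobert/SOBertBase" is a hypothetical placeholder identifier; substitute the
# repository name from the actual public release.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "sobert/SOBertBase"  # hypothetical hub identifier

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # e.g., good vs. low-quality question
)

# Encode a StackOverflow-style post (mixed text and code), taking advantage of
# the long 2,048-token context the models were pre-trained with.
inputs = tokenizer(
    "How do I reverse a list in Python? `list[::-1]` works but seems slow ...",
    truncation=True,
    max_length=2048,
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs.logits)  # unnormalized scores for the two quality classes
```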