Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models

Large pre-trained neural language models have brought immense progress to both NLP and software engineering. Models in OpenAI's GPT series now dwarf Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide range of NLP applications. These models are trained on massive corpora of heterogeneous data from web crawls, which enables them to learn general language patterns and semantic relationships. However, the largest models are expensive to train and deploy and are often closed-source, so we lack access to their data and design decisions. We argue that this trend towards large, general-purpose models should be complemented with single-purpose, more modestly sized pre-trained models. In this work, we take StackOverflow (SO) as a domain example in which large volumes of rich, aligned code and text data are available. We adopt standard practices for pre-training large language models, including a very large context size (2,048 tokens), batch size (0.5M tokens) and training set (27B tokens), coupled with a powerful toolkit (Megatron-LM), to train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $187 and $800, respectively. We compare the performance of our models with the previous SOTA model trained exclusively on SO data, as well as with general-purpose BERT models and OpenAI's ChatGPT, on four SO-specific downstream tasks: question quality prediction, closed question prediction, named entity recognition and obsoletion prediction (a new task we introduce). Not only do our models consistently outperform all baselines, but the smaller model is often sufficient for strong results. Both models are released to the public. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
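The training figures quoted above imply some simple scale arithmetic, sketched below. This is only a back-of-the-envelope illustration, not the authors' configuration: it assumes "0.5M tokens" means exactly 500,000 and "27B tokens" means exactly 27,000,000,000, that every sequence fills the full 2,048-token context, and that the training set is traversed once. The function names are hypothetical.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the abstract.
# Assumptions (not stated in the source): exact round values for the
# batch and training-set sizes, fully packed sequences, a single pass.
CONTEXT_TOKENS = 2_048             # context size from the abstract
BATCH_TOKENS = 500_000             # "0.5M tokens", assumed exact
TRAIN_TOKENS = 27_000_000_000      # "27B tokens", assumed exact


def sequences_per_batch(batch_tokens: int, context_tokens: int) -> int:
    """Number of full-length sequences that fit in one batch."""
    return batch_tokens // context_tokens


def steps_per_pass(train_tokens: int, batch_tokens: int) -> int:
    """Optimizer steps needed to see every training token once."""
    return train_tokens // batch_tokens


if __name__ == "__main__":
    print(sequences_per_batch(BATCH_TOKENS, CONTEXT_TOKENS))  # 244
    print(steps_per_pass(TRAIN_TOKENS, BATCH_TOKENS))         # 54000
```

Under these assumptions, one batch holds roughly 244 full-length sequences and a single pass over the data takes about 54,000 steps; the actual values depend on how the authors rounded and packed their data.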
