Large pre-trained neural language models have brought immense progress to both NLP and software engineering. Models in OpenAI's GPT series now dwarf Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide range of NLP applications. These models are trained on massive corpora of heterogeneous data from web crawls, which enables them to learn general language patterns and semantic relationships. However, the largest models are both expensive to train and deploy, and are often closed-source, so we lack access to their data and design decisions. We argue that this trend towards large, general-purpose models should be complemented with single-purpose, more modestly sized pre-trained models. In this work, we take StackOverflow (SO) as a domain example in which large volumes of rich, aligned code and text data are available. We adopt standard practices for pre-training large language models, including a very large context size (2,048 tokens), batch size (0.5M tokens) and training set (27B tokens), coupled with a powerful toolkit (Megatron-LM), to train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $\$187$ and $\$800$, respectively. We compare the performance of our models with both the previous SOTA model trained exclusively on SO data and general-purpose BERT models, as well as OpenAI's ChatGPT, on four SO-specific downstream tasks: question quality prediction, closed question prediction, named entity recognition, and obsoletion prediction (a new task we introduce). Not only do our models consistently outperform all baselines, but the smaller model is often sufficient for strong results. Both models are released to the public. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
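As a minimal illustration of how the released checkpoints could be applied to one of the downstream tasks above, the sketch below fine-tunes a SOBert-style encoder for binary question quality prediction with the Hugging Face transformers library; the hub identifier and example post are placeholders, not the actual release names, and the exact loading path may differ from the published artifacts.

```python
# Hedged sketch: loading a released SOBert checkpoint for a classification task.
# "sobert/SOBertBase" is a hypothetical placeholder identifier; substitute the
# repository name from the actual public release.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "sobert/SOBertBase"  # hypothetical hub identifier

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # e.g., good vs. low-quality question
)

# Encode a StackOverflow-style post (mixed text and code), taking advantage of
# the long 2,048-token context the models were pre-trained with.
inputs = tokenizer(
    "How do I reverse a list in Python? `list[::-1]` works but seems slow ...",
    truncation=True,
    max_length=2048,
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs.logits)  # unnormalized scores for the two quality classes
```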