Domain-Specific Code Language Models: Unraveling the Potential for HPC Codes and Tasks

Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

With easier access to powerful compute resources, AI for software development is trending toward ever larger language models (LLMs) that address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge and demand expensive compute resources for training. This is partly because these LLMs for HPC tasks are obtained by fine-tuning existing LLMs that support several natural and/or programming languages. We found this design choice confusing: why do HPC-specific tasks need large LMs trained on natural languages and programming languages unrelated to HPC? In this line of work, we question the choices made by existing LLMs by developing smaller LMs for specific domains, which we call domain-specific LMs. We start with HPC as a domain and build an HPC-specific LM, named MonoCoder, that is orders of magnitude smaller than existing LMs but delivers similar, if not better, performance on non-HPC and HPC tasks. We pre-trained MonoCoder on an HPC-specific dataset (named HPCorpus) of C and C++ programs mined from GitHub, and evaluated it against conventional multi-lingual LLMs. Results demonstrate that MonoCoder, although much smaller than existing LMs, achieves similar results on normalized-perplexity tests and much better CodeBLEU results for high-performance and parallel code generation. Furthermore, fine-tuning the base model for the specific task of parallel code generation (OpenMP parallel for pragmas) demonstrates outstanding results compared to GPT, especially when local misleading semantics are removed by our novel pre-processor Tokompiler, showcasing the ability of domain-specific models to assist in HPC-relevant tasks.
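For readers unfamiliar with the perplexity metric mentioned in the abstract, the following is a minimal sketch of plain (unnormalized) perplexity computed from per-token log-probabilities. The abstract does not describe how the authors normalize perplexity across models, so no normalization is attempted here; the function name `perplexity` and the example probabilities are hypothetical and chosen only for illustration.

```python
import math


def perplexity(token_logprobs):
    """Return exp of the negative mean natural-log probability per token.

    `token_logprobs` is a hypothetical list of natural-log probabilities
    that a language model assigned to each token of a code snippet.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)


# Illustrative (made-up) probabilities for four tokens of a C loop header.
example_logprobs = [math.log(p) for p in (0.5, 0.25, 0.5, 0.25)]

if __name__ == "__main__":
    print(perplexity(example_logprobs))
```

Lower values mean the model finds the code less surprising. A model that assigns probability 1/4 to every token has a perplexity of about 4, which is why the metric is often read as an effective branching factor.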
