With easier access to powerful compute resources, there is a growing trend in AI for software development toward ever-larger language models (LLMs) that address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge and demand expensive compute resources for training. This is partly because these LLMs for HPC tasks are obtained by fine-tuning existing LLMs that support several natural and/or programming languages. We found this design choice confusing: why do we need large LMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we question the choices made by existing LLMs by developing smaller LMs for specific domains, which we call domain-specific LMs. We start with HPC as a domain and build an HPC-specific LM, named MonoCoder, that is orders of magnitude smaller than existing LMs but delivers similar, if not better, performance on non-HPC and HPC tasks. We pre-trained MonoCoder on an HPC-specific dataset (named HPCorpus) of C and C++ programs mined from GitHub, and evaluated its performance against conventional multi-lingual LLMs. Results demonstrate that MonoCoder, although much smaller than existing LMs, achieves similar results on normalized-perplexity tests and much better CodeBLEU scores for high-performance and parallel code generation. Furthermore, fine-tuning the base model for the specific task of parallel code generation (OpenMP parallel for pragmas) yields outstanding results compared to GPT, especially when local misleading semantics are removed by our novel pre-processor, Tokompiler, showcasing the ability of domain-specific models to assist in HPC-relevant tasks.
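For readers unfamiliar with the fine-tuning target, the following is a minimal sketch of what an OpenMP parallel for pragma looks like in C. The function names `scale_add` and `dot_product` are illustrative assumptions and are not taken from HPCorpus or from the MonoCoder evaluation. The sketch only shows the kind of annotation a model would need to produce for a given loop, including the choice of whether a clause such as `reduction` is required.

```c
/* Illustrative sketch only: hypothetical loops of the kind a model
 * might be asked to annotate. Not taken from the paper's dataset. */

/* Independent iterations: a plain "parallel for" is sufficient. */
void scale_add(double *y, const double *x, double alpha, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}

/* Iterations share an accumulator: a reduction clause is needed
 * so that each thread sums privately and the results are combined. */
double dot_product(const double *a, const double *b, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

When compiled without OpenMP support (for example, without `-fopenmp` on GCC), the pragmas are ignored and both functions run serially, which makes the annotated and unannotated versions easy to compare.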