With easier access to powerful compute resources, AI for software development is trending toward ever-larger language models (LLMs) that address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge (e.g., billions of parameters) and demand expensive compute resources for training. We find this design choice puzzling: why do HPC-specific tasks need large LLMs trained on natural languages and programming languages unrelated to HPC? In this line of work, we question the design choices made by existing LLMs by developing smaller LLMs for specific domains, which we call domain-specific LLMs. Specifically, we start with HPC as a domain and propose a novel tokenizer named Tokompiler, designed specifically for preprocessing code in HPC and compilation-centric tasks. Tokompiler leverages knowledge of language primitives to generate language-oriented tokens, providing a context-aware understanding of code structure while completely avoiding the human semantics attributed to code structures. We applied Tokompiler to pre-train two state-of-the-art models, SPT-Code and Polycoder, on a Fortran code corpus mined from GitHub, and we evaluate these models against conventional LLMs. Results demonstrate that, in normalized-perplexity tests, Tokompiler significantly enhances code completion accuracy and semantic understanding compared to traditional tokenizers, reaching a perplexity score as low as ~1. This research opens avenues for further advancements in domain-specific LLMs, catering to the unique demands of HPC and compilation tasks.
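To make the idea of language-oriented tokens that carry no human semantics more concrete, the following is a minimal sketch of one way such preprocessing could look: identifiers chosen by the programmer are replaced with neutral placeholders, while language keywords, literals, and punctuation are kept. It is an illustration only, not the Tokompiler implementation, whose details are not given here. The function name `anonymize`, the `var_N` placeholder scheme, the regular expression, and the deliberately tiny `FORTRAN_KEYWORDS` set are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately incomplete keyword set (an assumption for this
# sketch); a real tool would rely on a full Fortran parser instead.
FORTRAN_KEYWORDS = {
    "program", "end", "do", "if", "then", "else", "call",
    "subroutine", "function", "integer", "real", "implicit", "none",
}

# Split source into identifiers/keywords, simple numeric literals,
# and single non-whitespace characters (operators, punctuation).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d*)?|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def anonymize(code):
    """Return (tokens, mapping) with user-chosen names replaced by var_N."""
    mapping = {}  # original identifier -> placeholder
    tokens = []
    for tok in TOKEN_RE.findall(code):
        low = tok.lower()  # Fortran is case-insensitive
        if IDENT_RE.fullmatch(tok) and low not in FORTRAN_KEYWORDS:
            # First occurrence of a name gets the next free placeholder.
            if low not in mapping:
                mapping[low] = f"var_{len(mapping)}"
            tokens.append(mapping[low])
        else:
            # Keywords, literals, and punctuation are kept as they are.
            tokens.append(low)
    return tokens, mapping


if __name__ == "__main__":
    snippet = "do i = 1, n\n  total = total + x(i)\nend do"
    tokens, mapping = anonymize(snippet)
    print(" ".join(tokens))
    print(mapping)
```

In this sketch, names such as `total` or `x` no longer reach the model as natural-language words; only the loop structure and the consistent reuse of each placeholder remain. A side effect of this kind of scheme is a much smaller vocabulary than a general-purpose subword tokenizer needs.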