With easier access to powerful compute resources, AI for software development is trending toward ever-larger language models (LLMs) that address a variety of programming tasks. Even LLMs applied to tasks in the high-performance computing (HPC) domain are huge (e.g., billions of parameters) and demand expensive compute resources for training. We find this design choice puzzling: why do HPC-specific tasks need large models trained on natural languages and on programming languages unrelated to HPC? In this line of work, we question the design choices made by existing LLMs by developing smaller LLMs for specific domains, which we call domain-specific LLMs. Specifically, we start with HPC as a domain and propose a novel tokenizer named Tokompiler, designed for preprocessing code in HPC and compilation-centric tasks. Tokompiler leverages knowledge of language primitives to generate language-oriented tokens, providing a context-aware understanding of code structure while completely avoiding the human semantics attributed to code structures. We use Tokompiler to pre-train two state-of-the-art models, SPT-Code and Polycoder, on a Fortran code corpus mined from GitHub, and evaluate these models against conventional LLMs. Results demonstrate that, compared to traditional tokenizers, Tokompiler significantly enhances code completion accuracy and semantic understanding in normalized-perplexity tests, reaching a perplexity score of ~1. This research opens avenues for further advancements in domain-specific LLMs that cater to the unique demands of HPC and compilation tasks.
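To make the idea of language-oriented tokens that avoid human semantics more concrete, the following is a minimal sketch of one plausible reading: language keywords are kept, while programmer-chosen names and numeric literals are replaced by anonymous placeholders. This is an illustration under stated assumptions, not the Tokompiler implementation. The function name `anonymize_tokens`, the `var_N`/`num_N` placeholder scheme, the keyword subset, and the regex-based lexing are all hypothetical choices made for this example.

```python
import re

# Tiny illustrative subset of Fortran keywords (an assumption, not a full list).
FORTRAN_KEYWORDS = {
    "program", "end", "do", "if", "then", "else", "integer", "real",
    "implicit", "none", "print", "call", "subroutine", "function",
}

# Matches identifiers, simple numeric literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d*)?|\S")


def anonymize_tokens(source: str) -> list[str]:
    """Tokenize `source`, replacing names and numbers with placeholder tokens.

    Hypothetical helper: keywords and punctuation pass through unchanged,
    each distinct identifier becomes var_N, each distinct number becomes num_N.
    """
    names: dict[str, str] = {}
    numbers: dict[str, str] = {}
    tokens = []
    for match in TOKEN_RE.finditer(source):
        text = match.group(0)
        lowered = text.lower()  # Fortran is case-insensitive
        if lowered in FORTRAN_KEYWORDS:
            tokens.append(lowered)
        elif text[0].isalpha() or text[0] == "_":
            # Reuse the same placeholder for repeated occurrences of a name.
            tokens.append(names.setdefault(lowered, f"var_{len(names)}"))
        elif text[0].isdigit():
            tokens.append(numbers.setdefault(text, f"num_{len(numbers)}"))
        else:
            tokens.append(text)
    return tokens


snippet = "do i = 1, n\n  total = total + a(i)\nend do"
print(anonymize_tokens(snippet))
# ['do', 'var_0', '=', 'num_0', ',', 'var_1', 'var_2', '=', 'var_2', '+',
#  'var_3', '(', 'var_0', ')', 'end', 'do']
```

In this sketch the names `total` and `a` carry no information into the token stream, yet repeated uses of the same name still map to the same placeholder, so the structure of the loop is preserved. A real tokenizer for this purpose would likely work from a parser rather than a regular expression.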