Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of software engineering and coding tasks. However, their application to code and compiler optimization remains underexplored. Training LLMs is resource-intensive, requiring substantial GPU hours and extensive data collection, which can be prohibitive. To address this gap, we introduce Meta Large Language Model Compiler (LLM Compiler), a suite of robust, openly available, pre-trained models specifically designed for code optimization tasks. Built on the foundation of Code Llama, LLM Compiler enhances the understanding of compiler intermediate representations (IRs), assembly language, and optimization techniques. The models were trained on a corpus of 546 billion tokens of LLVM-IR and assembly code and then instruction fine-tuned to interpret compiler behavior. LLM Compiler is released under a bespoke commercial license to allow wide reuse and is available in two sizes: 7 billion and 13 billion parameters. We also present fine-tuned versions of the model, demonstrating its enhanced capabilities in optimizing code size and disassembling x86_64 and ARM assembly back into LLVM-IR. These achieve 77% of the optimizing potential of an autotuning search and a 45% disassembly round-trip rate (14% exact match). This release aims to provide a scalable, cost-effective foundation for further research and development in compiler optimization by both academic researchers and industry practitioners.