ComPile: A Large IR Dataset from Production Sources

Code is increasingly becoming a core data modality of modern machine learning research, impacting not only the way we write code with conversational agents like OpenAI's ChatGPT, Google's Bard, or Anthropic's Claude and the way we translate code from one language into another, but also the compiler infrastructure underlying the language. While modeling approaches may vary and representations differ, the targeted tasks often remain the same within the individual classes of models. Relying solely on the ability of modern models to extract information from unstructured code ignores the structure inherent to programs during data collection, and so fails to take advantage of 70 years of programming language and compiler development. This detracts from the performance of models working over a tokenized representation of input code and precludes the use of these models in the compiler itself. To work towards the first intermediate representation (IR) based models, we use the LLVM compiler infrastructure, which is shared by a number of languages, to generate a 182B token dataset of LLVM IR. We generated this dataset from programming languages built on LLVM, including Rust, Swift, Julia, and C/C++, by hooking into LLVM code generation either through the language's package manager or through the compiler directly, extracting intermediate representations from production-grade programs. Statistical analysis demonstrates the utility of our dataset not only for large language model training but also for introspection into the code generation process itself, and the dataset shows great promise for machine-learned compiler components.
