ComPile: A Large IR Dataset from Production Sources

Code is increasingly becoming a core data modality of modern machine learning research. It affects not only the way we write code with conversational agents like OpenAI's ChatGPT, Google's Bard, or Anthropic's Claude and the way we translate code from one language into another, but also the compiler infrastructure underlying the language. While modeling approaches may vary and representations differ, the targeted tasks often remain the same within the individual classes of models. Relying solely on the ability of modern models to extract information from unstructured code ignores the structure inherent to programs during data collection, and so fails to take advantage of 70 years of programming language and compiler development. This detracts from the performance of models working over a tokenized representation of input code and precludes the use of these models in the compiler itself. To work towards the first intermediate representation (IR) based models, we use the LLVM compiler infrastructure, which is shared by a number of languages, to generate a 182B-token dataset of LLVM IR. We generated this dataset from programming languages built on the shared LLVM infrastructure, including Rust, Swift, Julia, and C/C++. To extract the IR of production-grade programs, we hook into LLVM code generation, either through the language's package manager or through the compiler directly. Statistical analysis demonstrates the utility of our dataset not only for large language model training, but also for introspection into the code generation process itself, and the dataset shows great promise for machine-learned compiler components.
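As a rough illustration of what emitting LLVM IR "through the compiler directly" can look like, the sketch below builds the command line that asks a compiler front end for textual IR. It is not the pipeline used to build the dataset. The helper `ir_command`, the `IR_COMMANDS` table, and the `ir` output directory are hypothetical names chosen for this example, and only the standard `-emit-llvm` (clang), `--emit=llvm-ir` (rustc), and `-emit-ir` (swiftc) flags are assumed.

```python
from pathlib import Path

# Hypothetical table: source suffix -> compiler invocation that emits textual LLVM IR.
# The flags are the standard ones for each front end; everything else is illustrative.
IR_COMMANDS = {
    ".c": ["clang", "-S", "-emit-llvm"],
    ".cpp": ["clang++", "-S", "-emit-llvm"],
    ".rs": ["rustc", "--emit=llvm-ir"],
    ".swift": ["swiftc", "-emit-ir"],
}


def ir_command(source: str, out_dir: str = "ir") -> list[str]:
    """Return the argument list that would emit LLVM IR for one source file."""
    src = Path(source)
    if src.suffix not in IR_COMMANDS:
        raise ValueError(f"no IR command known for suffix {src.suffix!r}")
    # Write <stem>.ll into the (assumed) output directory.
    out = Path(out_dir) / (src.stem + ".ll")
    return [*IR_COMMANDS[src.suffix], src.as_posix(), "-o", out.as_posix()]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where the corresponding compiler is installed. Extraction through a package manager, as the abstract also mentions, would instead forward such flags through the build tool and is not covered by this sketch.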
