Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. Code LLMs produce impressive results on programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but they struggle with low-resource languages, such as OCaml, Racket, and several others, for which limited training data is available. This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, MultiPL-T, translates training data from high-resource languages into training data for low-resource languages in two steps. 1) We use a Code LLM to synthesize tests for commented code from a high-resource language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate Python code to a target low-resource language, and use the tests to validate the translation. We apply this approach to generate tens of thousands of validated training items for Julia, Lua, OCaml, R, and Racket. Furthermore, we use an open model (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. Using data generated by MultiPL-T, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket. On an established benchmark (MultiPL-E), these models outperform other open Code LLMs. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.
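The two steps described above can be read as a filter-then-validate loop. The following is a minimal sketch of that loop under stated assumptions, not the authors' implementation: the function name `build_training_items`, every callable parameter, and the `min_coverage` default of 0.9 are hypothetical placeholders. In practice the callables would wrap a Code LLM, a coverage tool, and a test runner for the target language.

```python
from typing import Callable, Iterable, List, Tuple


def build_training_items(
    sources: Iterable[str],
    gen_tests: Callable[[str], List[str]],
    test_passes: Callable[[str, str], bool],
    coverage: Callable[[str, List[str]], float],
    translate: Callable[[str], List[str]],
    target_passes: Callable[[str, List[str]], bool],
    min_coverage: float = 0.9,  # assumed threshold; not taken from the source
) -> List[Tuple[str, str]]:
    """Return (source, translation) pairs whose translation passes the tests.

    All callables are hypothetical stand-ins:
      gen_tests(src)            -> candidate tests synthesized for src
      test_passes(src, test)    -> True if the test passes on the original src
      coverage(src, tests)      -> fraction of src exercised by the tests
      translate(src)            -> candidate translations in the target language
      target_passes(cand, tsts) -> True if the candidate passes all tests
    """
    items: List[Tuple[str, str]] = []
    for src in sources:
        # Step 1: synthesize tests, then drop faulty ones (those that fail
        # on the original code).
        tests = [t for t in gen_tests(src) if test_passes(src, t)]
        # Step 1, continued: drop code that the surviving tests cover poorly.
        if not tests or coverage(src, tests) < min_coverage:
            continue
        # Step 2: translate, keeping only candidates validated by the tests.
        for candidate in translate(src):
            if target_passes(candidate, tests):
                items.append((src, candidate))
    return items
```

Passing the model, coverage tool, and test runner in as plain callables keeps the sketch independent of any particular LLM API or target language, so the same loop could be reused for Julia, Lua, OCaml, R, or Racket by swapping the `translate` and `target_passes` arguments.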