Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To address this, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. BabelCode enables new investigations into the qualitative performance of models, including memory, runtime, and individual test case results. Additionally, we present a new code translation dataset called Translating Python Programming Puzzles (TP3), derived from the Python Programming Puzzles (Schuster et al. 2021) benchmark, that involves translating expert-level Python functions to any language. With both BabelCode and the TP3 benchmark, we investigate whether balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages. Training a model on a balanced corpus results in, on average, 12.34% higher $pass@k$ across all tasks and languages compared to the baseline. We find that this strategy achieves 66.48% better $pass@k$ on low-resource languages at the cost of only a 12.94% decrease on high-resource languages. In our three translation tasks, this strategy yields, on average, 30.77% better low-resource $pass@k$ while having 19.58% worse high-resource $pass@k$.