To extend the scope of coding queries to more realistic settings, we propose ODEX, the first Open-Domain EXecution-based natural language (NL) to Python code generation dataset. ODEX has 945 NL-Code pairs spanning 79 diverse libraries, along with 1,707 human-written test cases for execution. Our NL-Code pairs are harvested from StackOverflow forums to encourage natural and practical coding queries. Moreover, ODEX supports intents in four natural languages: English, Spanish, Japanese, and Russian. ODEX unveils intriguing behavioral differences among top-performing code language models (LMs). While CODEX achieves better overall results, CODEGEN improves effectively via scaling: CODEGEN 6.1B performs comparably with CODEX 12B. Both models show substantial gaps between open and closed domains, but CODEGEN gaps tend to decrease with model size while CODEX gaps increase. We release ODEX to facilitate research into open-domain problems for the code generation community.
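As a rough illustration of what execution-based evaluation means in this setting, the sketch below runs a generated snippet against a test case and reports the fraction of samples that pass. It is a minimal sketch under stated assumptions, not the ODEX harness: the helper names `passes_tests` and `pass_rate`, and the toy samples, are hypothetical and chosen only for illustration.

```python
def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Return True if the candidate executes and every test assertion holds.

    Hypothetical helper for illustration; not part of any released ODEX tooling.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the generated function(s)
        exec(test_code, namespace)       # run the assertions against them
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False
    return True


def pass_rate(samples: list[tuple[str, str]]) -> float:
    """Fraction of (candidate_code, test_code) pairs whose tests pass."""
    if not samples:
        return 0.0
    passed = sum(passes_tests(code, test) for code, test in samples)
    return passed / len(samples)


# Toy samples (assumed for illustration): one correct and one incorrect candidate.
samples = [
    ("def f(x):\n    return x + 1", "assert f(1) == 2"),
    ("def f(x):\n    return x", "assert f(1) == 2"),
]
rate = pass_rate(samples)
```

Because `exec` runs arbitrary code, a real evaluation would likely isolate each candidate in a sandboxed process with a timeout; this sketch omits both for brevity.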