To evaluate code large language models (LLMs), research has relied on a few small, manually curated benchmarks, such as HumanEval and MBPP, which cover only a narrow part of real-world software domains. In this work, we introduce round-trip correctness (RTC) as an alternative evaluation method. RTC allows code LLMs to be evaluated on a broader spectrum of real-world software domains without the need for costly human curation. RTC rests on the idea that we can ask a model to make a prediction (e.g., describe some code using natural language), feed that prediction back (e.g., synthesize code from the predicted description), and check whether this round trip leads to code that is semantically equivalent to the original input. We show how to employ RTC to evaluate code synthesis and editing. We find that RTC strongly correlates with model performance on existing narrow-domain code synthesis benchmarks, while allowing us to expand to a much broader set of domains and tasks, which was not previously possible without costly human annotations.
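As a minimal sketch of the round-trip idea, the following Python example scores one code snippet. Everything named here is an assumption made for illustration: `round_trip_correctness`, `same_behavior`, and the two stub models are hypothetical, and comparing outputs on a handful of inputs is only a proxy for semantic equivalence, not necessarily the check used in this work.

```python
from typing import Callable, Sequence


def round_trip_correctness(
    code: str,
    forward: Callable[[str], Sequence[str]],
    backward: Callable[[str], Sequence[str]],
    equivalent: Callable[[str, str], bool],
) -> float:
    """Fraction of round trips judged equivalent to the original code.

    forward:    code -> sampled descriptions (hypothetical model call)
    backward:   description -> sampled code candidates (hypothetical model call)
    equivalent: proxy check for semantic equivalence
    """
    results = []
    for description in forward(code):
        for candidate in backward(description):
            results.append(equivalent(code, candidate))
    return sum(results) / len(results) if results else 0.0


def same_behavior(original, candidate, func_name="add",
                  inputs=((1, 2), (0, 0), (-3, 5))):
    """Assumed equivalence proxy: both snippets agree on a few test inputs."""
    def load(src):
        namespace = {}
        exec(src, namespace)  # illustration only; never exec untrusted code unsandboxed
        return namespace[func_name]

    try:
        f, g = load(original), load(candidate)
        return all(f(*args) == g(*args) for args in inputs)
    except Exception:
        # Candidates that fail to parse or crash count as non-equivalent.
        return False


ORIGINAL = "def add(a, b):\n    return a + b\n"


def stub_forward(code):
    # Stand-in for a model that describes code in natural language.
    return ["Return the sum of two numbers."]


def stub_backward(description):
    # Stand-in for a model that synthesizes code from a description:
    # one correct sample and one incorrect sample.
    return [
        "def add(x, y):\n    return x + y\n",
        "def add(x, y):\n    return x - y\n",
    ]


score = round_trip_correctness(ORIGINAL, stub_forward, stub_backward, same_behavior)
print(score)  # 0.5: one of the two round trips reproduces the original behavior
```

In a real evaluation, the stubs would be replaced by samples from the model under test, and the score would be averaged over many code snippets drawn from the domain of interest.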