To evaluate code large language models (LLMs), research has relied on a few small, manually curated benchmarks, such as HumanEval and MBPP, which cover only a narrow slice of real-world software domains. In this work, we introduce round-trip correctness (RTC) as an alternative evaluation method. RTC enables evaluation of code LLMs on a broader spectrum of real-world software domains without costly human curation. RTC rests on the idea that we can ask a model to make a prediction (e.g., describe some code in natural language), feed that prediction back (e.g., synthesize code from the predicted description), and check whether this round trip yields code that is semantically equivalent to the original input. We show how to employ RTC to evaluate code synthesis and editing. We find that RTC strongly correlates with model performance on existing narrow-domain code synthesis benchmarks, while allowing us to expand to a much broader set of domains and tasks, which was not previously possible without costly human annotations.
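To make the round-trip idea concrete, the following is a minimal sketch rather than the method as implemented in this work. It assumes that the forward step (code to description) and the backward step (description to code) each return several samples, and that semantic equivalence is approximated by agreement on a handful of inputs. The names `round_trip_correctness`, `passes_same_tests`, `toy_forward`, and `toy_backward` are hypothetical, and the two toy functions stand in for real model calls.

```python
from typing import Callable, Sequence


def round_trip_correctness(
    originals: Sequence[str],
    forward: Callable[[str], Sequence[str]],
    backward: Callable[[str], Sequence[str]],
    equivalent: Callable[[str, str], bool],
) -> float:
    """Fraction of round trips whose output is judged equivalent to the input."""
    passed = 0
    total = 0
    for code in originals:
        for description in forward(code):            # code -> natural language
            for candidate in backward(description):  # natural language -> code
                total += 1
                passed += int(equivalent(code, candidate))
    return passed / total if total else 0.0


def passes_same_tests(original: str, candidate: str) -> bool:
    """Assumed proxy for semantic equivalence: both snippets define `f`
    and agree on a few inputs. exec() is unsafe on untrusted code; a real
    harness would run candidates in a sandbox."""
    inputs = [0, 1, 2, 5]
    results = []
    for source in (original, candidate):
        namespace: dict = {}
        try:
            exec(source, namespace)
            results.append([namespace["f"](x) for x in inputs])
        except Exception:
            return False
    return results[0] == results[1]


def toy_forward(code: str) -> list[str]:
    """Hypothetical stand-in for a model describing code."""
    return ["double the input"] if "* 2" in code else ["add one to the input"]


def toy_backward(description: str) -> list[str]:
    """Hypothetical stand-in for a model synthesizing code from a description."""
    if "double" in description:
        return ["def f(x):\n    return x + x", "def f(x):\n    return x * 3"]
    return ["def f(x):\n    return x + 1", "def f(x):\n    return 1 + x"]


originals = ["def f(x):\n    return x * 2", "def f(x):\n    return x + 1"]
score = round_trip_correctness(
    originals, toy_forward, toy_backward, passes_same_tests
)
print(score)
```

In this toy setup, three of the four regenerated candidates agree with their originals on the test inputs, so the score is 0.75. Because the score depends only on the original code and an equivalence check, no human-written description or reference solution is required, which is what allows the approach to extend beyond curated benchmarks.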