Large language models (LLMs) have demonstrated strong capabilities in code generation, underscoring the critical need for rigorous and comprehensive evaluation. Existing evaluation approaches fall into three categories: human-centered, metric-based, and LLM-based. Because human-centered approaches are labour-intensive and metric-based ones rely heavily on reference answers, LLM-based approaches are gaining increasing attention for their stronger contextual understanding. However, they generally evaluate generated code with static prompts and tend to fail in complex code scenarios, which typically involve multiple requirements and demand richer contextual information. In addition, these approaches lack fine-grained evaluation for complex code, resulting in limited explainability. To mitigate these limitations, we propose CodeVisionary, the first agent-based evaluation framework for complex code generation. CodeVisionary consists of two stages: (1) a requirement-guided multi-dimensional context distillation stage and (2) a fine-grained scoring and summarization stage. A comprehensive evaluation report is also generated for enhanced explainability. For validation, we construct a new benchmark of 363 samples spanning 37 coding scenarios and 23 programming languages. Extensive experiments demonstrate that CodeVisionary achieves the best performance in evaluating complex code generation, outperforming the best of three baselines by average improvements of 0.217, 0.163, and 0.141 in Pearson, Spearman, and Kendall-Tau coefficients, respectively. The resources of CodeVisionary are available at https://github.com/Eshe0922/CodeVisionary.
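As a point of reference for the reported agreement metrics, the sketch below shows how Pearson, Spearman, and Kendall-Tau coefficients between an evaluator's scores and human ratings are typically computed with SciPy. The score arrays are hypothetical placeholders, not data from the paper, and this is not CodeVisionary's implementation.

```python
# Minimal sketch: computing the three agreement coefficients named above.
# The score lists are hypothetical; real values would come from an automated
# evaluator (e.g., CodeVisionary) and human annotators on the same samples.
from scipy.stats import pearsonr, spearmanr, kendalltau

evaluator_scores = [3.5, 4.0, 2.0, 4.5, 1.5, 3.0]  # hypothetical model scores
human_ratings = [3.0, 4.5, 2.5, 4.0, 1.0, 3.5]     # hypothetical human ratings

pearson, _ = pearsonr(evaluator_scores, human_ratings)    # linear correlation
spearman, _ = spearmanr(evaluator_scores, human_ratings)  # rank correlation
kendall, _ = kendalltau(evaluator_scores, human_ratings)  # ordinal agreement

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}, "
      f"Kendall-Tau: {kendall:.3f}")
```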