Large language models for code (code LLMs) have made tremendous progress in recent years. Alongside this rapid development, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the performance of code LLMs, with a particular focus on code generation tasks. However, these benchmarks are insufficient to cover the full range of capabilities expected of code LLMs, which extend beyond code generation to answering diverse coding-related questions. To fill this gap, we propose InfiBench, to our knowledge the first large-scale freeform question-answering (QA) benchmark for code, comprising 234 carefully selected, high-quality Stack Overflow questions spanning 15 programming languages. InfiBench uses four types of model-free automatic metrics to evaluate response correctness, with the criteria for each question carefully concretized by domain experts. We conduct a systematic evaluation of more than 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings. Our detailed analyses point to promising directions for further advancing code LLMs. InfiBench is fully open source at https://infi-coder.github.io/infibench and is continuously expanding to foster more scientific and systematic practices for code LLM evaluation.
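To give a concrete sense of what a model-free automatic metric might look like, below is a minimal sketch of one plausible metric type, weighted keyword matching, which scores a freeform response by the expected keywords it contains. The function name, weighting scheme, and example keywords are illustrative assumptions for this sketch, not the benchmark's actual implementation.

```python
# Minimal sketch of a keyword-matching metric in the spirit of model-free
# response scoring. All names and weights here are illustrative assumptions.
from typing import List, Tuple

def keyword_match_score(response: str, keywords: List[Tuple[str, float]]) -> float:
    """Score a response by the weighted fraction of expected keywords it
    contains (case-insensitive substring match). Returns a value in [0, 1]."""
    total = sum(w for _, w in keywords)
    hit = sum(w for kw, w in keywords if kw.lower() in response.lower())
    return hit / total if total > 0 else 0.0

# Example: a question whose reference answer should mention functools.lru_cache.
response = "You can memoize the function with functools.lru_cache."
keywords = [("lru_cache", 0.6), ("functools", 0.4)]
print(keyword_match_score(response, keywords))  # 1.0
```

A metric of this shape is cheap and deterministic, which is what makes model-free scoring attractive for large-scale evaluation; the actual benchmark combines several such metric types with expert-written, per-question criteria.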