As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of real-world coding tasks and developer expectations. To address this gap, we introduce a multi-language benchmark that evaluates the instruction-following capabilities of LLMs and is extensible to any set of standalone coding problems. Our benchmark evaluates instruction following in two key settings: adherence to pre-defined constraints specified with the initial problem, and the ability to perform refinements in response to follow-up instructions. For this paper's analysis, we empirically evaluate our benchmarking pipeline on programming tasks from LiveBench that are automatically translated from Python into Java and JavaScript. Our automated benchmark reveals that models perform unevenly across multiple dimensions of instruction following. Our benchmarking pipeline thus provides a more comprehensive evaluation of code generation models, highlighting their strengths and limitations across languages and generation goals.
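To make the first evaluation setting concrete, the following is a minimal sketch, under simplified assumptions, of how adherence to a pre-defined constraint could be checked alongside functional correctness. The names (`Problem`, `evaluate`, `no_imports`) are hypothetical and do not describe the benchmark's actual interface.

```python
# Hypothetical sketch: Problem, evaluate, and no_imports are illustrative
# assumptions, not the benchmark's actual API.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Problem:
    prompt: str                                   # the coding task shown to the model
    tests: Callable[[str], bool]                  # functional-correctness harness
    constraints: List[Callable[[str], bool]] = field(default_factory=list)  # pre-defined constraints


def evaluate(problem: Problem, generated_code: str) -> dict:
    """Score one model response on functional correctness and constraint adherence."""
    return {
        "functional": problem.tests(generated_code),
        "constraints": [check(generated_code) for check in problem.constraints],
    }


# Example pre-defined constraint: the solution must not import any libraries.
no_imports = lambda code: "import " not in code

# Usage example (the test harness here is a placeholder for real unit tests).
problem = Problem(
    prompt="Return the sum of a list of integers without using imports.",
    tests=lambda code: True,
    constraints=[no_imports],
)
print(evaluate(problem, "def solve(xs):\n    return sum(xs)"))
```

The second setting, refinement on follow-up instructions, would extend this loop by re-invoking the model with an additional instruction and re-scoring the revised solution.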