DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances spanning six programming languages and six task categories derived from real developer telemetry, such as API usage and code purpose understanding. Unlike prior benchmarks, it emphasizes ecological validity, avoids training data contamination, and enables detailed diagnostics. The evaluation combines functional correctness, similarity-based metrics, and LLM-judge assessments focused on usefulness and contextual relevance. Nine state-of-the-art models were assessed, revealing differences in syntactic precision, semantic reasoning, and practical utility. Our benchmark provides actionable insights to guide model selection and improvement, a level of detail that is often missing from other benchmarks but is essential for both practical deployment and targeted model development.