Programmers are turning to AI coding assistants to answer questions about their code. Benchmarks are needed to soundly evaluate these systems and understand their performance. To enable such a study, we curate a benchmark of real-world contextualized questions derived from GitHub pull request comments. Out of this work, we present RubberDuckBench: a multilingual benchmark of questions about code, along with detailed rubrics for evaluating answers. We evaluate a diverse set of 20 LLMs (proprietary and open-source) on answering these questions. We find that even state-of-the-art models fail to give consistent, correct responses across the benchmark. Grok 4 (69.29%), Claude Opus 4 (68.5%), and GPT-5 (67.8%) perform best overall, but do not exhibit pairwise significant superiority over the next 9 best-performing models. Most models obtain points through partial credit, with the best-performing models answering at most 2 questions completely correctly across all trials. Furthermore, models frequently hallucinate, with fabricated claims appearing in 58.3% of responses on average. Cost analysis reveals no correlation between expense (API pricing or parameter count) and performance. We intend this benchmark to be a target for future research in trustworthy and correct AI coding assistants.