Large Language Models have demonstrated exceptional proficiency on coding tasks, yet precisely evaluating their code reasoning ability remains challenging. Existing benchmarks are insufficient: they are unrealistic and conflate semantic reasoning ability with performance on software engineering tasks. We introduce CRQBench, a benchmark of 100 C++ code reasoning questions and answers derived from contextualized code review comments. To curate CRQBench, we use an LLM assistant alongside human inspection, reducing manual effort. We evaluate GPT-4 on CRQBench and find that it produces correct responses grounded in the given context for 65 of the 100 questions.