Understanding and reasoning about code semantics is essential for enhancing code LLMs' ability to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or educational coding problems and focus on coarse-grained reasoning tasks such as input/output prediction, limiting their effectiveness for evaluating LLMs in practical SE contexts. To bridge this gap, we propose CodeSense, the first benchmark that provides a spectrum of fine-grained code reasoning tasks grounded in the software engineering of real-world code. We collected Python, C, and Java software projects from real-world repositories, executed the tests in these repositories, collected their execution traces, and constructed a ground-truth dataset for fine-grained semantic reasoning tasks. We then performed a comprehensive evaluation of state-of-the-art LLMs. Our results show a clear performance gap when models handle fine-grained reasoning tasks. Although prompting techniques such as chain-of-thought and in-context learning helped, the lack of code semantics in LLMs fundamentally limits their code reasoning capabilities. Besides the dataset, benchmark, and evaluation, our work produced an execution tracing framework and tool set that make it easy to collect ground truth for fine-grained SE reasoning tasks, offering a strong basis for future benchmark construction and model post-training. Our code and data are available at https://codesense-bench.github.io/.
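To illustrate the kind of ground truth such execution tracing can yield, the following is a minimal sketch using Python's standard `sys.settrace` hook to record concrete variable values at each executed line. This is only an assumed, simplified stand-in for the paper's tracing framework (the function `add_and_double` and the log format are hypothetical, for illustration only); the actual tool set is described in the paper itself.

```python
import sys

def make_tracer(trace_log):
    """Return a trace function that logs local variable values at each executed line."""
    def tracer(frame, event, arg):
        if event == "line":
            trace_log.append({
                "function": frame.f_code.co_name,
                "line": frame.f_lineno,
                "locals": dict(frame.f_locals),  # snapshot of values at this point
            })
        return tracer  # keep tracing inside this frame
    return tracer

# Hypothetical function under test.
def add_and_double(x, y):
    s = x + y
    return s * 2

trace_log = []
sys.settrace(make_tracer(trace_log))
result = add_and_double(3, 4)
sys.settrace(None)

# Each log entry pairs a source line with the concrete variable values observed
# there, e.g. the intermediate value of `s` before the return -- the kind of
# fine-grained semantic ground truth coarse input/output prediction misses.
```

In this sketch, the trace log captures that `s` held the value 7 on the `return` line, so a fine-grained task could ask a model to predict an intermediate variable's value rather than only the final output.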