In modern software development, developers frequently need to understand code behavior at a glance -- whether reviewing pull requests, debugging issues, or navigating unfamiliar codebases. This ability to reason about dynamic program behavior is fundamental to effective software engineering and is increasingly supported by Large Language Models (LLMs). However, existing studies on code reasoning focus primarily on isolated code snippets, overlooking the complexity of real-world scenarios involving external API interactions and unfamiliar functions. This gap hinders our understanding of what truly makes code reasoning challenging for LLMs across diverse programming contexts. We present CodeGlance, a multi-dimensional benchmark that investigates code reasoning challenges across three realistic scenarios: intrinsic logic reasoning, API interaction reasoning, and unseen function reasoning. Through a systematic evaluation of 7 state-of-the-art LLMs, we reveal that unseen function reasoning poses significant challenges, especially for smaller models, with Qwen2.5-3b achieving only 6.0\% accuracy on unseen functions compared to 37.5\% on familiar APIs. We identify critical code complexity features -- including execution trace length, API invocation count, and control flow complexity -- that significantly impact code reasoning difficulty across scenarios. We further investigate how common augmentation strategies, including chain-of-thought (CoT) prompting, document retrieval, and code search, can improve reasoning performance, finding that their effectiveness varies substantially depending on whether the challenges stem from logical complexity or from knowledge gaps. These findings provide actionable guidance for developing more capable code reasoning systems and deploying LLM-based programming assistants in real-world software development.