Understanding Codebase like a Professional! Human-AI Collaboration for Code Comprehension

Understanding an unfamiliar codebase is an essential task for developers in many scenarios, such as onboarding. When the codebase is large and time is limited, reaching a decent level of comprehension remains challenging for both experienced and novice developers, even with the assistance of LLMs. Existing studies have shown that LLMs often fail to support users in understanding code structures or to provide user-centered, adaptive, and dynamic assistance in real-world settings. To address this gap, we propose learning from a unique role, code auditors, whose work often requires them to familiarize themselves quickly with new code projects on a weekly or even daily basis. We recruited and interviewed eight code auditing practitioners to understand how they approach codebase understanding. From these interviews, we identified four design opportunities for an LLM-based codebase understanding system: supporting cognitive alignment through automated codebase information extraction, decomposition, and representation, as well as reducing manual effort and conversational distraction through interaction design. To validate these four design opportunities, we designed a system prototype, CodeMap, that provides dynamic information extraction and representation aligned with the human cognitive flow and enables interactive switching among hierarchical codebase visualizations. We then conducted a user study with nine experienced developers and six novice developers. Our results demonstrate that CodeMap significantly improved users' perceived intuitiveness, ease of use, and usefulness in supporting code comprehension, while reducing their reliance on reading and interpreting LLM responses by 79% and increasing map usage time by 90% compared to the static visualization analysis tool.
