Large language model (LLM) coding agents typically explore codebases by repeatedly reading files and running grep searches, consuming thousands of tokens per query without gaining any structural understanding of the code. We present Codebase-Memory, an open-source system that builds a persistent, Tree-sitter-based knowledge graph and exposes it to agents through the Model Context Protocol (MCP). It parses 66 languages through a multi-phase pipeline with parallel worker pools and supports call-graph traversal, impact analysis, and community discovery. Evaluated across 31 real-world repositories, Codebase-Memory achieves 83% answer quality versus 92% for a file-exploration agent, while using one-tenth the tokens and 2.1 times fewer tool calls. For graph-native queries such as hub detection and caller ranking, it matches or exceeds the explorer on 19 of 31 languages.
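To make the graph-native queries concrete, the following is a minimal sketch of impact analysis and hub detection over a call graph. It is an illustration under stated assumptions, not Codebase-Memory's implementation: the edge list, the function names (`build_reverse_index`, `impacted_by`, `rank_hubs`), and the use of in-degree as the hub score are all hypothetical choices made for this example.

```python
from collections import defaultdict, deque

# Hypothetical call edges (caller, callee); not taken from the paper.
CALLS = [
    ("main", "load_config"),
    ("main", "run"),
    ("run", "parse"),
    ("run", "log"),
    ("parse", "log"),
    ("load_config", "log"),
]


def build_reverse_index(edges):
    """Map each callee to the set of functions that call it directly."""
    callers = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)
    return callers


def impacted_by(target, callers):
    """Impact analysis: breadth-first walk over reverse call edges,
    collecting every direct or transitive caller of `target`."""
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


def rank_hubs(callers):
    """Hub detection (assumed metric): rank functions by the number
    of distinct direct callers, highest first."""
    return sorted(((len(c), name) for name, c in callers.items()), reverse=True)


callers = build_reverse_index(CALLS)
impact = impacted_by("parse", callers)  # functions affected if `parse` changes
hubs = rank_hubs(callers)               # most-called functions first
```

With the graph stored once, each of these questions is answered by a single traversal of the index, whereas a file-exploration agent would have to rediscover the same edges by reading and searching source files for every query.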