Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results, with no view of how the corpus is organized or what it has not yet seen. We present Corpus2Skill, which distills a document corpus offline into a hierarchical skill directory and lets an LLM agent navigate it at serve time, drilling from a bird's-eye view through progressively finer summaries down to documents, and backtracking when a branch is unproductive. On an enterprise customer-support benchmark, Corpus2Skill improves both answer quality and grounding over single-shot dense, hybrid, hierarchical-retrieval, and agentic RAG baselines at a moderate cost. A ten-subset generalization study further shows that corpus navigation is not a universal replacement for retrieval: it consistently helps on single-domain corpora with a recoverable topical taxonomy, but flat retrieval remains preferable on open-domain factoid pools and on homogeneous tabular corpora that defeat top-level clustering. We characterize this scope distinction and discuss it as a design guideline for knowledge-grounded systems. Code is available at https://github.com/dukesun99/Corpus2Skill.
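To make the drill-down-and-backtrack idea concrete, the following is a minimal sketch and not the Corpus2Skill implementation. It assumes a hand-built tree of summaries and replaces the LLM agent's branch choice with a toy word-overlap score; the names `Node`, `overlap`, and `navigate`, and the sample directory, are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One entry in a hypothetical skill directory: a summary plus children, or a document."""
    summary: str
    children: list[Node] = field(default_factory=list)
    document: Optional[str] = None  # set only on leaf nodes


def overlap(query: str, text: str) -> int:
    """Toy relevance score (assumption): count of shared lowercase words.
    It stands in for the judgment an LLM agent would make at each level."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def navigate(node: Node, query: str, visited: list[str]) -> Optional[str]:
    """Depth-first drill-down with backtracking; returns the first relevant document."""
    visited.append(node.summary)
    if node.document is not None:
        # Leaf: accept the document only if it looks relevant, otherwise backtrack.
        return node.document if overlap(query, node.document) > 0 else None
    # Try the most promising branches first (sorted is stable, so ties keep their order).
    ranked = sorted(node.children, key=lambda c: overlap(query, c.summary), reverse=True)
    for child in ranked:
        if overlap(query, child.summary) == 0:
            continue  # branch summary shares nothing with the query: skip it
        found = navigate(child, query, visited)
        if found is not None:
            return found
        # Reaching this point is the backtracking step: move on to the next sibling.
    return None


# Hypothetical two-level directory for a customer-support corpus.
root = Node("support topics", [
    Node("billing refund invoices payments", [
        Node("invoice download", document="How to download an invoice as PDF."),
        Node("refund policy", document="A refund is issued within 14 days of purchase."),
    ]),
    Node("accounts password login", [
        Node("password reset", document="Reset your password from the login page."),
    ]),
])

trail: list[str] = []
answer = navigate(root, "reset password billing", trail)
print(answer)
print(trail)
```

With the query above, the billing branch is entered first, yields nothing, and the search backtracks into the accounts branch; `trail` records the summaries visited along the way. In a real system the overlap score would presumably be replaced by the agent reading each level's summaries and choosing where to go next.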