Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results: it never sees how the corpus is organized or what it has not yet retrieved, limiting its ability to backtrack or combine scattered evidence. We present Corpus2Skill, which distills a document corpus into a hierarchical skill directory offline and lets an LLM agent navigate it at serve time. The compilation pipeline iteratively clusters documents, generates LLM-written summaries at each level, and materializes the result as a tree of navigable skill files. At serve time, the agent receives a bird's-eye view of the corpus, drills into topic branches via progressively finer summaries, and retrieves full documents by ID. Because the hierarchy is explicitly visible, the agent can reason about where to look, backtrack from unproductive paths, and combine evidence across branches. On WixQA, an enterprise customer-support benchmark for RAG, Corpus2Skill outperforms dense retrieval, RAPTOR, and agentic RAG baselines across all quality metrics.
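The compile-then-navigate idea described above can be illustrated with a minimal sketch. It is not the Corpus2Skill implementation: `cluster` and `summarize` are trivial placeholders for the clustering step and the LLM-written summaries, the tree is kept in memory rather than materialized as skill files, `navigate` is a keyword-matching stand-in for the LLM agent, and every name here (`Node`, `compile_corpus`, `view`, `navigate`, `fanout`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    node_id: str
    summary: str
    children: List["Node"] = field(default_factory=list)
    doc_id: Optional[str] = None  # set only on leaves


def summarize(texts: List[str]) -> str:
    # Placeholder for an LLM-written summary: first sentence of each text,
    # truncated to 40 characters and joined.
    return " | ".join(t.split(".")[0][:40] for t in texts)


def cluster(nodes: List[Node], fanout: int) -> List[List[Node]]:
    # Placeholder for real clustering: consecutive chunks of size `fanout`.
    return [nodes[i:i + fanout] for i in range(0, len(nodes), fanout)]


def compile_corpus(docs: Dict[str, str], fanout: int = 2) -> Node:
    # Offline step: group nodes level by level until a single root remains.
    if not docs or fanout < 2:
        raise ValueError("need at least one document and fanout >= 2")
    level = [Node(doc_id, summarize([text]), doc_id=doc_id)
             for doc_id, text in docs.items()]
    depth = 0
    while len(level) > 1:
        depth += 1
        level = [Node(f"L{depth}-{i}", summarize([c.summary for c in group]),
                      children=group)
                 for i, group in enumerate(cluster(level, fanout))]
    return level[0]


def view(node: Node) -> List[Tuple[str, str]]:
    # What the agent sees at one level: child IDs with their summaries.
    return [(c.node_id, c.summary) for c in node.children]


def navigate(root: Node, docs: Dict[str, str],
             keyword: str) -> Tuple[Optional[str], List[str]]:
    # Serve-time stand-in for the agent: try branches whose summary mentions
    # the keyword first, and backtrack when a branch yields nothing.
    key = keyword.lower()
    visited: List[str] = []

    def walk(node: Node) -> Optional[str]:
        visited.append(node.node_id)
        if node.doc_id is not None:  # leaf: fetch the full document by ID
            return node.doc_id if key in docs[node.doc_id].lower() else None
        ordered = sorted(node.children,
                         key=lambda c: key not in c.summary.lower())
        for child in ordered:
            hit = walk(child)
            if hit is not None:
                return hit
        return None

    return walk(root), visited


# Toy corpus (invented for illustration only).
docs = {
    "d1": "Reset your password from the login page.",
    "d2": "Connect a custom domain in settings. Billing for domains is renewed yearly.",
    "d3": "Refund policy for premium plans.",
    "d4": "Export site analytics as CSV.",
}
root = compile_corpus(docs, fanout=2)
print(view(root))
print(navigate(root, docs, "billing"))
```

The second element returned by `navigate` is the list of visited node IDs, which makes any backtracking visible: a query whose keyword appears only in a document body, not in any summary, forces the walk to open a leaf, reject it, and move on to a sibling.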