Retrieval-Augmented Generation (RAG) enhances the factual grounding of Large Language Models by conditioning their outputs on external documents. However, standard embedding-based retrievers treat naturally structured corpora, such as technical manuals, as flat collections of passages, thereby overlooking the hyperlink topology that users rely on when navigating such content. We introduce LARAG (Link-Aware RAG), a lightweight retrieval strategy that leverages the author-defined hyperlink structure already present in HTML documentation: it encodes hyperlink relations as metadata in the chunk representations and exploits them to perform a form of graph-like retrieval of locally relevant content. On a benchmark of twenty expert-designed queries over the Rulex Platform technical documentation, evaluated under four prompting strategies, LARAG consistently improves answer quality, achieving the highest BERTScore F1 while retrieving fewer chunks and generating fewer tokens than the baseline RAG architecture used for comparison. These results show that directly leveraging the existing hyperlink topology of technical documentation, even without explicit graph construction or inference, enables an implicit form of graph-like retrieval that yields a more faithful and efficient RAG pipeline, providing better grounding at lower cost.