The advent of Retrieval-Augmented Generation (RAG) has significantly enhanced the ability of Large Language Models (LLMs) to produce factually accurate and up-to-date responses. However, the performance of a RAG system is not determined by a single component but emerges from a complex interplay of modular choices, such as embedding models and retrieval algorithms. This creates a vast and often opaque configuration space, making it challenging for developers to understand performance trade-offs and identify optimal designs. To address this challenge, we present RAGExplorer, a visual analytics system for the systematic comparison and diagnosis of RAG configurations. RAGExplorer guides users through a seamless macro-to-micro analytical workflow. Initially, it empowers developers to survey the performance landscape across numerous configurations, allowing for a high-level understanding of which design choices are most effective. For a deeper analysis, the system enables users to drill down into individual failure cases, investigate how differences in retrieved information contribute to errors, and interactively test hypotheses by manipulating the provided context to observe the resulting impact on the generated answer. We demonstrate the effectiveness of RAGExplorer through detailed case studies and user studies, validating its ability to empower developers in navigating the complex RAG design space. Our code and user guide are publicly available at https://github.com/Thymezzz/RAGExplorer.
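To make the "configuration space" and the macro-to-micro workflow concrete, the following is a minimal sketch and not RAGExplorer's actual implementation. The component names (`embed-small`, `bm25`, `top_k`) and all helper functions are hypothetical, chosen only for illustration; the abstract itself names just embedding models and retrieval algorithms as examples of modular choices. The sketch assumes each configuration has already been evaluated into a list of per-question correctness flags.

```python
from itertools import product

# Hypothetical search space: every key and value here is an illustrative
# assumption, not a component list taken from RAGExplorer.
SEARCH_SPACE = {
    "embedding_model": ["embed-small", "embed-large"],
    "retriever": ["dense", "bm25"],
    "top_k": [3, 5],
}


def enumerate_configs(space):
    """Return every combination of component choices as a dict."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def accuracy(results):
    """Fraction of questions answered correctly (0.0 for an empty list)."""
    return sum(results) / len(results) if results else 0.0


def rank_configs(evaluated):
    """Macro view: sort (config, per-question results) pairs by accuracy, best first."""
    scored = [(accuracy(results), config) for config, results in evaluated]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def failing_questions(results):
    """Micro view: indices of the questions a configuration got wrong."""
    return [i for i, ok in enumerate(results) if not ok]
```

Even this small space yields 2 × 2 × 2 = 8 configurations, and each added component multiplies the count, which is why a high-level ranking followed by inspection of individual failures is a natural order of analysis.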