Retrieval-Augmented Generation (RAG) systems couple large language models with external knowledge, yet most evaluation methods report aggregate scores that reveal whether a pipeline underperforms but not where or why. We introduce RAGXplain, an evaluation framework that translates performance metrics into actionable guidance. RAGXplain structures evaluation around a 'Metric Diamond' connecting user input, retrieved context, generated answer, and (when available) ground truth via six diagnostic dimensions. It uses LLM reasoning to produce natural-language failure-mode explanations and prioritized interventions. Across five QA benchmarks, applying RAGXplain's recommendations in a single human-guided pass consistently improves RAG pipeline performance across multiple metrics. We release RAGXplain as open source to support reproducibility and community adoption.
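One way to picture the Metric Diamond is as a small graph: four elements give exactly six unordered pairs, which matches the count of six diagnostic dimensions. The sketch below assumes that each dimension compares one pair of elements; the abstract does not state this, and the names `NODES` and `diamond_edges` are hypothetical rather than part of the RAGXplain release.

```python
from itertools import combinations

# The four elements named in the abstract. These identifiers are illustrative
# assumptions, not names taken from the RAGXplain code base.
NODES = ["user_input", "retrieved_context", "generated_answer", "ground_truth"]


def diamond_edges(has_ground_truth: bool = True) -> list[tuple[str, str]]:
    """Return every unordered pair of the available elements.

    Assumption: each diagnostic dimension relates exactly one pair of elements.
    """
    nodes = NODES if has_ground_truth else NODES[:-1]
    return list(combinations(nodes, 2))


if __name__ == "__main__":
    for a, b in diamond_edges():
        print(f"{a} <-> {b}")
```

Under this assumption, a dataset without ground truth would leave only the three pairs among user input, retrieved context, and generated answer, which fits the abstract's "(when available)" qualifier.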