Retrieval-Augmented Generation (RAG) systems have been actively studied and deployed across industries to query domain-specific knowledge bases. However, evaluating these systems presents unique challenges: domain-specific queries and their corresponding ground truths are scarce, and there is no systematic approach to diagnosing the cause of failure cases, that is, whether they stem from knowledge deficits or from a lack of system robustness. To address these challenges, we introduce GRAMMAR (GRounded And Modular Methodology for Assessment of RAG), an evaluation framework comprising two key elements: 1) a data generation process that leverages relational databases and LLMs to efficiently produce query-answer pairs at scale, separating query logic from linguistic variations to make debugging easier; and 2) an evaluation framework that differentiates knowledge gaps from robustness issues and enables the identification of defective modules. Our empirical results underscore the limitations of current reference-free evaluation approaches and the reliability of GRAMMAR in accurately identifying model vulnerabilities.
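As a rough illustration of the two elements described in the abstract, the sketch below pairs one SQL template (the query logic) with several surface phrasings (the linguistic variations), reads the ground-truth answer from a relational database, and then labels each logic group by how a system answers its variants. Everything in it is an assumption made for illustration and is not taken from the paper: the `project` table, the `LOGIC` dictionary, the function names, and the rule that a group failing on every phrasing indicates a knowledge gap while a group failing on only some indicates a robustness issue. In the framework itself an LLM would produce the phrasings; here they are hard-coded so the example stays self-contained.

```python
import sqlite3
from collections import defaultdict


def build_db():
    """Create a tiny in-memory knowledge base (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE project (name TEXT, lead TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO project VALUES (?, ?, ?)",
        [("Atlas", "Chen", 2021), ("Borealis", "Okafor", 2023)],
    )
    return conn


# One query logic: a SQL template plus several phrasings of the same question.
# The phrasings stand in for LLM-generated linguistic variations.
LOGIC = {
    "id": "lead_of_project",
    "sql": "SELECT lead FROM project WHERE name = ?",
    "phrasings": [
        "Who leads the {name} project?",
        "Which person is in charge of {name}?",
    ],
}


def generate_pairs(conn, logic):
    """Produce query-answer pairs; the answer always comes from the database."""
    pairs = []
    names = [row[0] for row in conn.execute("SELECT name FROM project ORDER BY name")]
    for name in names:
        answer = conn.execute(logic["sql"], (name,)).fetchone()[0]
        for text in logic["phrasings"]:
            pairs.append(
                {
                    "group": (logic["id"], name),  # same logic, same entity
                    "query": text.format(name=name),
                    "answer": answer,
                }
            )
    return pairs


def diagnose(pairs, predict):
    """Label each group from the system's answers to its phrasings.

    Assumed rule: all variants wrong -> knowledge gap; some wrong -> robustness issue.
    """
    results = defaultdict(list)
    for pair in pairs:
        results[pair["group"]].append(predict(pair["query"]) == pair["answer"])
    report = {}
    for group, outcomes in results.items():
        if all(outcomes):
            report[group] = "ok"
        elif not any(outcomes):
            report[group] = "knowledge gap"
        else:
            report[group] = "robustness issue"
    return report
```

To try it, pass `diagnose` the output of `generate_pairs(build_db(), LOGIC)` together with any callable that maps a query string to an answer string, such as a wrapper around a RAG pipeline. Grouping by logic and entity is what lets a failure be attributed to the phrasing rather than to missing knowledge.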