Retrieval-Augmented Generation (RAG) systems have been actively studied and deployed across industries to answer queries over domain-specific knowledge bases. However, evaluating these systems presents unique challenges: domain-specific queries and corresponding ground truths are scarce, and there is no systematic approach to diagnosing whether failure cases stem from knowledge deficits or from a lack of system robustness. To address these challenges, we introduce GRAMMAR (GRounded And Modular Methodology for Assessment of RAG), an evaluation framework comprising two key elements: 1) a data generation process that leverages relational databases and LLMs to efficiently produce scalable query-answer pairs; this method separates query logic from linguistic variation, enabling more effective debugging; and 2) an evaluation framework that distinguishes knowledge gaps from robustness failures and identifies defective modules. Our empirical results underscore the limitations of current reference-free evaluation approaches and demonstrate the reliability of GRAMMAR in accurately identifying model vulnerabilities.
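To make the data generation idea concrete, below is a minimal sketch (not the paper's actual pipeline) of how query logic can be kept in a SQL template executed against a relational database while linguistic variations are produced separately. The table schema, function names, and fixed question templates are illustrative assumptions; in the full method an LLM would generate the paraphrases.

```python
# Sketch of GRAMMAR-style QA-pair generation: the *logic* of a query lives
# in one SQL template, the ground-truth answer comes from the database, and
# several phrasings of the same question share that one reference answer.
import sqlite3

# Toy domain-specific knowledge base (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT);
    INSERT INTO employees VALUES ('Alice', 'Research'), ('Bob', 'Sales');
""")

# One logical query template.
SQL_TEMPLATE = "SELECT name FROM employees WHERE department = ?"

# Linguistic variants of the same logic. A real pipeline would obtain these
# from an LLM paraphrase step; fixed strings keep the sketch self-contained.
QUESTION_TEMPLATES = [
    "Who works in the {dept} department?",
    "List the employees belonging to {dept}.",
]

def generate_pairs(dept: str):
    """Return (question, ground-truth answer) pairs for one logical query."""
    answer = [row[0] for row in conn.execute(SQL_TEMPLATE, (dept,))]
    return [(q.format(dept=dept), answer) for q in QUESTION_TEMPLATES]

for question, answer in generate_pairs("Research"):
    print(question, "->", answer)
```

Because every phrasing variant maps to the same database-derived answer, a system that fails on all variants of a query suggests a knowledge gap, whereas failure on only some variants points to a robustness issue; this is one way to read the diagnostic separation the framework provides.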