Retrieval-Augmented Generation (RAG) systems have been actively studied and deployed across industries to query domain-specific knowledge bases. Evaluating these systems, however, presents unique challenges: domain-specific queries and their corresponding ground truths are scarce, and there is no systematic approach to diagnosing whether a failure case stems from a knowledge deficit or from a lack of system robustness. To address these challenges, we introduce GRAMMAR (GRounded And Modular Methodology for Assessment of RAG), an evaluation framework comprising two key elements: 1) a data generation process that leverages relational databases and LLMs to efficiently produce query-answer pairs at scale, and that separates query logic from linguistic variation to make debugging easier; and 2) an evaluation framework that distinguishes knowledge gaps from robustness issues and enables the identification of defective modules. Our empirical results underscore the limitations of current reference-free evaluation approaches and the reliability of GRAMMAR in accurately identifying model vulnerabilities.
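As a rough illustration of the two elements described in the abstract, the sketch below pairs each query logic (one SQL template) with several surface phrasings, takes the ground-truth answer from a relational database, and then labels each logic group from per-phrasing outcomes. Everything in it is an assumption made for illustration: the toy `project` table, the fixed phrasing templates (standing in for LLM-generated paraphrases), the function names, and the rule "all phrasings fail means knowledge gap, mixed outcomes mean robustness issue". It is not the authors' implementation.

```python
import sqlite3


def build_db():
    """Create a tiny in-memory database (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE project (name TEXT, lead TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO project VALUES (?, ?, ?)",
        [("Atlas", "Kim", 2021), ("Borealis", "Osei", 2023)],
    )
    return conn


# One entry per query logic: the SQL fixes the semantics, while the phrasings
# vary only the wording. In a real pipeline an LLM would write the phrasings;
# fixed templates are used here so the sketch runs offline.
LOGIC = {
    "lead_of_project": {
        "sql": "SELECT lead FROM project WHERE name = ?",
        "phrasings": [
            "Who leads the {name} project?",
            "Which person is in charge of {name}?",
        ],
    },
}


def generate_pairs(conn):
    """Return query-answer pairs grouped by a logic key."""
    names = [row[0] for row in conn.execute("SELECT name FROM project ORDER BY name")]
    pairs = []
    for logic_id, spec in LOGIC.items():
        for name in names:
            # The database, not a model, supplies the ground truth.
            answer = conn.execute(spec["sql"], (name,)).fetchone()[0]
            for template in spec["phrasings"]:
                pairs.append(
                    {
                        "logic": f"{logic_id}:{name}",
                        "query": template.format(name=name),
                        "answer": answer,
                    }
                )
    return pairs


def diagnose(results):
    """Label each logic group from a list of per-phrasing pass/fail flags.

    Assumed rule: every phrasing fails -> knowledge gap; some pass and some
    fail -> robustness issue; every phrasing passes -> pass.
    """
    labels = {}
    for key, outcomes in results.items():
        if all(outcomes):
            labels[key] = "pass"
        elif not any(outcomes):
            labels[key] = "knowledge gap"
        else:
            labels[key] = "robustness issue"
    return labels


if __name__ == "__main__":
    for pair in generate_pairs(build_db()):
        print(pair)
```

Because every phrasing in a group shares the same SQL and therefore the same answer, a disagreement between phrasings can be attributed to wording rather than to missing knowledge, which is the debugging benefit the abstract points to.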