Retrieval-Augmented Generation (RAG) systems have been actively studied and deployed across various industries to query domain-specific knowledge bases. However, evaluating these systems presents unique challenges due to the scarcity of domain-specific queries and corresponding ground truths, as well as the lack of a systematic approach to diagnosing failure cases -- whether they stem from knowledge deficits or from issues with system robustness. To address these challenges, we introduce GRAMMAR (GRounded And Modular Methodology for Assessment of RAG), an evaluation framework comprising two key elements: 1) a data generation process that leverages relational databases and LLMs to efficiently produce scalable query-answer pairs for evaluation; this method separates query logic from linguistic variations, enabling hypotheses about non-robust textual forms to be tested; and 2) an evaluation framework that differentiates knowledge gaps from robustness issues and identifies defective modules. Our empirical results underscore the limitations of current reference-free evaluation approaches and the reliability of GRAMMAR in accurately identifying model vulnerabilities. For implementation details, refer to our GitHub repository: https://github.com/xinzhel/grammar.
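The abstract's first element, separating query logic from linguistic variation, can be illustrated with a minimal sketch: a single ground-truth query against a relational database is paired with several natural-language phrasings of the same question. The schema, table contents, and phrasing templates below are hypothetical toy examples, not taken from the paper; the actual GRAMMAR pipeline additionally uses LLMs to generate variations at scale.

```python
import sqlite3

# Hypothetical toy schema; the real setting uses a domain-specific database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (name TEXT, lead TEXT)")
conn.executemany("INSERT INTO projects VALUES (?, ?)",
                 [("Apollo", "Alice"), ("Hermes", "Bob")])

# One logical query (SQL) paired with several linguistic variants of the
# same question, so robustness to phrasing can be tested independently
# of the underlying query logic.
sql = "SELECT lead FROM projects WHERE name = ?"
phrasings = [
    "Who leads the {name} project?",
    "Which person is in charge of {name}?",
    "{name}: who is the project lead?",
]

def generate_pairs(entity):
    """The ground-truth answer comes from the database; each phrasing
    yields one query-answer pair sharing that answer."""
    (answer,) = conn.execute(sql, (entity,)).fetchone()
    return [(p.format(name=entity), answer) for p in phrasings]

pairs = generate_pairs("Apollo")
# Three differently-worded questions, all grounded in the same answer "Alice".
```

Because every variant shares one database-derived answer, a RAG system that answers some phrasings correctly but not others reveals a robustness failure rather than a knowledge deficit.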