Retrieval-Augmented Generation (RAG) systems have been actively studied and deployed across various industries to query domain-specific knowledge bases. However, evaluating these systems presents unique challenges: domain-specific queries and their corresponding ground truths are scarce, and there is no systematic approach to diagnosing whether failure cases stem from knowledge deficits or from a lack of system robustness. To address these challenges, we introduce GRAMMAR (GRounded And Modular Methodology for Assessment of RAG), an evaluation framework comprising two key elements: 1) a data generation process that leverages relational databases and LLMs to efficiently produce scalable query-answer pairs, and that separates query logic from linguistic variations to make debugging easier; and 2) an evaluation framework that differentiates knowledge gaps from robustness issues and enables the identification of defective modules. Our empirical results underscore the limitations of current reference-free evaluation approaches and the reliability of GRAMMAR in accurately identifying model vulnerabilities.
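To make the two elements concrete, the following is a minimal sketch of the idea rather than the authors' implementation. It assumes a toy in-memory SQLite table, a single hypothetical SQL template standing in for the query logic, hand-written text templates standing in for LLM-generated linguistic variations, and an illustrative grouping rule (all variants of one logic fail: knowledge gap; only some fail: robustness issue). Every name in it (`employee`, `SQL_TEMPLATE`, `TEXT_TEMPLATES`, `generate_pairs`, `diagnose`, `toy_system`) is an assumption introduced for this example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical query logic: one SQL template per logical question.
SQL_TEMPLATE = "SELECT department FROM employee WHERE name = ?"

# Hypothetical linguistic variations of that logic
# (an LLM could produce these paraphrases; here they are hand-written).
TEXT_TEMPLATES = [
    "Which department does {name} work in?",
    "What is {name}'s department?",
    "Tell me the department that {name} belongs to.",
]


def build_db():
    """Create a toy relational knowledge base in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (name TEXT, department TEXT)")
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?)",
        [("Alice", "Sales"), ("Bob", "Research"), ("Carol", "Legal")],
    )
    return conn


def generate_pairs(conn):
    """Produce query-answer pairs; the ground truth comes from the database."""
    pairs = []
    names = conn.execute("SELECT name FROM employee ORDER BY name").fetchall()
    for (name,) in names:
        answer = conn.execute(SQL_TEMPLATE, (name,)).fetchone()[0]
        for template in TEXT_TEMPLATES:
            pairs.append({
                "logic": (SQL_TEMPLATE, name),  # identifies the query logic
                "query": template.format(name=name),
                "answer": answer,
            })
    return pairs


def diagnose(pairs, system):
    """Group results by query logic and label each group (assumed rule)."""
    outcomes = defaultdict(list)
    for pair in pairs:
        outcomes[pair["logic"]].append(system(pair["query"]) == pair["answer"])
    report = {}
    for logic, results in outcomes.items():
        if all(results):
            report[logic] = "ok"
        elif not any(results):
            report[logic] = "knowledge gap"
        else:
            report[logic] = "robustness issue"
    return report


def toy_system(query):
    """Stand-in for a RAG system: knows Alice, handles Bob in one phrasing only."""
    if "Alice" in query:
        return "Sales"
    if "Bob" in query and query.startswith("Which"):
        return "Research"
    return "unknown"


if __name__ == "__main__":
    for logic, label in diagnose(generate_pairs(build_db()), toy_system).items():
        print(logic[1], "->", label)
```

Because every textual variant carries the identifier of the logic it was generated from, a failure can be traced either to the logic as a whole or to a particular phrasing, which is the debugging benefit the abstract attributes to separating the two.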