We present a comprehensive study of answer quality evaluation in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system designed to assess correctness, completeness, and honesty. We further map the grades for these quality aspects into a single binary score, indicating an accept or reject decision, mirroring the intuitive "thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This approach suits factual business contexts where a clear decision is essential. Our assessment applies vRAG-Eval to two Large Language Models (LLMs), evaluating the quality of answers generated by a vanilla RAG application. We compare these evaluations with human expert judgments and find substantial alignment between GPT-4's assessments and those of human experts, reaching 83% agreement on accept or reject decisions. This study highlights the potential of LLMs as reliable evaluators in closed-domain, closed-ended settings, particularly when human evaluations would require significant resources.
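The abstract describes mapping per-aspect grades (correctness, completeness, honesty) into a single accept/reject decision. A minimal sketch of one such mapping is below; the function name, grade scale, and thresholding rule are assumptions for illustration, not the paper's exact specification.

```python
# Hypothetical sketch of mapping per-aspect quality grades to a binary
# accept/reject decision. The grade scale (1-5) and the all-aspects-above-
# threshold rule are illustrative assumptions, not vRAG-Eval's definition.
def to_decision(correctness: int, completeness: int, honesty: int,
                threshold: int = 4) -> str:
    """Return "accept" only if every aspect grade meets the threshold."""
    grades = (correctness, completeness, honesty)
    return "accept" if all(g >= threshold for g in grades) else "reject"

print(to_decision(5, 4, 5))  # accept: all grades >= 4
print(to_decision(5, 2, 5))  # reject: completeness below threshold
```

Collapsing multi-aspect grades into one binary signal trades nuance for a decision that is easy to compare against human "thumbs-up/thumbs-down" feedback.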