We present a comprehensive study of answer quality evaluation in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system designed to assess correctness, completeness, and honesty. We further map the grades for these quality aspects into a binary score indicating an accept or reject decision, mirroring the intuitive "thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This approach suits factual business settings, where a clear accept-or-reject decision is essential. We apply vRAG-Eval to two Large Language Models (LLMs), evaluating the quality of answers generated by a vanilla RAG application. Comparing these evaluations with human expert judgments, we find substantial alignment between GPT-4's assessments and those of human experts, reaching 83% agreement on accept-or-reject decisions. This study highlights the potential of LLMs as reliable evaluators in closed-domain, closed-ended settings, particularly when human evaluation would require significant resources.