As a typical and practical application of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) techniques have gained extensive attention, particularly in vertical domains where LLMs may lack domain-specific knowledge. In this paper, we introduce OmniEval, an omnidirectional and automatic RAG benchmark for the financial domain. Our benchmark is characterized by its multi-dimensional evaluation framework, including (1) a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios; (2) a multi-dimensional evaluation data generation approach that combines GPT-4-based automatic generation with human annotation, achieving an 87.47\% acceptance ratio in human evaluations of generated instances; (3) a multi-stage evaluation system that assesses both retrieval and generation performance, resulting in a comprehensive evaluation of the entire RAG pipeline; and (4) robust evaluation metrics, combining rule-based and LLM-based ones, whose reliability is enhanced through manual annotations and supervised fine-tuning of an LLM evaluator. Our experiments demonstrate the comprehensiveness of OmniEval: its extensive test datasets highlight the performance variations of RAG systems across diverse topics and tasks, revealing significant opportunities for RAG models to improve their capabilities in vertical domains. We open-source the code of our benchmark at \href{https://github.com/RUC-NLPIR/OmniEval}{https://github.com/RUC-NLPIR/OmniEval}.