Despite the significant progress made by existing retrieval-augmented language models (RALMs) in providing trustworthy responses grounded in reliable sources, they often overlook effective alignment with human preferences. In the alignment process, reward models (RMs) act as a crucial proxy for human values to guide optimization. However, it remains unclear how to evaluate and select a reliable RM for preference alignment in RALMs. To this end, we propose RAG-RewardBench, the first benchmark for evaluating RMs in RAG settings. First, we design four crucial and challenging RAG-specific scenarios to assess RMs, including multi-hop reasoning, fine-grained citation, appropriate abstain, and conflict robustness. Then, we incorporate 18 RAG subsets, 6 retrievers, and 24 RALMs to increase the diversity of data sources. Finally, we adopt an LLM-as-a-judge approach to improve the efficiency and effectiveness of preference annotation, which exhibits a strong correlation with human annotations. Based on RAG-RewardBench, we conduct a comprehensive evaluation of 45 RMs and uncover their limitations in RAG scenarios. Additionally, we reveal that existing trained RALMs show almost no improvement in preference alignment, highlighting the need for a shift towards preference-aligned training. We release our benchmark and code publicly at https://huggingface.co/datasets/jinzhuoran/RAG-RewardBench/ for future work.
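The LLM-as-a-judge annotation step can be illustrated with a minimal sketch. The prompt template, the `build_judge_prompt` and `parse_verdict` helpers, and the example below are all hypothetical illustrations of pairwise preference labeling, not the paper's exact protocol.

```python
# Minimal sketch of LLM-as-a-judge pairwise preference annotation for RAG
# responses. Assumes a judge model that answers with a single letter "A" or
# "B"; the template and parsing logic are illustrative, not the paper's.

JUDGE_PROMPT = """You are an impartial judge. Given a question and the
retrieved documents, decide which response is better grounded and more
helpful.

[Question]
{question}

[Retrieved Documents]
{documents}

[Response A]
{response_a}

[Response B]
{response_b}

Answer with a single letter: "A" or "B"."""


def build_judge_prompt(question, documents, response_a, response_b):
    """Fill the pairwise-comparison template for one annotation example."""
    return JUDGE_PROMPT.format(
        question=question,
        documents="\n".join(documents),
        response_a=response_a,
        response_b=response_b,
    )


def parse_verdict(judge_output):
    """Map the judge's raw text output to a preference label, or None."""
    verdict = judge_output.strip().upper()
    if verdict.startswith("A"):
        return "A"
    if verdict.startswith("B"):
        return "B"
    return None  # unparseable: skip or re-query in a real pipeline


# Example with a stubbed judge reply (a real pipeline would query an LLM):
prompt = build_judge_prompt(
    "Who wrote The Selfish Gene?",
    ["Doc 1: The Selfish Gene (1976) is a book by Richard Dawkins."],
    "Richard Dawkins wrote it [Doc 1].",
    "It was written by Stephen Jay Gould.",
)
print(parse_verdict("A"))  # the judge's letter becomes the preference label
```

In practice, such pipelines typically also swap the A/B positions and re-query to control for position bias, keeping only examples where the judge's verdict is consistent.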