LLMs are an integral part of retrieval-augmented generation (RAG) systems. While many studies focus on evaluating the quality of end-to-end RAG systems, there is a lack of research on the suitability of an LLM for the RAG task. Thus, we introduce a new metric, Trust-Score, that provides a holistic evaluation of the trustworthiness of LLMs in an RAG framework. We show that various prompting methods, such as in-context learning, fail to adapt LLMs effectively to the RAG task. To address this, we propose Trust-Align, a framework to align LLMs for a higher Trust-Score. LLaMA-3-8b, aligned with our method, significantly outperforms open-source LLMs of comparable size on ASQA (up 10.7), QAMPARI (up 29.2), and ELI5 (up 14.9). We release our code at: https://github.com/declare-lab/trust-align.