To reduce issues such as hallucination and lack of controllability in Large Language Models (LLMs), a common approach is to have the model generate responses grounded on external contexts given as input; such models are known as knowledge-augmented models. However, previous research often defines "grounding" narrowly as merely containing the correct answer, which does not ensure the reliability of the entire response. To overcome this, we propose a stricter definition of grounding: a model is truly grounded if it (1) fully utilizes the necessary knowledge from the provided context, and (2) stays within the limits of that knowledge. We introduce a new dataset and a grounding metric to evaluate model capability under this definition. We perform experiments across 25 LLMs of different sizes and training methods and provide insights into the factors that influence grounding performance. Our findings contribute to a better understanding of how to improve grounding capabilities and suggest an area of improvement toward more reliable and controllable LLM applications.