Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) have recently been proposed to alleviate these issues, few have seen systematic comparison. In this paper, we develop CUB (Context Utilisation Benchmark), the first comprehensive benchmark designed to help diagnose CMTs under diverse noisy context conditions within retrieval-augmented generation (RAG). With this benchmark, we conduct the most extensive evaluation to date of seven state-of-the-art methods, representative of the main categories of CMTs, applied to 11 LMs across three diverse datasets and tasks. Our findings expose critical gaps in current CMT evaluation practices, demonstrating the need for holistic testing. We reveal that most existing CMTs struggle to handle the full spectrum of context types encountered in real-world RAG scenarios. We also find that many CMTs display inflated performance on simple synthesised datasets, compared to more realistic datasets with naturally occurring samples.