Retrieval-augmented generation (RAG) helps address the limitations of the parametric knowledge embedded within a language model (LM). However, investigations of how LMs utilise retrieved information of varying complexity in real-world scenarios have been limited to synthetic contexts. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts), comprising real-world queries and contexts manually annotated for stance. The dataset is based on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We compare DRUID to synthetic datasets (CounterFact, ConflictQA) and find that artificial datasets often fail to represent the complexity and diversity of real-world context settings. We show that synthetic datasets exaggerate context characteristics that are rare in real retrieved data, which leads to inflated context utilisation results, as measured by our novel ACU score. Moreover, while previous work has mainly focused on singleton context characteristics to explain context utilisation, we find that on DRUID the correlations between singleton context properties and ACU are surprisingly weak compared to properties related to the context source. Overall, our work underscores the need for real-world aligned context utilisation studies to represent and improve performance in real-world RAG settings.