Retrieval-augmented generation (RAG) helps address the limitations of the parametric knowledge embedded within a language model (LM). However, investigations of how LMs utilise retrieved information of varying complexity in real-world scenarios have been limited to synthetic contexts. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts), comprising real-world queries and contexts manually annotated for stance. The dataset is based on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We compare DRUID to synthetic datasets (CounterFact, ConflictQA) and find that artificial datasets often fail to represent the complexity and diversity of real-world context settings. We show that synthetic datasets exaggerate context characteristics that are rare in real retrieved data, which leads to inflated context utilisation results, as measured by our novel ACU score. Moreover, while previous work has mainly focused on singleton context characteristics to explain context utilisation, we find that on DRUID the correlations between singleton context properties and ACU are surprisingly weak compared to properties related to the context source. Overall, our work underscores the need for real-world aligned context utilisation studies to represent and improve performance in real-world RAG settings.