Ward: Provable RAG Dataset Inference via LLM Watermarks

Retrieval-Augmented Generation (RAG) improves LLMs by enabling them to incorporate external data during generation. This raises concerns for data owners regarding unauthorized use of their content in RAG systems. Despite its importance, the challenge of detecting such unauthorized usage remains underexplored, with existing datasets and methodologies from adjacent fields being ill-suited for its study. In this work, we take several steps to bridge this gap. First, we formalize this problem as (black-box) RAG Dataset Inference (RAG-DI). To facilitate research on this challenge, we further introduce a novel dataset specifically designed for benchmarking RAG-DI methods under realistic conditions, and propose a set of baseline approaches. Building on this foundation, we introduce Ward, a RAG-DI method based on LLM watermarks that enables data owners to obtain rigorous statistical guarantees regarding the usage of their dataset in a RAG system. In our experimental evaluation, we show that Ward consistently outperforms all baselines across many challenging settings, achieving higher accuracy, superior query efficiency and robustness. Our work provides a foundation for future studies of RAG-DI and highlights LLM watermarks as a promising approach to this problem.
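To make the idea of a "rigorous statistical guarantee" concrete, the following is a minimal sketch of how a watermark detector can turn a RAG system's responses into a p-value. The abstract does not say which watermarking scheme Ward uses, so this sketch assumes a generic red-green token watermark; the names `is_green`, `watermark_p_value`, `GAMMA`, and the key `"owner-secret"` are hypothetical, and whitespace-level tokens stand in for a real tokenizer.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of tokens marked "green" for any given context


def is_green(prev_token: str, token: str, key: str = "owner-secret") -> bool:
    """Pseudorandomly assign `token` to the green list, seeded by the
    previous token and a secret key (hypothetical scheme)."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return digest[0] / 256.0 < GAMMA


def watermark_p_value(tokens: list[str], key: str = "owner-secret") -> float:
    """One-sided p-value for H0: the text was produced without the watermark.

    Under H0 each distinct (previous token, token) pair is green with
    probability GAMMA, so the green count is approximately
    Binomial(n, GAMMA); a normal approximation gives the p-value.
    """
    # Score each distinct pair once so repeated phrases are not double counted.
    pairs = set(zip(tokens, tokens[1:]))
    n = len(pairs)
    if n == 0:
        return 1.0
    green = sum(is_green(prev, tok, key) for prev, tok in pairs)
    z = (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
    # Upper-tail probability of a standard normal: P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2))
```

In a setting like the one described, a data owner would watermark their documents before publishing them, query the suspected RAG system, and apply such a test to the concatenated responses. A small p-value is evidence that watermarked content reached the generator, and the chosen significance threshold bounds the false-positive rate under the null hypothesis.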

