Retrieval-Augmented Generation (RAG) improves LLMs by enabling them to incorporate external data during generation. This raises concerns for data owners about unauthorized use of their content in RAG systems. Despite its importance, the problem of detecting such unauthorized usage remains underexplored, and existing datasets and methodologies from adjacent fields are ill-suited for studying it. In this work, we take several steps to bridge this gap. First, we formalize the problem as (black-box) RAG Dataset Inference (RAG-DI). To facilitate research on it, we introduce a novel dataset specifically designed for benchmarking RAG-DI methods under realistic conditions, and propose a set of baseline approaches. Building on this foundation, we introduce Ward, a RAG-DI method based on LLM watermarks that enables data owners to obtain rigorous statistical guarantees regarding the usage of their dataset in a RAG system. In our experimental evaluation, we show that Ward consistently outperforms all baselines across many challenging settings, achieving higher accuracy, better query efficiency, and greater robustness. Our work provides a foundation for future studies of RAG-DI and highlights LLM watermarks as a promising approach to this problem.