We study the problem of selling data for Retrieval-Augmented Generation (RAG) tasks in generative AI applications. We model each buyer's valuation of a dataset with a natural coverage-based valuation function, which increases as the dataset includes more relevant data points that would improve responses to anticipated queries. Motivated by concerns such as data control and prior-free revenue maximization, we focus on the setting in which each data point can be allocated to only one buyer. We show that welfare maximization in this setting is NP-hard even with two bidders, and we design a polynomial-time $(1-1/e)$-approximation algorithm for any number of bidders. This allocation algorithm, however, is not incentive compatible. The crux of our approach is a carefully tailored post-processing step, called data burning, that retains the $(1-1/e)$ approximation factor while achieving incentive compatibility. Experiments on synthetic and real-world image and text datasets demonstrate the practical effectiveness of our algorithm compared to popular baseline algorithms for combinatorial auctions.
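As a rough illustration of the setting described in the abstract, the sketch below shows one possible reading of a coverage-based valuation and a disjoint allocation of data points. Everything in it is an assumption made for illustration: the names `coverage_value` and `greedy_allocate`, the representation of relevance as a map from data points to the queries they cover, and the per-query weights. The allocation rule is a simple greedy baseline. It is not the $(1-1/e)$-approximation algorithm of the paper, and it does not include the data burning step.

```python
from typing import Dict, Iterable, Set

# Hypothetical model: a buyer's value for a set of data points is the total
# weight of that buyer's anticipated queries covered by at least one point.
def coverage_value(points: Iterable[str],
                   covers: Dict[str, Set[str]],
                   weights: Dict[str, float]) -> float:
    covered: Set[str] = set()
    for p in points:
        covered |= covers.get(p, set())
    return sum(weights[q] for q in covered)

# Simple greedy baseline (an assumption, not the paper's algorithm): go through
# the data points in order and give each one to the buyer with the largest
# positive marginal gain. Each point goes to at most one buyer.
def greedy_allocate(points: Iterable[str],
                    buyer_covers: Dict[str, Dict[str, Set[str]]],
                    buyer_weights: Dict[str, Dict[str, float]]) -> Dict[str, Set[str]]:
    alloc: Dict[str, Set[str]] = {b: set() for b in buyer_covers}
    for p in points:
        best_buyer, best_gain = None, 0.0
        for b in buyer_covers:
            base = coverage_value(alloc[b], buyer_covers[b], buyer_weights[b])
            new = coverage_value(alloc[b] | {p}, buyer_covers[b], buyer_weights[b])
            if new - base > best_gain:
                best_buyer, best_gain = b, new - base
        if best_buyer is not None:
            alloc[best_buyer].add(p)
    return alloc

# Toy instance with two buyers and three data points (made-up numbers).
buyer_weights = {
    "A": {"q1": 3.0, "q2": 1.0},
    "B": {"r1": 2.0, "r2": 2.0},
}
buyer_covers = {
    "A": {"d1": {"q1"}, "d2": {"q1", "q2"}},
    "B": {"d1": {"r1"}, "d2": {"r1"}, "d3": {"r2"}},
}

alloc = greedy_allocate(["d1", "d2", "d3"], buyer_covers, buyer_weights)
welfare = sum(coverage_value(alloc[b], buyer_covers[b], buyer_weights[b])
              for b in alloc)
print(alloc, welfare)
```

On this toy instance the greedy baseline gives `d1` to buyer A and `d2`, `d3` to buyer B, for a welfare of 7, whereas giving `d2` to A and `d1`, `d3` to B would yield 8. This small gap hints at why welfare maximization under the one-buyer-per-point constraint calls for more careful algorithms.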