Identifying the origin of data is crucial for data provenance, with applications including data ownership protection, media forensics, and detection of AI-generated content. A standard approach uses embedding-based retrieval to match query data against entries in a reference dataset. However, this approach is not robust to benign or malicious edits. To address this, we propose Data Retrieval with Error-corrected codes and Watermarking (DREW). DREW randomly clusters the reference dataset, injects a unique error-controlled watermark key into each cluster, and uses these keys at query time to identify the cluster to which a given sample belongs. After locating the relevant cluster, DREW performs embedding-vector similarity retrieval within that cluster to find the best matches. Integrating error control codes (ECC) makes cluster assignment reliable and lets the method fall back to retrieval over the entire dataset when the ECC algorithm cannot identify the correct cluster with high confidence. This fallback allows DREW to maintain baseline performance, while retrieval over a smaller subset of the dataset increases the likelihood of correctly matching a query to its origin and so offers room for improvement. Depending on the watermarking technique used, DREW can provide substantial improvements in retrieval accuracy (up to 40% for some datasets and modification types) across multiple datasets and state-of-the-art embedding models (e.g., DinoV2, CLIP), making our method a promising solution for secure and reliable source identification. The code is available at https://github.com/mehrdadsaberi/DREW
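
The pipeline described above (random clustering, per-cluster keys protected by an error control code, cluster lookup with a full-dataset fallback, then similarity retrieval) can be illustrated with a minimal sketch. This is not the released implementation: a simple repetition code stands in for the ECC, the watermark embedding and extraction steps are not modeled (the extracted key bits are simulated directly), embeddings are random unit vectors rather than DinoV2 or CLIP features, and all names and parameters (`REPEAT`, `KEY_BITS`, `max_errors`, `encode_cluster_id`, `decode_cluster_id`, `retrieve`) are hypothetical.

```python
import numpy as np

REPEAT = 5    # assumed repetition factor of the stand-in code
KEY_BITS = 4  # assumed key length; supports up to 2**4 = 16 clusters


def encode_cluster_id(cluster_id: int) -> np.ndarray:
    """Turn a cluster id into a repetition-coded bit string."""
    bits = np.array([(cluster_id >> i) & 1 for i in range(KEY_BITS)], dtype=np.uint8)
    return np.repeat(bits, REPEAT)


def decode_cluster_id(received: np.ndarray, max_errors: int = 4):
    """Majority-vote decode; return None when the result is not trusted."""
    votes = received.reshape(KEY_BITS, REPEAT).sum(axis=1)
    bits = (votes * 2 > REPEAT).astype(np.uint8)
    cluster_id = int(sum(int(b) << i for i, b in enumerate(bits)))
    # Confidence check: Hamming distance to the re-encoded codeword.
    distance = int(np.sum(np.repeat(bits, REPEAT) != received))
    return cluster_id if distance <= max_errors else None


def retrieve(query_emb, query_bits, ref_embs, ref_clusters) -> int:
    """Return the index of the best-matching reference item."""
    candidates = np.arange(len(ref_embs))  # default: search the full dataset
    cluster_id = decode_cluster_id(query_bits)
    if cluster_id is not None:
        in_cluster = np.flatnonzero(ref_clusters == cluster_id)
        if in_cluster.size > 0:
            candidates = in_cluster  # restrict the search to one cluster
    sims = ref_embs[candidates] @ query_emb  # cosine similarity (unit vectors)
    return int(candidates[np.argmax(sims)])


# Toy reference set: random unit-norm embeddings and random cluster ids.
rng = np.random.default_rng(0)
n_refs, dim, n_clusters = 200, 32, 16
ref_embs = rng.normal(size=(n_refs, dim))
ref_embs /= np.linalg.norm(ref_embs, axis=1, keepdims=True)
ref_clusters = rng.integers(0, n_clusters, size=n_refs)

# Simulated query: a lightly perturbed copy of item 17 whose key has 2 bit errors.
src = 17
query_emb = ref_embs[src] + 0.05 * rng.normal(size=dim)
query_emb /= np.linalg.norm(query_emb)
query_bits = encode_cluster_id(int(ref_clusters[src]))
query_bits[[0, 7]] ^= 1
match = retrieve(query_emb, query_bits, ref_embs, ref_clusters)
```

In this sketch the two bit errors are corrected by majority voting, so the search is confined to the source item's cluster. When the received key is too far from any re-encoded codeword, `decode_cluster_id` returns `None` and `retrieve` searches the whole reference set, which mirrors the fallback that keeps baseline performance.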