Watermark radioactivity tests detect whether a model was trained on watermarked documents, and they have become key tools for protecting data ownership when fine-tuning large language models (LLMs). Existing work has demonstrated their effectiveness in centralized LLM fine-tuning. However, these methods face several challenges and remain underexplored in federated learning (FL), a widely used paradigm for collaboratively fine-tuning LLMs on private data held by different users. FL protects privacy mainly through secure aggregation (SA), which lets the server aggregate updates while keeping each client's update private. This mechanism preserves privacy but makes it difficult to identify which client trained on watermarked documents. In this work, we propose FedAttr, a new client-level attribution protocol for FL. FedAttr identifies which clients trained on watermarked data via a paired-subset-difference mechanism, while preserving the privacy guarantees of SA and FL performance. FedAttr proceeds in three steps: (i) estimate each client's update by differencing two SA queries, (ii) score the estimate with the watermark detector via differential scoring, and (iii) combine scores across rounds via Stouffer's method. We show theoretically that FedAttr produces an unbiased estimator of each client's update with bounded mutual information leakage (i.e., $O(d^*/N)$ per-round update). Moreover, FedAttr empirically achieves a 100% true positive rate (TPR) and a 0% false positive rate (FPR), outperforming all baselines by at least 44.4% in TPR or 19.1% in FPR, with only 6.3% overhead relative to FL training time. Ablation studies confirm that FedAttr is robust to protocol parameters and configurations.
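As a rough illustration of steps (i) and (iii), the following toy sketch differences two aggregate queries and combines per-round z-scores with Stouffer's method, $Z = \sum_{t=1}^{k} z_t / \sqrt{k}$. It is not the FedAttr protocol itself: the function names, the in-the-clear `secure_aggregate` stand-in, and the choice of subsets that differ by exactly one client are all assumptions made for illustration, and the watermark detector of step (ii) is not sketched.

```python
import math
import numpy as np

def secure_aggregate(updates, subset):
    """Hypothetical stand-in for an SA query: exposes only the sum over `subset`."""
    return np.sum([updates[c] for c in subset], axis=0)

def estimate_update(updates, client, others):
    """Difference two aggregate queries whose subsets differ only by `client`.

    Assumption: this exact differencing is a simplification; the abstract
    describes an estimator with bounded leakage whose details are not given here.
    """
    with_client = secure_aggregate(updates, list(others) + [client])
    without_client = secure_aggregate(updates, list(others))
    return with_client - without_client

def stouffer(z_scores):
    """Combine k per-round z-scores: Z = sum(z) / sqrt(k)."""
    z = np.asarray(z_scores, dtype=float)
    return float(z.sum() / math.sqrt(len(z)))

def one_sided_p(z):
    """Upper-tail p-value of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Toy data: three clients with two-dimensional updates.
updates = {
    0: np.array([1.0, 2.0]),
    1: np.array([3.0, 4.0]),
    2: np.array([5.0, 6.0]),
}
estimate = estimate_update(updates, client=1, others=[0, 2])
combined = stouffer([1.0, 1.0, 1.0, 1.0])
```

In this toy setting the differenced estimate equals client 1's update, and four per-round scores of 1.0 combine to a z-score of 2.0, which shows how weak per-round evidence accumulates across rounds.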