FedDEAP: Adaptive Dual-Prompt Tuning for Multi-Domain Federated Learning

Federated learning (FL) enables multiple clients to collaboratively train machine learning models without exposing local data, balancing performance and privacy. However, domain shift and label heterogeneity across clients often hinder the generalization of the aggregated global model. Recently, large-scale vision-language models like CLIP have shown strong zero-shot classification capabilities, raising the question of how to effectively fine-tune CLIP across domains in a federated setting. In this work, we propose an adaptive federated prompt tuning framework, FedDEAP, to enhance CLIP's generalization in multi-domain scenarios. Our method includes the following three key components: (1) To mitigate the loss of domain-specific information caused by label-supervised tuning, we disentangle semantic and domain-specific features in images by using semantic and domain transformation networks with unbiased mappings; (2) To preserve domain-specific knowledge during global prompt aggregation, we introduce a dual-prompt design with a global semantic prompt and a local domain prompt to balance shared and personalized information; (3) To maximize the inclusion of semantic and domain information from images in the generated text features, we align textual and visual representations under the two learned transformations to preserve semantic and domain consistency. Theoretical analysis and extensive experiments on four datasets demonstrate the effectiveness of our method in enhancing the generalization of CLIP for federated image recognition across multiple domains.
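The dual-prompt design in component (2) can be illustrated with a minimal sketch. The abstract does not state the aggregation rule, so the sample-weighted (FedAvg-style) average below is an assumption, and the names `ClientPrompts`, `aggregate_semantic_prompts`, and `communication_round` are hypothetical. Prompts are shown as plain lists of floats rather than learnable CLIP context tokens, and local training is omitted. The sketch only shows which prompt is shared and which stays on the client.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClientPrompts:
    """Hypothetical container for one client's prompt parameters."""
    semantic_prompt: List[float]  # shared: uploaded to the server each round
    domain_prompt: List[float]    # personalized: never leaves the client
    num_samples: int              # local dataset size, used as the weight


def aggregate_semantic_prompts(clients: List[ClientPrompts]) -> List[float]:
    """Sample-weighted average of the semantic prompts (assumed FedAvg-style)."""
    total = sum(c.num_samples for c in clients)
    dim = len(clients[0].semantic_prompt)
    aggregated = [0.0] * dim
    for c in clients:
        weight = c.num_samples / total
        for i, value in enumerate(c.semantic_prompt):
            aggregated[i] += weight * value
    return aggregated


def communication_round(clients: List[ClientPrompts]) -> List[float]:
    """Aggregate the semantic prompts and broadcast the result to every client.

    Domain prompts are deliberately left untouched, so each client keeps
    its own domain-specific knowledge across rounds.
    """
    global_prompt = aggregate_semantic_prompts(clients)
    for c in clients:
        c.semantic_prompt = list(global_prompt)  # copy to avoid shared state
    return global_prompt
```

After a call to `communication_round`, every client holds the same semantic prompt, while each `domain_prompt` keeps its pre-round value. This separation is what lets shared and personalized information be balanced.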
