Efficient fine-tuning of vision-language models such as CLIP has become crucial due to their massive parameter scale and extensive pretraining requirements. Existing methods typically address either unseen classes or unseen domains in isolation, without a joint framework for both. In this paper, we propose \textbf{Fed}erated Joint Learning for \textbf{D}omain and \textbf{C}lass \textbf{G}eneralization, termed \textbf{FedDCG}, a novel approach that addresses both class and domain generalization in federated learning settings. Our method introduces a domain-grouping strategy in which class-generalized networks are trained within each group to prevent decision-boundary confusion. During inference, we aggregate the class-generalized outputs according to domain similarity, effectively integrating knowledge from both class and domain generalization. Specifically, a learnable network enhances class-generalization capability, and a decoupling mechanism separates general from domain-specific knowledge, improving generalization to unseen domains. Extensive experiments on multiple datasets show that \textbf{FedDCG} outperforms state-of-the-art baselines in both accuracy and robustness.