Existing federated learning (FL) studies usually assume that the training and test label spaces are identical. In real-world applications, however, this assumption rarely holds: a new user may issue queries that involve data from unseen classes, and such open-vocabulary queries would directly cause these FL systems to fail. In this work, we therefore focus explicitly on the under-explored open-vocabulary challenge in FL; that is, for a new user, the global server should understand queries that involve arbitrary unknown classes. To address this problem, we leverage pre-trained vision-language models (VLMs). In particular, we present a novel adaptation framework tailored for VLMs in the context of FL, named Federated Multimodal Prototyping (Fed-MP). Fed-MP adaptively aggregates the local model weights based on lightweight client residuals and makes predictions using a novel multimodal prototyping mechanism. It exploits the knowledge learned from the seen classes and makes the adapted VLM more robust to unseen categories. Our empirical evaluation on various datasets validates the effectiveness of Fed-MP.