Previous studies on federated learning (FL) often suffer performance degradation due to data heterogeneity among clients. Motivated by recent advances in multimodal large language models (MLLMs), such as GPT-4V and LLaVA, which demonstrate exceptional proficiency in multimodal tasks such as image captioning and multimodal question answering, we introduce a novel federated learning framework, named Multimodal Large Language Model Assisted Federated Learning (MLLM-LLaVA-FL), which employs powerful MLLMs on the server side to address the challenges of data heterogeneity and long-tailed distributions. Owing to the advanced cross-modality representation capabilities and the extensive open-vocabulary prior knowledge of MLLMs, our framework can harness the extensive, yet previously underexploited, open-source data available on the web, together with powerful server-side computational resources. Hence, MLLM-LLaVA-FL not only enhances performance but also avoids increasing the risk of privacy leakage and the computational burden on local devices, distinguishing it from prior methodologies. Our framework has three key stages. First, we conduct global visual-text pretraining of the model on the extensive open-source data available online, with the assistance of MLLMs. Next, the pretrained model is distributed to the clients for local training. Finally, once the locally trained models are transmitted back to the server, a global alignment is carried out under the supervision of MLLMs to further enhance performance. Experimental evaluations on established benchmarks show that our framework delivers promising performance in typical FL scenarios with data heterogeneity and long-tailed distributions across clients.
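The three-stage workflow described above can be sketched in simplified form. This is a minimal illustration, not the authors' implementation: the "model" is a plain list of scalar weights trained by gradient steps on a toy regression loss, and the alignment labels stand in for the MLLM-generated supervision the paper uses; all function names and the learning rates are hypothetical.

```python
# Hypothetical sketch of the three-stage MLLM-LLaVA-FL pipeline.
# The model is a list of floats; each (x, y) pair is a toy training
# example, and the gradient step minimizes 0.5 * (w * x - y) ** 2.
# In the actual framework, stage 1 and stage 3 would be driven by
# MLLM-assisted supervision on open-source multimodal data.

def sgd_pass(weights, data, lr):
    """One pass of scalar SGD over (input, target) pairs."""
    for x, y in data:
        weights = [w - lr * (w * x - y) * x for w in weights]
    return weights

def pretrain_global(weights, open_source_data):
    """Stage 1: server-side visual-text pretraining on open data."""
    return sgd_pass(weights, open_source_data, lr=0.1)

def local_train(weights, client_data):
    """Stage 2: each client fine-tunes the distributed model locally."""
    return sgd_pass(weights, client_data, lr=0.1)

def global_align(client_models, align_data):
    """Stage 3: average client models (FedAvg-style), then refine
    under server-side supervision (MLLM pseudo-labels in the paper)."""
    n = len(client_models)
    avg = [sum(ws) / n for ws in zip(*client_models)]
    return sgd_pass(avg, align_data, lr=0.05)

if __name__ == "__main__":
    data = [(1.0, 2.0)] * 50           # toy target relation: y = 2x
    w = pretrain_global([0.0], data)    # stage 1 on the server
    clients = [local_train(w, data) for _ in range(3)]  # stage 2
    w = global_align(clients, data)     # stage 3 on the server
    print(round(w[0], 3))
```

The key structural point the sketch keeps is that the heavy stages (1 and 3) run on the server, so client devices only perform the local training of stage 2 and never share raw data.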