Vision-language pre-training (VLP) has emerged as an efficient paradigm for multimodal representation learning, but it requires large-scale multimodal data for pre-training, which is a major obstacle, especially for medical applications. To overcome this data limitation, federated learning (FL) is a promising strategy for scaling up the dataset for medical VLP while protecting data privacy. However, client data are often heterogeneous in real-world scenarios, and we observe that local training on heterogeneous client data distorts multimodal representation learning and leads to biased cross-modal alignment. To address this challenge, we propose Federated Align as IDeal (FedAID), a framework for federated VLP that is robust to data heterogeneity, binding local clients to an ideal cross-modal alignment. Specifically, to reduce distortions of the globally aggregated features while learning diverse semantics from client datasets during local training, we propose to bind the cross-modal aligned representation space learned by local models to an unbiased one via guidance-based regularization. Moreover, we employ a distribution-based min-max optimization to learn the unbiased cross-modal alignment at each communication round of federated pre-training. Experiments on real-world datasets demonstrate that our method successfully enables efficient federated multimodal learning for medical VLP under data heterogeneity.
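The guidance-based regularization described above can be illustrated with a minimal sketch: the local model's image-text alignment distribution is pulled toward that of a frozen guidance model assumed to provide the unbiased alignment. The loss form (a KL divergence over softmax-normalized cosine similarities), the temperature `tau`, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity matrix between two embedding batches.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guidance_regularizer(img_local, txt_local, img_guide, txt_guide, tau=0.07):
    """Illustrative guidance-based regularizer (assumed form):
    KL divergence between the frozen guidance model's image-to-text
    alignment distribution and the local model's, averaged over the batch.
    """
    p_local = softmax(cosine_sim(img_local, txt_local) / tau, axis=1)
    p_guide = softmax(cosine_sim(img_guide, txt_guide) / tau, axis=1)
    kl = np.sum(p_guide * (np.log(p_guide + 1e-8) - np.log(p_local + 1e-8)), axis=1)
    return float(np.mean(kl))
```

During local training, this term would be added to the client's contrastive objective so that heterogeneous client data can still contribute diverse semantics without pulling the shared representation space away from the unbiased alignment.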