Edge-cloud synergy offers a promising paradigm for privacy-preserving deployment of foundation models, in which lightweight on-device models adapt to domain-specific data while a cloud-hosted model coordinates knowledge sharing. However, in real-world edge environments, collaborative multimodal learning is challenged by modality heterogeneity (different modality combinations across domains) and model-structure heterogeneity (different modality-specific encoders and fusion modules). To address these issues, we propose ML-ECS, a collaborative multimodal learning framework that enables joint training between a server-side model and heterogeneous edge models. The framework comprises four components: (1) cross-modal contrastive learning (CCL), which aligns modality representations in a shared latent space; (2) adaptive multimodal tuning (AMT), which preserves domain-specific knowledge from local datasets; (3) modality-aware model aggregation (MMA), which aggregates model updates robustly while mitigating the noise caused by missing modalities; and (4) SLM-enhanced CCL (SE-CCL), which facilitates bidirectional knowledge transfer between cloud and edge. Experimental results on diverse multimodal tasks show that ML-ECS consistently outperforms state-of-the-art baselines under varying modality availability, improving ROUGE-LSum by 5.44% to 12.08% and boosting both client- and server-side performance. Moreover, by communicating only low-rank LoRA parameters and fused representations, ML-ECS achieves high communication efficiency, transmitting only 0.65% of the total parameter volume.
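The abstract describes cross-modal contrastive learning only at a high level. A common instantiation of this kind of alignment is a symmetric InfoNCE objective over paired modality embeddings projected into the shared latent space; the sketch below is a minimal NumPy version under that assumption (the function name and temperature value are illustrative, not taken from the paper):

```python
import numpy as np

def cross_modal_contrastive_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired embeddings from two modalities.

    z_a, z_b: (batch, dim) arrays; row i of each is the same sample seen
    through a different modality, projected into the shared latent space.
    """
    # L2-normalize so the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_diag(l):
        # Cross entropy with the diagonal (matched pairs) as targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetrize over both alignment directions (a -> b and b -> a).
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Minimizing this loss pulls matched cross-modal pairs together and pushes mismatched pairs apart, which is the alignment behavior CCL aims for.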
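The 0.65% communication figure is specific to the paper's model and adapter configuration, but the underlying arithmetic is straightforward: a rank-r LoRA adapter for a d_in x d_out weight matrix transmits r * (d_in + d_out) parameters instead of d_in * d_out. A minimal sketch (the dimensions and rank below are illustrative assumptions, not the paper's setup):

```python
def lora_param_ratio(d_in, d_out, rank):
    """Fraction of a dense d_in x d_out weight's parameters that a rank-r
    LoRA adapter (factors A: d_in x r and B: r x d_out) must communicate
    instead of the full matrix."""
    full_params = d_in * d_out
    lora_params = rank * (d_in + d_out)
    return lora_params / full_params

# Illustrative (assumed) dimensions: a 4096 x 4096 projection at rank 16
# needs well under 1% of the full matrix's parameter volume.
ratio = lora_param_ratio(4096, 4096, 16)  # = 0.0078125, i.e. ~0.78%
```

Because the ratio scales as r * (d_in + d_out) / (d_in * d_out), small ranks on large layers keep per-round communication a tiny fraction of full-model exchange.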