Vision-Language Models (VLMs) based on Mixture-of-Experts (MoE) architectures have emerged as a pivotal paradigm in multimodal understanding, offering a powerful framework for integrating visual and linguistic information. However, the increasing complexity and diversity of tasks present significant challenges in coordinating load balancing across heterogeneous visual experts, where optimizing one specialist's performance often compromises others' capabilities. To address task heterogeneity and expert load imbalance, we propose Astrea, a novel multi-expert collaborative VLM architecture based on progressive pre-alignment. Astrea introduces three key innovations: 1) A heterogeneous expert coordination mechanism that integrates four specialized models (detection, segmentation, classification, captioning) into a comprehensive expert matrix covering essential visual comprehension elements; 2) A dynamic knowledge fusion strategy featuring progressive pre-alignment to harmonize experts within the VLM latent space through contrastive learning, complemented by probabilistically activated stochastic residual connections to preserve knowledge continuity; 3) An enhanced optimization framework utilizing momentum contrastive learning for long-range dependency modeling and adaptive weight allocators for real-time expert contribution calibration. Extensive evaluations across 12 benchmark tasks spanning VQA, image captioning, and cross-modal retrieval demonstrate Astrea's superiority over state-of-the-art models, achieving an average performance gain of +4.7\%. This study provides the first empirical demonstration that progressive pre-alignment strategies enable VLMs to overcome task heterogeneity limitations, establishing new methodological foundations for developing general-purpose multimodal agents.
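The abstract mentions two fusion mechanisms: adaptive weight allocators that calibrate each expert's contribution, and probabilistically activated stochastic residual connections that preserve knowledge continuity. A minimal sketch of how such a fusion step might look is given below; the function name, the softmax-gated weighting, and the `skip_prob` parameter are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def fuse_experts(expert_outputs, gate_logits, skip_prob=0.1, rng=None):
    """Hypothetical fusion of heterogeneous expert features.

    expert_outputs: list of equally-shaped feature arrays, one per expert
                    (e.g. detection, segmentation, classification, captioning).
    gate_logits:    per-expert scores from an adaptive weight allocator.
    skip_prob:      probability of skipping the residual branch, making the
                    residual connection stochastic rather than always-on.
    """
    rng = rng or np.random.default_rng(0)
    # Adaptive weight allocation: softmax over gate logits gives each
    # expert's contribution to the fused representation.
    w = np.exp(gate_logits - gate_logits.max())
    w /= w.sum()
    fused = sum(wi * e for wi, e in zip(w, expert_outputs))
    # Probabilistically activated residual: with probability (1 - skip_prob),
    # add back the unweighted mean of expert features so that knowledge from
    # down-weighted experts is not entirely discarded.
    if rng.random() > skip_prob:
        fused = fused + np.mean(expert_outputs, axis=0)
    return fused
```

With uniform gate logits and the residual branch active, each expert contributes equally and the residual adds the plain average on top; in training, the gate logits would be learned so the allocator can shift weight toward the experts most relevant to the current input.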