Vision-Language Models (VLMs) based on Mixture-of-Experts (MoE) architectures have emerged as a pivotal paradigm in multimodal understanding, offering a powerful framework for integrating visual and linguistic information. However, the increasing complexity and diversity of tasks present significant challenges in coordinating load balancing across heterogeneous visual experts, where optimizing one specialist's performance often compromises others' capabilities. To address task heterogeneity and expert load imbalance, we propose Astrea, a novel multi-expert collaborative VLM architecture based on progressive pre-alignment. Astrea introduces three key innovations: 1) A heterogeneous expert coordination mechanism that integrates four specialized models (detection, segmentation, classification, captioning) into a comprehensive expert matrix covering essential visual comprehension elements; 2) A dynamic knowledge fusion strategy featuring progressive pre-alignment to harmonize experts within the VLM latent space through contrastive learning, complemented by probabilistically activated stochastic residual connections to preserve knowledge continuity; 3) An enhanced optimization framework utilizing momentum contrastive learning for long-range dependency modeling and adaptive weight allocators for real-time expert contribution calibration. Extensive evaluations across 12 benchmark tasks spanning VQA, image captioning, and cross-modal retrieval demonstrate Astrea's superiority over state-of-the-art models, achieving an average performance gain of +4.7\%. This study provides the first empirical demonstration that progressive pre-alignment strategies enable VLMs to overcome task heterogeneity limitations, establishing new methodological foundations for developing general-purpose multimodal agents.
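The abstract mentions two fusion mechanisms: adaptive weight allocators that calibrate each expert's contribution, and probabilistically activated stochastic residual connections that preserve knowledge continuity. A minimal sketch of how such a fusion step might look is given below; the function name, the softmax-gated weighting, and the `skip_prob` parameter are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def fuse_experts(expert_outputs, gate_logits, skip_prob=0.1, rng=None):
    """Hypothetical fusion of heterogeneous expert features.

    expert_outputs: list of equally-shaped feature arrays, one per expert
                    (e.g. detection, segmentation, classification, captioning).
    gate_logits:    per-expert scores from an adaptive weight allocator.
    skip_prob:      probability of skipping the residual branch, making the
                    residual connection stochastic rather than always-on.
    """
    rng = rng or np.random.default_rng(0)
    # Adaptive weight allocation: softmax over gate logits gives each
    # expert's contribution to the fused representation.
    w = np.exp(gate_logits - gate_logits.max())
    w /= w.sum()
    fused = sum(wi * e for wi, e in zip(w, expert_outputs))
    # Probabilistically activated residual: with probability (1 - skip_prob),
    # add back the unweighted mean of expert features so that knowledge from
    # down-weighted experts is not entirely discarded.
    if rng.random() > skip_prob:
        fused = fused + np.mean(expert_outputs, axis=0)
    return fused
```

With uniform gate logits and the residual branch active, each expert contributes equally and the residual adds the plain average on top; in training, the gate logits would be learned so the allocator can shift weight toward the experts most relevant to the current input.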