This paper introduces a novel Multi-Agent Cooperative Learning (MACL) framework to address cross-modal alignment collapse in vision-language models when handling out-of-distribution (OOD) concepts. Four core agents (image, text, name, and coordination) collaboratively mitigate modality imbalance through structured message passing. The framework enables name learning in a multi-agent feature space, incorporates a context-exchange-enhanced few-shot learning algorithm, and adopts an adaptive dynamic balancing mechanism to regulate inter-agent contributions. Experiments on the VISTA-Beyond dataset show that MACL significantly improves both few-shot and zero-shot performance, yielding precision gains of 1-5% across diverse visual domains.
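The abstract does not specify MACL's internals, so the sketch below is purely illustrative: every function name and formula is an assumption. It shows the general shape of the described design, in which the image, text, and name agents exchange feature "messages" and a coordination step reweights each agent's contribution (one plausible reading of the "adaptive dynamic balancing mechanism"), here via a softmax over inter-agent agreement scores.

```python
# Hypothetical sketch of adaptive inter-agent balancing; not the paper's
# actual algorithm. Each agent contributes a feature vector, and the
# coordination step upweights agents whose features agree with the others.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def fuse(features):
    """One coordination round: score each agent's feature vector by its
    cosine agreement with the mean of the other agents' vectors, turn the
    scores into softmax weights, and return the weighted-average fusion."""
    names = list(features)
    scores = []
    for n in names:
        others = [features[m] for m in names if m != n]
        mean = [sum(col) / len(others) for col in zip(*others)]
        scores.append(cosine(features[n], mean))
    weights = softmax(scores)
    dim = len(next(iter(features.values())))
    fused = [sum(w * features[n][i] for w, n in zip(weights, names))
             for i in range(dim)]
    return fused, dict(zip(names, weights))

# Toy 3-D features for three of the agents; "name" disagrees with the rest,
# so the balancing step should assign it the smallest weight.
feats = {"image": [1.0, 0.0, 0.2],
         "text":  [0.9, 0.1, 0.3],
         "name":  [0.0, 1.0, 0.8]}
fused, weights = fuse(feats)
print(weights)
```

Under this assumed scheme, an agent whose modality drifts on OOD concepts is automatically down-weighted rather than allowed to dominate the fused representation, which is one way the described mechanism could counteract modality imbalance.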