Centralized multimodal learning commonly compresses language, acoustic, and visual signals into a single fused representation for prediction. While effective, this paradigm suffers from two limitations: (i) modality dominance, where optimization gravitates towards the path of least resistance and ignores weaker but informative modalities, and (ii) spurious modality coupling, where models overfit to incidental cross-modal correlations. To address these limitations, we propose Group Cognition Learning (GCL), a governed collaboration paradigm that applies a two-stage protocol after modality-specific encoding. In Stage 1 (Selective Interaction), a Routing Agent proposes directed interaction routes, and an Auditing Agent assigns sample-wise gates that emphasize exchanges yielding positive marginal predictive gain while suppressing redundant coupling. In Stage 2 (Consensus Formation), a Public-Factor Agent maintains an explicit shared factor, and an Aggregation Agent produces the final prediction through contribution-aware weighting while keeping each modality representation as a specialization channel. Extensive experiments on CMU-MOSI, CMU-MOSEI, and MIntRec demonstrate that GCL mitigates dominance and coupling, establishing state-of-the-art results across both regression and classification benchmarks. Additional analysis experiments further support the effectiveness of the design.
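The two-stage protocol described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how routes are scored, how gates estimate marginal predictive gain, or how the shared factor is computed. The sketch assumes all ordered modality pairs are candidate routes, a sigmoid gate over concatenated source and target features, a shared factor formed from the mean of the modality representations, and softmax contribution weights. All function names, parameter names, and the random weights standing in for learned parameters are hypothetical.

```python
import numpy as np

MODALITIES = ("language", "acoustic", "visual")


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(dim, rng):
    """Random stand-ins for learned weights (hypothetical parameterization)."""
    routes = [(s, t) for s in MODALITIES for t in MODALITIES if s != t]
    return {
        "routes": routes,  # assumption: every directed pair is a candidate route
        "msg": {r: rng.normal(scale=0.1, size=(dim, dim)) for r in routes},
        "gate": {r: rng.normal(scale=0.1, size=2 * dim) for r in routes},
        "public": rng.normal(scale=0.1, size=(dim, dim)),
        "score": rng.normal(scale=0.1, size=dim),
        "head": rng.normal(scale=0.1, size=dim),
    }


def selective_interaction(feats, params):
    """Stage 1: send messages along directed routes, gated per sample."""
    updated = {m: h.copy() for m, h in feats.items()}
    gates = {}
    for src, tgt in params["routes"]:
        message = feats[src] @ params["msg"][(src, tgt)]         # (batch, dim)
        pair = np.concatenate([feats[src], feats[tgt]], axis=1)  # (batch, 2*dim)
        gate = sigmoid(pair @ params["gate"][(src, tgt)])        # (batch,)
        gates[(src, tgt)] = gate
        updated[tgt] = updated[tgt] + gate[:, None] * message
    return updated, gates


def consensus_formation(feats, params):
    """Stage 2: build a shared factor, then weight each channel's prediction."""
    stacked = np.stack([feats[m] for m in MODALITIES], axis=1)   # (batch, 3, dim)
    public = np.tanh(stacked.mean(axis=1) @ params["public"])    # (batch, dim)
    # Modality channels stay separate; the shared factor is a fourth channel.
    channels = np.concatenate([stacked, public[:, None, :]], axis=1)
    weights = softmax(channels @ params["score"], axis=1)        # (batch, 4)
    preds = channels @ params["head"]                            # (batch, 4)
    return (weights * preds).sum(axis=1), weights


def gcl_forward(feats, params):
    """Run both stages on already-encoded modality features."""
    updated, gates = selective_interaction(feats, params)
    prediction, weights = consensus_formation(updated, params)
    return prediction, gates, weights
```

Calling `gcl_forward` with a dictionary of `(batch, dim)` arrays keyed by the three modality names returns one scalar prediction per sample, the per-route gates, and the per-channel contribution weights. Inspecting the gates and weights per sample is one way to check whether a single modality dominates, which is the kind of diagnosis the governed design is meant to support.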