Robust multimodal systems must remain effective when some modalities are noisy, degraded, or unreliable. Existing multimodal fusion methods often learn modality selection jointly with representation learning, making it difficult to determine whether robustness comes from the selector itself or from full end-to-end co-adaptation. Motivated by Global Workspace Theory (GWT), we study this question using a lightweight top-down modality selector operating on top of a frozen multimodal global workspace. We evaluate our method under structured modality corruptions on two multimodal datasets of increasing complexity, Simple Shapes and MM-IMDb 1.0. The selector improves robustness while using far fewer trainable parameters than end-to-end attention baselines, and the learned selection strategy transfers better across downstream tasks, corruption regimes, and even to a previously unseen modality. Beyond explicit corruption settings, we show on the MM-IMDb 1.0 benchmark that the same mechanism improves the global workspace over its no-attention counterpart and yields decent benchmark performance.