Recent deep learning models can efficiently combine inputs from different modalities (e.g., images and text) and learn to align their latent representations, or to translate signals from one domain to another (as in image captioning, or text-to-image generation). However, current approaches mainly rely on brute-force supervised training over large multimodal datasets. In contrast, humans (and other animals) can learn useful multimodal representations from only sparse experience with matched cross-modal data. Here we evaluate the capabilities of a neural network architecture inspired by the cognitive notion of a "Global Workspace": a shared representation for two (or more) input modalities. Each modality is processed by a specialized system (pretrained on unimodal data, and subsequently frozen). The corresponding latent representations are then encoded to and decoded from a single shared workspace. Importantly, this architecture is amenable to self-supervised training via cycle-consistency: encoding-decoding sequences should approximate the identity function. For various pairings of vision-language modalities and across two datasets of varying complexity, we show that such an architecture can be trained to align and translate between two modalities with very little need for matched data (from 4 to 7 times less than a fully supervised approach). The global workspace representation can be used advantageously for downstream classification tasks and for robust transfer learning. Ablation studies reveal that both the shared workspace and the self-supervised cycle-consistency training are critical to the system's performance.
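To make the described setup concrete, here is a minimal sketch of the architecture and training objective, assuming PyTorch. All names (GlobalWorkspace, translate, the loss terms and dimensions) are illustrative, not the authors' code; the exact combination of supervised and self-supervised terms in the paper may differ (e.g., a contrastive alignment loss instead of the simple MSE used here).

```python
import torch
import torch.nn as nn

class GlobalWorkspace(nn.Module):
    """Shared workspace bridging two frozen unimodal encoders (vision "v", text "t")."""
    def __init__(self, dim_v, dim_t, dim_gw):
        super().__init__()
        # Per-modality encoders into the shared workspace, and decoders back out.
        # Only these small modules are trained; the unimodal backbones stay frozen.
        self.enc = nn.ModuleDict({"v": nn.Linear(dim_v, dim_gw),
                                  "t": nn.Linear(dim_t, dim_gw)})
        self.dec = nn.ModuleDict({"v": nn.Linear(dim_gw, dim_v),
                                  "t": nn.Linear(dim_gw, dim_t)})

    def translate(self, z, src, tgt):
        # Route one modality's latent through the workspace to the target modality.
        return self.dec[tgt](self.enc[src](z))

def losses(gw, z_v, z_t, paired):
    mse = nn.functional.mse_loss
    # Demi-cycle consistency: encode then decode within one modality (self-supervised,
    # needs no matched pairs).
    l_demi = mse(gw.translate(z_v, "v", "v"), z_v) + mse(gw.translate(z_t, "t", "t"), z_t)
    # Full cycle consistency: v -> t -> v and t -> v -> t should approximate identity
    # (also self-supervised).
    l_cycle = (mse(gw.translate(gw.translate(z_v, "v", "t"), "t", "v"), z_v)
               + mse(gw.translate(gw.translate(z_t, "t", "v"), "v", "t"), z_t))
    l_sup = torch.tensor(0.0)
    if paired:
        # Supervised alignment and translation terms, applied only to the few
        # matched (v, t) pairs available.
        l_sup = (mse(gw.enc["v"](z_v), gw.enc["t"](z_t))
                 + mse(gw.translate(z_v, "v", "t"), z_t)
                 + mse(gw.translate(z_t, "t", "v"), z_v))
    return l_demi + l_cycle + l_sup
```

In this sketch, the cycle and demi-cycle terms can be computed on unpaired unimodal data, which is why only a small fraction of matched cross-modal examples is needed for the supervised term.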