Humans perceive the world through multiple senses, enabling them to create a comprehensive representation of their surroundings and to generalize information across domains. For instance, when a textual description of a scene is given, humans can mentally visualize it. In fields like robotics and Reinforcement Learning (RL), agents can also access information about the environment through multiple sensors; yet redundancy and complementarity between sensors are difficult to exploit as a source of robustness (e.g. against sensor failure) or generalization (e.g. transfer across domains). Prior research demonstrated that a robust and flexible multimodal representation can be efficiently constructed based on the cognitive science notion of a 'Global Workspace': a unique representation trained to combine information across modalities, and to broadcast its signal back to each modality. Here, we explore whether such a brain-inspired multimodal representation could be advantageous for RL agents. First, we train a 'Global Workspace' to exploit information collected about the environment via two input modalities (a visual input, or an attribute vector representing the state of the agent and/or its environment). Then, we train an RL agent policy using this frozen Global Workspace. In two distinct environments and tasks, our results reveal the model's ability to perform zero-shot cross-modal transfer between input modalities, i.e. to apply to image inputs a policy previously trained on attribute vectors (and vice versa), without additional training or fine-tuning. Variants and ablations of the full Global Workspace (including a CLIP-like multimodal representation trained via contrastive learning) did not display the same generalization abilities.
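The mechanism behind the zero-shot transfer described above can be illustrated with a toy sketch: each modality gets its own encoder into a shared workspace latent, and a frozen policy reads only that latent, so it is agnostic to which modality produced it. The sketch below uses hypothetical dimensions and linear encoders, and aligns the two encoders by construction; in the actual approach this alignment is learned from data (with broadcast/decoding losses), not hand-built.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only, not from the paper).
D_IMG, D_ATTR, D_GW = 16, 4, 8

# Toy "Global Workspace": one encoder per modality maps into a shared latent.
W_attr_enc = rng.normal(size=(D_GW, D_ATTR))

# For the sketch, "images" are a fixed random linear rendering of attributes,
# and the vision encoder is built to invert that rendering into the same
# latent. In the real model this alignment is *learned* from paired data.
render = rng.normal(size=(D_IMG, D_ATTR))
W_img_enc = W_attr_enc @ np.linalg.pinv(render)

def encode(x, modality):
    """Map an observation from either modality into the shared GW latent."""
    W = W_attr_enc if modality == "attr" else W_img_enc
    return W @ x

# A frozen policy that operates only on the shared GW latent.
W_policy = rng.normal(size=(3, D_GW))
def policy(z):
    return int(np.argmax(W_policy @ z))  # pick one of 3 discrete actions

# Paired observations of the same underlying state.
attr = rng.normal(size=D_ATTR)
img = render @ attr

# Zero-shot cross-modal transfer: the policy yields the same action whether
# the state arrives as an attribute vector or as an image.
assert policy(encode(attr, "attr")) == policy(encode(img, "img"))
```

Because both encoders land in the same latent space, a policy trained on one modality's latents can be applied unchanged to the other's, which is the property the abstract reports.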