Humans perceive the world through multiple senses, enabling them to create a comprehensive representation of their surroundings and to generalize information across domains. For instance, when a textual description of a scene is given, humans can mentally visualize it. In fields like robotics and Reinforcement Learning (RL), agents can also access information about the environment through multiple sensors; yet redundancy and complementarity between sensors are difficult to exploit as a source of robustness (e.g. against sensor failure) or generalization (e.g. transfer across domains). Prior research demonstrated that a robust and flexible multimodal representation can be efficiently constructed based on the cognitive science notion of a 'Global Workspace': a unique representation trained to combine information across modalities, and to broadcast its signal back to each modality. Here, we explore whether such a brain-inspired multimodal representation could be advantageous for RL agents. First, we train a 'Global Workspace' to exploit information collected about the environment via two input modalities (a visual input, or an attribute vector representing the state of the agent and/or its environment). Then, we train an RL agent policy using this frozen Global Workspace. In two distinct environments and tasks, our results reveal the model's ability to perform zero-shot cross-modal transfer between input modalities, i.e. to apply to image inputs a policy previously trained on attribute vectors (and vice versa), without additional training or fine-tuning. Variants and ablations of the full Global Workspace (including a CLIP-like multimodal representation trained via contrastive learning) did not display the same generalization abilities.
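The mechanism behind the zero-shot transfer described above can be illustrated with a toy sketch: each modality gets its own encoder into a shared workspace latent, and a frozen policy reads only that latent, so it is agnostic to which modality produced it. The sketch below uses hypothetical dimensions and linear encoders, and aligns the two encoders by construction; in the actual approach this alignment is learned from data (with broadcast/decoding losses), not hand-built.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only, not from the paper).
D_IMG, D_ATTR, D_GW = 16, 4, 8

# Toy "Global Workspace": one encoder per modality maps into a shared latent.
W_attr_enc = rng.normal(size=(D_GW, D_ATTR))

# For the sketch, "images" are a fixed random linear rendering of attributes,
# and the vision encoder is built to invert that rendering into the same
# latent. In the real model this alignment is *learned* from paired data.
render = rng.normal(size=(D_IMG, D_ATTR))
W_img_enc = W_attr_enc @ np.linalg.pinv(render)

def encode(x, modality):
    """Map an observation from either modality into the shared GW latent."""
    W = W_attr_enc if modality == "attr" else W_img_enc
    return W @ x

# A frozen policy that operates only on the shared GW latent.
W_policy = rng.normal(size=(3, D_GW))
def policy(z):
    return int(np.argmax(W_policy @ z))  # pick one of 3 discrete actions

# Paired observations of the same underlying state.
attr = rng.normal(size=D_ATTR)
img = render @ attr

# Zero-shot cross-modal transfer: the policy yields the same action whether
# the state arrives as an attribute vector or as an image.
assert policy(encode(attr, "attr")) == policy(encode(img, "img"))
```

Because both encoders land in the same latent space, a policy trained on one modality's latents can be applied unchanged to the other's, which is the property the abstract reports.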