Learning with missing modalities is a fundamental challenge in multimodal robot learning, as real-world robotic systems often operate in environments with incomplete sensor data. Attention-based models are appealing for processing multimodal data because they can handle multiple modalities with a single backbone network. However, most multimodal models assume that all modalities are available during both training and inference, limiting their applicability in robotic perception and decision-making. In this paper, we introduce a multimodal model designed to handle missing modalities during both training and inference. The model is formulated as a conditional variational autoencoder (CVAE) and incorporates a transformer-based architecture that leverages attention mechanisms to learn a unified, fixed-dimensional representation, even when some modalities are missing. We show that our proposed model can be trained with missing modalities while approximating a robust representation of all modalities. We evaluate our approach on five multimodal datasets across two robot learning tasks: human trajectory prediction and robot manipulation forecasting. Experimental results demonstrate that our model effectively learns from incomplete data and is superior to prior multimodal fusion approaches.
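The abstract's claim that attention can yield a unified, fixed-dimensional representation even when some modalities are missing can be illustrated with a minimal sketch. This is not the paper's architecture: it swaps in a single-query masked attention pooling step in NumPy as a stand-in for the transformer, and it omits the CVAE entirely. The names `masked_attention_pool`, `tokens`, `present`, and `query` are hypothetical, and the sketch assumes each modality has already been encoded into one `d`-dimensional token.

```python
import numpy as np


def masked_attention_pool(tokens, present, query):
    """Pool per-modality tokens into one fixed-size vector.

    tokens  : (M, d) array, one embedding per modality (assumed layout)
    present : (M,) boolean array, False where a modality is missing
    query   : (d,) array, a pooling query (random here, learned in practice)
    Returns the pooled (d,) vector and the (M,) attention weights.
    """
    tokens = np.asarray(tokens, dtype=float)
    present = np.asarray(present, dtype=bool)
    if not present.any():
        raise ValueError("at least one modality must be present")

    d = tokens.shape[-1]
    # Scaled dot-product scores between the query and each modality token.
    scores = tokens @ query / np.sqrt(d)
    # Missing modalities get a score of -inf, so their softmax weight is 0.
    scores = np.where(present, scores, -np.inf)
    # Subtract the max over present modalities for numerical stability.
    scores = scores - scores[present].max()
    weights = np.exp(scores)
    weights = weights / weights.sum()
    # Zero out missing rows so placeholder values (even NaN) cannot leak in.
    safe_tokens = np.where(present[:, None], tokens, 0.0)
    return weights @ safe_tokens, weights


# Usage: three modalities, the second one missing.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4))
query = rng.normal(size=4)
pooled, weights = masked_attention_pool(tokens, [True, False, True], query)
print(pooled.shape, weights)
```

The output has shape `(d,)` regardless of which modalities are present, and the masked modality receives zero weight, so whatever placeholder fills its slot has no effect on the result.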