Reinforcement Learning-Guided Retrieval with Soft Fusion for Robust Multimodal Imitation Learning under Missing Modalities

Robotic systems perceive the world through multiple input modalities, including visual camera streams and natural language instructions, and must select appropriate actions based on these signals. Assuming that every input device is always available is unrealistic, however: sensors may fail, become occluded, or drop out entirely during deployment. Robust handling of such missing-modality scenarios is therefore essential for real-world robot operation. This paper introduces RL4IL, a reinforcement-learning-guided method for imitation learning that selects an action for a given observation by identifying the most relevant expert demonstrations in a training library. A reinforcement learning policy, trained via Proximal Policy Optimisation over Breadth-First Search candidate sets, ranks the candidate demonstrations, and a soft cross-attention fusion head aggregates their action signals to produce the final prediction. When a modality is missing at inference time, a dedicated per-modality RL retrieval policy identifies donor demonstrations from the training library, and a soft imputation head reconstructs the missing embedding via cross-attention over the top-ranked donors, without requiring any retraining of the system. Experiments on three LIBERO benchmark suites demonstrate that RL4IL substantially outperforms state-of-the-art imitation learning methods under sensor-dropout conditions, while requiring no policy network training. The code is available at https://github.com/h-ismkhan/Reinforcement-Learning-via-kNN-for-Robotic-Learning-with-Missing-Camera
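The soft imputation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `soft_impute`, its parameters, and the toy vectors are hypothetical. It assumes the donors have already been ranked and selected, and it omits the learned projection matrices a trained attention head would normally apply. It only shows how scaled dot-product attention over donor embeddings yields a weighted reconstruction of a missing embedding.

```python
import math


def soft_impute(query, donor_keys, donor_values, temperature=1.0):
    """Rebuild a missing embedding as an attention-weighted mix of donor embeddings.

    query:        embedding of the modality that is still available
    donor_keys:   embeddings of that same modality for each selected donor
    donor_values: embeddings of the missing modality taken from those donors
    (All names are illustrative assumptions, not part of RL4IL's code.)
    """
    if not donor_keys or len(donor_keys) != len(donor_values):
        raise ValueError("need at least one donor and one value per key")

    # Scaled dot-product similarity between the query and each donor key.
    scale = temperature * math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in donor_keys]

    # Numerically stable softmax turns the scores into attention weights.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The imputed embedding is the weighted sum of the donor value vectors.
    dim = len(donor_values[0])
    imputed = [sum(w * v[d] for w, v in zip(weights, donor_values)) for d in range(dim)]
    return imputed, weights


# Toy usage: the first donor matches the query more closely than the second.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
imputed, weights = soft_impute(query, keys, values)
```

Because the weights are soft rather than a hard top-1 choice, every selected donor contributes in proportion to its similarity to the query. The same pattern could in principle aggregate donor action vectors instead of embeddings, which is the role the abstract assigns to the fusion head.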
