Robotic imitation learning typically requires models that capture multimodal action distributions while operating at real-time control rates and accommodating multiple sensing modalities. Although recent generative approaches such as diffusion models, flow matching, and Implicit Maximum Likelihood Estimation (IMLE) have achieved promising results, they often satisfy only a subset of these requirements. To address this, we introduce PRISM, a single-pass policy based on a batch-global rejection-sampling variant of IMLE. PRISM couples a temporal multisensory encoder (integrating RGB, depth, tactile, audio, and proprioception) with a Performer-based linear-attention generator. We demonstrate the efficacy of PRISM on a diverse real-world hardware suite, including loco-manipulation with a Unitree Go2 equipped with a 7-DoF D1 arm and tabletop manipulation with a UR5 manipulator. Across challenging physical tasks such as pre-manipulation parking, high-precision insertion, and multi-object pick-and-place, PRISM outperforms state-of-the-art diffusion policies by 10-25% in success rate while maintaining high-frequency (30-50 Hz) closed-loop control. We further validate our approach on large-scale simulation benchmarks, including CALVIN, MetaWorld, and Robomimic. On CALVIN (10% data split), PRISM improves success rates by approximately 25% over diffusion and approximately 20% over flow matching, while simultaneously reducing trajectory jerk by 20-50x. These results position PRISM as a fast, accurate, and multisensory imitation policy that retains multimodal action coverage without the latency of iterative sampling.
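To make the core training idea concrete, the following is a minimal NumPy sketch of an IMLE-style objective with a batch-global sample pool: the generator emits one shared pool of candidate actions, each datum is matched to its nearest candidate, and only the matched candidates are pulled toward the data. This is an illustrative toy, not PRISM's implementation; the linear generator, pool size, and learning rate are assumptions, and the paper's rejection-sampling refinement is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multimodal action data: two well-separated modes in 2-D.
data = np.concatenate([
    rng.normal(loc=[-2.0, 0.0], scale=0.3, size=(64, 2)),
    rng.normal(loc=[+2.0, 0.0], scale=0.3, size=(64, 2)),
])

# Linear generator g(z) = z @ W + b, a toy stand-in for the
# Performer-based generator described in the abstract.
W = rng.normal(scale=0.1, size=(4, 2))
b = np.zeros(2)

def imle_step(W, b, batch, pool_size=256, lr=0.05):
    """One IMLE update with a batch-global pool: every datum selects
    its nearest sample from a single shared pool of candidates."""
    z = rng.normal(size=(pool_size, 4))
    samples = z @ W + b                                   # (pool, 2)
    # Pairwise squared distances: datum i -> pool sample j.
    d2 = ((batch[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                           # batch-global match
    # Gradient of the mean squared matched distance w.r.t. W and b.
    resid = samples[nearest] - batch                      # (n, 2)
    gW = z[nearest].T @ resid / len(batch)
    gb = resid.mean(axis=0)
    loss = d2[np.arange(len(batch)), nearest].mean()
    return W - lr * gW, b - lr * gb, loss

loss0 = None
for step in range(300):
    W, b, loss = imle_step(W, b, data)
    if step == 0:
        loss0 = loss
```

Because each datum only needs *some* generated sample nearby, the objective rewards covering every mode of the action distribution rather than averaging across modes, and inference remains a single forward pass through the generator.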