Rethinking Implicit Spatial Representation in Visuomotor Policy Learning

Generative model-based imitation learning has become a widely adopted paradigm for robotic manipulation, where policy performance depends critically on the visual representations that condition the policy. Although spatial softmax-based representations have been adopted in prior visuomotor policies, their effectiveness and underlying mechanisms remain insufficiently understood. This work revisits spatial softmax pooling and asks: do such implicit spatial representations provide effective and stable visual features for robotic manipulation? Through systematic studies of different pooling methods in visual encoders, we find that spatial softmax pooling produces compact and stable spatial representations that outperform feature-value representations despite using substantially fewer dimensions. Complementary saliency analysis further suggests that these spatial representations guide the encoder to focus more consistently on task-relevant regions. However, this advantage is limited by a representation bottleneck in current visual encoders: repeated downsampling weakens fine-grained spatial information before the action-generation module can use it, especially under low-resolution observations. Motivated by these findings, we propose PRISM, a visual encoder that preserves multiscale implicit spatial information through top-down cross-attention fusion. Experiments across multiple tasks and policy backbones show consistent improvements. In particular, on the low-resolution, high-precision ToolHang task, PRISM improves the average success rate from 5.0% to 13.4% while increasing the parameter count by only 15.4%. These results support the use of multiscale implicit spatial representations as an effective and efficient design principle for robotic manipulation.
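
To make the distinction between implicit spatial representations and feature-value representations concrete, the following is a minimal sketch of generic spatial softmax pooling on a single feature channel, contrasted with global average pooling. It is an illustration only, not the encoder studied in this work: the function names, the `temperature` parameter, and the [-1, 1] coordinate convention are assumptions made for the example.

```python
import math


def spatial_softmax(feature_map, temperature=1.0):
    """Return the expected (x, y) location of one channel's activations.

    feature_map: list of H rows, each a list of W floats (a single channel).
    Coordinates are normalized to [-1, 1]; this convention is an assumption,
    with (-1, -1) at the top-left cell.
    """
    height, width = len(feature_map), len(feature_map[0])
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(max(row) for row in feature_map)
    weights = [[math.exp((v - peak) / temperature) for v in row]
               for row in feature_map]
    total = sum(sum(row) for row in weights)

    def coord(index, size):
        # Map a cell index in [0, size - 1] to [-1, 1].
        return 2.0 * index / (size - 1) - 1.0 if size > 1 else 0.0

    x = sum(w * coord(j, width)
            for row in weights for j, w in enumerate(row)) / total
    y = sum(w * coord(i, height)
            for i, row in enumerate(weights) for w in row) / total
    return x, y


def global_average(feature_map):
    """Feature-value pooling: one scalar per channel, no location information."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


# The same activation placed at two different locations (hypothetical data).
peak_top_right = [[0.0, 0.0, 50.0],
                  [0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0]]
peak_bottom_left = [[0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0],
                    [50.0, 0.0, 0.0]]

# Spatial softmax tracks where the activation is.
print(spatial_softmax(peak_top_right))
print(spatial_softmax(peak_bottom_left))

# Global average pooling returns the same value for both maps.
print(global_average(peak_top_right), global_average(peak_bottom_left))
```

In this toy case the spatial softmax output moves with the activation, while the averaged feature value is identical for both maps. Each channel contributes two coordinates under spatial softmax, so the output encodes where a feature fires rather than how strongly.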
