Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present $\textbf{I}$nter-token $\textbf{Con}$trast ($\textbf{ICon}$), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, yielding agent-centric visual representations that embed body-specific inductive biases. The framework integrates seamlessly into end-to-end policy learning by adding the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy performance across a variety of manipulation tasks but also facilitates policy transfer across different robots. Project website: https://inter-token-contrast.github.io/icon/
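As a rough illustration of the idea, the inter-token contrast can be sketched as an InfoNCE-style objective over ViT patch tokens, where tokens belonging to the agent's body are pulled together and pushed apart from environment tokens. The function name, the exact loss formulation, and the source of the agent/environment mask below are assumptions for illustration, not the paper's precise objective.

```python
import numpy as np

def icon_style_loss(tokens, agent_mask, temperature=0.1):
    """Hypothetical sketch of an inter-token contrastive loss.

    tokens:     (N, D) array of ViT patch-token features.
    agent_mask: (N,) boolean array, True for agent (robot-body) tokens.

    Tokens from the same group (agent/agent or env/env) are treated as
    positives; cross-group pairs act as negatives, so the loss separates
    agent-specific from environment-specific tokens in feature space.
    """
    # L2-normalize so dot products become cosine similarities.
    feats = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = feats @ feats.T / temperature        # (N, N) similarity logits
    np.fill_diagonal(sim, -np.inf)             # exclude self-pairs

    # Positive pairs: both tokens in the same group.
    same_group = agent_mask[:, None] == agent_mask[None, :]
    np.fill_diagonal(same_group, False)

    # Per-anchor log-softmax over all other tokens.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

    # Average log-likelihood of the positives for each anchor token.
    pos_log_prob = (np.where(same_group, log_prob, 0.0).sum(axis=1)
                    / same_group.sum(axis=1))
    return -pos_log_prob.mean()
```

A correct agent mask should give a lower loss than a mismatched one on well-separated features, which is the signal the auxiliary objective would backpropagate into the ViT during policy learning.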