For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations.
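The global-to-local reconstruction objective described above can be illustrated with a minimal NumPy sketch. It only traces the data flow: a reference observation is pooled into a single bottleneck vector, a local target crop is heavily masked, and the loss is computed on the masked patches alone. Everything concrete here is an assumption made for illustration and is not taken from the paper: the names `patchify`, `random_mask`, and `global_to_local_loss`, the linear stand-ins for the encoder and decoder, the mean pooling used to form the bottleneck, the 90% mask ratio, and the choice to cut the crop from the reference frame itself.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into (N, p*p*C) flattened non-overlapping patches."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image sides must be divisible by the patch size"
    x = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * c)

def random_mask(n, mask_ratio, rng):
    """Boolean mask of length n; True marks a patch hidden from the model."""
    n_masked = int(round(n * mask_ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_masked, replace=False)] = True
    return mask

def global_to_local_loss(reference, crop, encode, decode, p=4, mask_ratio=0.9, seed=0):
    """Mean squared error on the masked patches of the crop (hypothetical formulation)."""
    rng = np.random.default_rng(seed)
    # Compress the whole reference observation into one bottleneck vector.
    # Mean pooling is an assumed stand-in for a learned bottleneck token.
    bottleneck = encode(patchify(reference, p)).mean(axis=0, keepdims=True)  # (1, D)
    target = patchify(crop, p)                                               # (N, p*p*C)
    mask = random_mask(len(target), mask_ratio, rng)
    visible = encode(target[~mask])                  # sparse visible cues, (n_vis, D)
    context = np.concatenate([bottleneck, visible], axis=0)
    pred = decode(context, len(target))              # predicted pixels for all N patches
    loss = float(np.mean((pred[mask] - target[mask]) ** 2))  # masked patches only
    return loss, mask

# Toy linear encoder and decoder with random weights (stand-ins, not trained models).
rng = np.random.default_rng(0)
D, P, C = 16, 4, 3
W_enc = rng.normal(size=(P * P * C, D)) * 0.1
W_dec = rng.normal(size=(D, P * P * C)) * 0.1

def encode(patches):
    return patches @ W_enc

def decode(context, n_out):
    # Pool the context and emit the same prediction for every patch position.
    return np.tile(context.mean(axis=0) @ W_dec, (n_out, 1))

reference = rng.random((32, 32, C))   # stand-in for a video frame
crop = reference[8:24, 8:24]          # local target crop (assumed to come from the same frame)
loss, mask = global_to_local_loss(reference, crop, encode, decode, p=P)
print(f"masked {mask.sum()} of {mask.size} patches, loss = {loss:.4f}")
```

Because so few crop patches stay visible, a trained decoder in this setup would have to draw most of its information from the bottleneck vector, which is the pressure the abstract credits for pushing identities and locations of scene elements into the compact representation.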