Video is a scalable source of observations of physical dynamics: it captures how objects move, how contact unfolds, and how scenes evolve under interaction -- all without requiring robot action labels. Yet translating this temporal structure into reliable robotic control remains an open challenge, because video lacks action supervision and differs from robot experience in embodiment, viewpoint, and physical constraints. This survey reviews methods that exploit temporal video without action annotations to learn control interfaces for robotic manipulation. We introduce an interface-centric taxonomy organized by where the video-to-control interface is constructed and what control properties it enables, identifying three families: direct video-action policies, which keep the interface implicit; latent-action methods, which route temporal structure through a compact learned intermediate; and explicit visual interfaces, which predict interpretable targets for downstream control. For each family, we analyze control-integration properties -- how the loop is closed, what can be verified before execution, and where failures enter. A cross-family synthesis reveals that the most pressing open challenges center on the robotics integration layer -- the mechanisms that connect video-derived predictions to dependable robot behavior -- and we outline research directions toward closing this gap.