How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at a specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal detail. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world models that understand how events unfold over time.