Video representation is a long-standing problem that is crucial for various downstream tasks, such as tracking, depth prediction, segmentation, view synthesis, and editing. However, current methods either struggle to model complex motions due to the absence of 3D structure or rely on implicit 3D representations that are ill-suited for manipulation tasks. To address these challenges, we introduce a novel explicit 3D representation, the video Gaussian representation, which embeds a video into a set of 3D Gaussians. Our representation models video appearance in a 3D canonical space using explicit Gaussians as proxies and associates each Gaussian with a 3D motion to capture video dynamics. This approach offers a more intrinsic and explicit representation than layered atlases or volumetric pixel matrices. To obtain such a representation, we distill 2D priors, such as optical flow and depth, from foundation models to regularize learning in this ill-posed setting. Extensive applications demonstrate the versatility of our new video representation: it has proven effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation. Project page: https://sunyangtian.github.io/spatter_a_video_web/
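The abstract describes a representation in which each explicit 3D Gaussian carries canonical appearance attributes plus a per-Gaussian 3D motion. As a minimal sketch of this idea, the hypothetical `VideoGaussian` class below stores a canonical center and models its displacement over time with polynomial coefficients; the paper's actual motion parameterization is not specified in the abstract, so this basis choice is an illustrative assumption.

```python
import numpy as np

class VideoGaussian:
    """Hypothetical sketch: a canonical 3D Gaussian whose center is
    displaced over time by a per-Gaussian motion model (assumed here
    to be a polynomial in normalized time t, for illustration only)."""

    def __init__(self, mu, scale, color, motion_coeffs):
        self.mu = np.asarray(mu, dtype=float)        # canonical 3D center
        self.scale = np.asarray(scale, dtype=float)  # per-axis extent
        self.color = np.asarray(color, dtype=float)  # RGB appearance
        # motion_coeffs: (K, 3) coefficients of a degree-K polynomial
        # displacement, so position(0) recovers the canonical center.
        self.motion_coeffs = np.asarray(motion_coeffs, dtype=float)

    def position(self, t):
        """Center at normalized time t in [0, 1]:
        mu + sum_k c_k * t^(k+1)."""
        K = self.motion_coeffs.shape[0]
        basis = np.array([t ** (k + 1) for k in range(K)])  # shape (K,)
        return self.mu + basis @ self.motion_coeffs

# A Gaussian moving linearly along x over the clip's duration.
g = VideoGaussian(mu=[0.0, 0.0, 1.0], scale=[0.1, 0.1, 0.1],
                  color=[1.0, 0.0, 0.0],
                  motion_coeffs=[[0.5, 0.0, 0.0]])
print(g.position(0.0))  # canonical center [0. 0. 1.]
print(g.position(1.0))  # displaced center [0.5 0. 1.]
```

Under this sketch, 2D priors such as optical flow would supervise the projected displacement of `position(t)` between frames, and monocular depth would constrain the z-component, regularizing the otherwise ill-posed fit.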