Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians

Humans excel at forecasting the future dynamics of a scene from a single image. Video generation models that can mimic this ability are an essential component of intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as the ability to modify the camera path, which limits their real-world applicability. Most existing camera-controlled image-to-video models struggle to model camera motion accurately, maintain temporal consistency, and preserve geometric integrity. Explicit intermediate 3D representations offer a promising solution by enabling coherent video generation aligned with a given camera trajectory. Such methods typically render the scene from a 3D point cloud and introduce object motion in a later stage; this two-step process allows precise control over camera movement but still falls short of full temporal consistency. We propose a novel framework that, given a single image, constructs a 3D Gaussian scene representation and samples plausible object motion in a single forward pass. This enables fast, camera-guided video generation without iterative denoising to inject object motion into rendered frames. Extensive experiments on the KITTI, Waymo, RealEstate10K, and DL3DV-10K datasets demonstrate that our method achieves state-of-the-art video quality and inference efficiency. The project page is available at https://melonienimasha.github.io/Pixel-to-4D-Website.

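To make the idea of rendering a dynamic 3D Gaussian scene along a user-specified camera trajectory more concrete, here is a minimal toy sketch. It is not the paper's method: the abstract does not specify a motion model or a renderer, so the constant per-frame velocity, the rotation-free pinhole camera, and every name below (`Gaussian`, `mean_at`, `project`, `render_centers`) are illustrative assumptions. Only the Gaussian centers are projected; covariances, opacities, and colors are omitted.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    mean: tuple      # (x, y, z) center in world coordinates at frame 0
    velocity: tuple  # ASSUMPTION: constant displacement per frame (illustrative only)


def mean_at(g, t):
    """Center of the Gaussian at frame t under the constant-velocity assumption."""
    return tuple(m + t * v for m, v in zip(g.mean, g.velocity))


def project(point, cam_pos, focal=500.0, cx=320.0, cy=240.0):
    """Pinhole projection for a camera at cam_pos looking down +z, with no rotation."""
    x, y, z = (p - c for p, c in zip(point, cam_pos))
    if z <= 0:
        return None  # point lies behind the camera
    return (focal * x / z + cx, focal * y / z + cy)


def render_centers(gaussians, camera_path):
    """For each frame, move every Gaussian and project its center into that frame's camera."""
    return [
        [project(mean_at(g, t), cam) for g in gaussians]
        for t, cam in enumerate(camera_path)
    ]


# Hypothetical scene: one static Gaussian and one moving to the right.
scene = [
    Gaussian((0.0, 0.0, 10.0), (0.0, 0.0, 0.0)),
    Gaussian((1.0, 0.0, 10.0), (0.5, 0.0, 0.0)),
]
# Hypothetical camera path: the camera advances one unit along +z per frame.
path = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 2.0)]
frames = render_centers(scene, path)
```

Because the scene state and the camera pose are both explicit per-frame inputs, changing `path` alters the viewpoint without touching the object motion, which is the kind of decoupling an explicit 3D representation is meant to provide.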