Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control

Video diffusion models have rich world priors, but their use in spatial tasks is limited by poor controllability, spatiotemporally inconsistent results, and entangled scene-camera dynamics. Current approaches, such as per-task fine-tuning or post-process warping, often introduce visual artifacts, fail to generalize, or incur high computational costs. We introduce WorldForge, a training-free framework that operates purely at inference time to address these issues. Our method comprises three complementary components. First, an intra-step refinement loop injects fine-grained motion guidance during the denoising process, iteratively correcting the output so that it adheres to the target camera path. Second, an optical flow-based analysis identifies and isolates motion-related channels within the latent space. This allows the framework to apply guidance selectively, decoupling motion from appearance and preserving visual fidelity. Third, a dual-path guidance strategy adaptively corrects for drift by comparing the guided generation against an unguided reference denoising path, neutralizing artifacts caused by misaligned structural inputs. Together, these components inject precise, trajectory-aligned control without model retraining, achieving accurate motion guidance and photorealistic synthesis. As a plug-and-play, model-agnostic solution, WorldForge generalizes across tasks. Beyond robust zero-shot 3D/4D generation, it supports over a dozen downstream applications, including video editing, stabilization, and virtual try-on. Extensive experiments confirm state-of-the-art performance in trajectory adherence and perceptual quality, outperforming both training-dependent and inference-only baselines.
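To make the interplay of the three components concrete, the following is a minimal NumPy sketch of one guided denoising step. It is a toy illustration under stated assumptions, not the WorldForge implementation: the function names, the `(C, H, W)` latent layout, the precomputed per-channel `motion_scores`, the hard replacement of latents with a warped `target`, and the gap-based blending weight are all hypothetical choices made for this example.

```python
import numpy as np

def select_motion_channels(motion_scores, top_k):
    """Boolean mask over channels marking the top_k highest motion scores.

    Assumption: motion_scores is a precomputed per-channel score, e.g. how
    strongly each latent channel responds to optical flow.
    """
    mask = np.zeros(len(motion_scores), dtype=bool)
    mask[np.argsort(motion_scores)[-top_k:]] = True
    return mask

def guided_denoise_step(x_t, sigma, denoise, target, valid, channel_mask,
                        n_refine=2, max_ref_weight=0.5, rng=None):
    """One denoising step with toy versions of the three components.

    x_t:          noisy latent, shape (C, H, W)
    denoise:      callable (x, sigma) -> predicted clean latent (hypothetical)
    target:       latent warped along the camera path, shape (C, H, W)
    valid:        boolean (H, W) mask where the warp has content
    channel_mask: boolean (C,) mask of motion-related channels
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Guidance is applied only in motion channels and only at valid pixels.
    inject = channel_mask[:, None, None] & valid[None, :, :]
    # Unguided reference prediction, computed from the untouched input.
    x0_ref = denoise(x_t, sigma)
    x = x_t
    for i in range(n_refine):
        x0_pred = denoise(x, sigma)
        # Inject the trajectory target into the selected entries.
        x0_guided = np.where(inject, target, x0_pred)
        # Toy drift correction: lean on the reference where the guided
        # result deviates most from it (an assumed heuristic).
        gap = np.abs(x0_guided - x0_ref)
        w = max_ref_weight * gap / (gap.max() + 1e-8)
        x0 = (1.0 - w) * x0_guided + w * x0_ref
        if i < n_refine - 1:
            # Intra-step refinement: re-noise to the same level and repeat.
            x = x0 + sigma * rng.standard_normal(x0.shape)
    return x0
```

In this sketch, entries outside the selected channels or outside the valid warp region are never overwritten by the target, which mirrors the stated goal of guiding motion while leaving appearance to the model's prior.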
