Existing Subject-to-Video (S2V) generation methods have achieved high-fidelity and subject-consistent video generation, yet remain constrained to single-view subject references. This limitation renders the S2V task reducible to a subject-to-image (S2I) plus image-to-video (I2V) pipeline, failing to exploit the full potential of video subject control. In this work, we propose and address the challenging Multi-View S2V (MV-S2V) task, which synthesizes videos from multiple reference views to enforce 3D-level subject consistency. To address the scarcity of training data, we first develop a synthetic data curation pipeline that generates highly customized training samples, complemented by a small-scale real-world captured dataset, to boost the training of MV-S2V. Another key issue lies in the potential confusion between cross-subject and cross-view references in conditional generation. To overcome this, we further introduce Temporally Shifted RoPE (TS-RoPE), which distinguishes different subjects from distinct views of the same subject in reference conditioning. Our framework achieves superior 3D subject consistency with respect to the multi-view reference images and high-quality visual outputs, establishing a meaningful new direction for subject-driven video generation. Our project page is available at: https://szy-young.github.io/mv-s2v
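To make the TS-RoPE idea concrete, the sketch below illustrates one plausible way such an index assignment could work; the paper's exact scheme is not given here, so the offset layout and the `subject_stride` / `view_stride` parameters are assumptions for illustration only. The intuition is that video frames keep their natural temporal positions, while reference tokens are shifted beyond the video's temporal range, with a large per-subject shift separating subjects and a smaller per-view shift separating views of the same subject.

```python
import torch

def ts_rope_temporal_indices(num_video_frames, num_subjects, views_per_subject,
                             subject_stride=16, view_stride=1):
    """Hypothetical temporal index assignment under a TS-RoPE-style scheme.

    Video frames keep indices 0..T-1. Each reference image is placed past
    the video range: subjects are separated by a large shift (subject_stride),
    and views of the same subject by a smaller shift (view_stride), so the
    model can tell cross-subject references apart from cross-view ones.
    This is a sketch of the idea, not the paper's actual implementation.
    """
    video_idx = torch.arange(num_video_frames)          # 0 .. T-1 for video tokens
    base = num_video_frames                             # start past the video range
    ref_idx = [base + s * subject_stride + v * view_stride
               for s in range(num_subjects)
               for v in range(views_per_subject)]
    return video_idx, torch.tensor(ref_idx)

# Example: 49 video frames, 2 subjects with 3 reference views each.
video_idx, ref_idx = ts_rope_temporal_indices(49, 2, 3)
print(ref_idx)  # tensor([49, 50, 51, 65, 66, 67])
```

Under this assumed layout, views of the same subject stay close in RoPE phase (encouraging the model to treat them as one identity) while different subjects are pushed far apart, which matches the abstract's stated goal of disambiguating cross-subject from cross-view references.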