Existing Subject-to-Video Generation (S2V) methods have achieved high-fidelity, subject-consistent video generation, yet they remain constrained to single-view subject references. This limitation effectively reduces the S2V task to an S2I + I2V pipeline, failing to exploit the full potential of video subject control. In this work, we propose and address the challenging Multi-View S2V (MV-S2V) task, which synthesizes videos from multiple reference views to enforce 3D-level subject consistency. To address the scarcity of training data, we first develop a data curation pipeline that generates highly customized synthetic data, complemented by a small-scale real-world captured dataset, to strengthen MV-S2V training. Another key issue lies in the potential confusion between cross-subject and cross-view references in conditional generation. To overcome this, we further introduce Temporally Shifted RoPE (TS-RoPE), which distinguishes between different subjects and between distinct views of the same subject in reference conditioning. Our framework achieves superior 3D subject consistency with respect to the multi-view reference images, along with high-quality visual outputs, establishing a meaningful new direction for subject-driven video generation. Code and data are available at: https://szy-young.github.io/mv-s2v
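The abstract names TS-RoPE but does not detail its mechanics. As a rough illustration only, the sketch below shows one plausible way temporally shifted rotary positions could separate reference views in conditioning, assuming the scheme simply offsets reference frames into disjoint per-subject temporal ranges away from the video frames. The function names (`rope_1d`, `temporally_shifted_positions`), the negative-offset layout, and the `subject_gap` parameter are hypothetical illustrations, not the paper's actual design.

```python
import torch

def rope_1d(positions: torch.Tensor, dim: int, base: float = 10000.0):
    """Standard 1D rotary embedding: returns (cos, sin) tables of shape
    (len(positions), dim // 2) for the given integer positions."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions.float().unsqueeze(-1) * freqs  # (N, dim // 2)
    return angles.cos(), angles.sin()

def temporally_shifted_positions(num_video_frames: int,
                                 refs_per_subject: list[int],
                                 subject_gap: int = 100):
    """Hypothetical TS-RoPE position layout (an assumption, not the paper's
    scheme): video frames occupy temporal positions 0..T-1, while reference
    views are shifted into disjoint negative ranges, one range per subject.
    Views of the same subject thus stay adjacent, views of different
    subjects sit in well-separated ranges, and no reference position ever
    collides with a video position."""
    video_pos = torch.arange(num_video_frames)
    ref_pos = []
    for s, n_views in enumerate(refs_per_subject):
        start = -(s + 1) * subject_gap  # one shifted block per subject
        ref_pos.append(torch.arange(start, start + n_views))
    return video_pos, torch.cat(ref_pos)

# Example: a 16-frame video conditioned on 2 subjects with 3 views each.
video_pos, ref_pos = temporally_shifted_positions(16, [3, 3])
cos_v, sin_v = rope_1d(video_pos, dim=64)  # rotary tables for video tokens
cos_r, sin_r = rope_1d(ref_pos, dim=64)    # rotary tables for reference tokens
```

Under these assumptions, attention between tokens sees a small relative temporal distance for views of the same subject and a large one across subjects, which is one simple way positional shifts alone could disambiguate cross-view from cross-subject references.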