HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation

Current identity-consistent video generation methods struggle to preserve appearance fidelity under large viewpoint changes. Multi-view reference input offers a natural solution, but progress remains constrained by the lack of effective frameworks for multi-view inputs and by the scarcity of multi-view data. We address these challenges with HarmoView, a robust framework for identity-consistent video generation that integrates multi-view cues through three architectural refinements and a staged training curriculum. First, we introduce Multi-level Feature Injection (MFI) to anchor identity fidelity: by injecting raw ViT features from frontal references alongside text tokens via cross-attention, MFI provides persistent low-level appearance anchors that complement the high-level identity features within DiT blocks, improving identity preservation. Second, we employ learnable proxy tokens to unify heterogeneous reference layouts across single-view and multi-view settings while also resolving the reference-view mismatch problem. Third, we develop Jump-RoPE, which isolates features per identity to reduce identity crosstalk. To activate these structural capabilities while preserving the original generative priors, we propose the Progressive View Curriculum, a four-stage training strategy that uses view dropout to enable a stable transition from vanilla T2V generation to high-fidelity, identity-persistent spatial reasoning. Furthermore, we construct a large-scale multi-view dataset to address data scarcity. Extensive evaluation on our multi-view benchmark, comprising 100 manually curated cases spanning 52 unique identities, demonstrates that HarmoView significantly outperforms open-source baselines and matches leading closed-source engines, achieving state-of-the-art performance in identity-consistent video generation.
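To make the view dropout idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract states only that the curriculum has four stages and uses view dropout; the per-stage probabilities in `STAGE_DROP_PROB`, the rule that the frontal view is always kept, and the function name `apply_view_dropout` are all assumptions introduced for illustration.

```python
import random

# Hypothetical per-stage probability of dropping each non-frontal view.
# These values and their decreasing order are assumptions; the abstract
# does not specify them.
STAGE_DROP_PROB = {1: 1.0, 2: 0.7, 3: 0.4, 4: 0.1}


def apply_view_dropout(views, stage, rng):
    """Return the reference views kept for one training sample.

    `views` maps a view name to a reference image (any object).
    In this sketch the "front" view is always kept (an assumption),
    and every other view is dropped independently with the
    probability assigned to the current stage.
    """
    drop_prob = STAGE_DROP_PROB[stage]
    kept = {}
    for name, image in views.items():
        # rng.random() lies in [0, 1), so drop_prob == 1.0 drops every
        # non-frontal view and drop_prob == 0.0 would keep all of them.
        if name == "front" or rng.random() >= drop_prob:
            kept[name] = image
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    views = {"front": "img_f", "left": "img_l", "right": "img_r", "back": "img_b"}
    for stage in (1, 2, 3, 4):
        print(stage, sorted(apply_view_dropout(views, stage, rng)))
```

Under these assumed probabilities, early stages see mostly single-view (frontal) references and later stages see progressively more complete multi-view sets, which is one plausible way to ease a model from single-view to multi-view conditioning.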
