Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Compared with text-to-video models, existing I2V pipelines are especially prone to appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and to weak cross-modal alignment. We address this problem from both the data and the model perspective. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, a novel benchmarking and evaluation framework for multi-view consistency with metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues through a dual-stream visual-geometric encoder and a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments on ConsIDVid-Bench show that ConsID-Gen consistently outperforms prior methods across multiple metrics, achieving the best overall performance and surpassing leading video generation models such as Wan2.1 and HunyuanVideo, with superior identity fidelity and temporal coherence in challenging real-world scenarios. We will release our model and dataset at https://myangwu.github.io/ConsID-Gen.