We present Kaleido, a subject-to-video~(S2V) generation framework that synthesizes subject-consistent videos conditioned on multiple reference images of target subjects. Despite recent progress in S2V generation, existing approaches struggle to maintain multi-subject consistency and to disentangle subjects from their backgrounds, often resulting in reduced reference fidelity and semantic drift under multi-image conditioning. These shortcomings can be attributed to several factors. First, training datasets lack diversity, high-quality samples, and cross-paired data, i.e., paired samples whose components originate from different instances. Second, current mechanisms for integrating multiple reference images are suboptimal, which can lead to confusion among subjects. To overcome these limitations, we propose a dedicated data construction pipeline, incorporating low-quality sample filtering and diverse data synthesis, to produce consistency-preserving training data. Moreover, we introduce Reference Rotary Positional Encoding (R-RoPE) to process reference images, enabling stable and precise multi-image integration. Extensive experiments across numerous benchmarks demonstrate that Kaleido significantly outperforms previous methods in consistency, fidelity, and generalization, marking an advance in S2V generation.
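The abstract names R-RoPE but does not define it. As background context only, the following is a minimal NumPy sketch of standard rotary positional encoding (RoPE), on which R-RoPE builds; the function name, shapes, and the idea of assigning reference-image tokens their own position indices are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply standard rotary positional encoding (RoPE) to token embeddings.

    x: (seq_len, dim) array of token embeddings, dim must be even.
    positions: (seq_len,) integer positions. A reference-aware scheme in
    the spirit of R-RoPE would assign reference-image tokens positions
    distinct from the video tokens' positions; the indexing rule is an
    illustrative assumption here, not the paper's method.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "RoPE rotates coordinate pairs, so dim must be even"
    half = dim // 2
    # Per-pair rotation frequencies: base^(-i/half) for i = 0..half-1.
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = np.outer(positions, freqs)              # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    # Rotate each (x1_i, x2_i) pair by its position-dependent angle.
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because RoPE is a pure rotation, it preserves embedding norms and makes attention scores depend on relative positions, which is why extending it to reference tokens is a natural way to keep multiple subjects spatially distinguishable.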