RefAlign: Representation Alignment for Reference-to-Video Generation

Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
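
The pull/push behaviour of the reference alignment loss can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function name `reference_alignment_loss`, the use of cosine similarity, the hinge-style push term, the `margin` parameter, and the premise that each subject's DiT reference-branch feature has already been pooled and projected to the VFM feature dimension.

```python
import numpy as np


def reference_alignment_loss(ref_feats, vfm_feats, margin=0.0):
    """Hypothetical pull/push alignment loss (not the paper's exact form).

    ref_feats: (N, D) array, one pooled DiT reference-branch feature per
               subject, assumed already projected to the VFM dimension D.
    vfm_feats: (N, D) array, the VFM feature of the same N subjects.
    margin:    cross-subject cosine similarities above this are penalised.
    """
    ref_feats = np.asarray(ref_feats, dtype=float)
    vfm_feats = np.asarray(vfm_feats, dtype=float)

    # L2-normalise rows so that the dot product is a cosine similarity.
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    vfm = vfm_feats / np.linalg.norm(vfm_feats, axis=1, keepdims=True)

    # sim[i, j]: similarity of subject i's reference feature to
    # subject j's VFM feature.
    sim = ref @ vfm.T
    n = sim.shape[0]
    same_subject = np.eye(n, dtype=bool)

    # Pull term: same-subject pairs should reach similarity 1.
    pull = np.mean(1.0 - sim[same_subject])

    # Push term: different-subject pairs should stay at or below the margin.
    if n > 1:
        push = np.mean(np.maximum(0.0, sim[~same_subject] - margin))
    else:
        push = 0.0

    return float(pull + push)


# Two subjects whose reference features match their own VFM features.
aligned = reference_alignment_loss(np.eye(2), np.eye(2))

# Two subjects whose reference features match the *other* subject's VFM feature.
swapped = reference_alignment_loss(np.eye(2), np.eye(2)[::-1])

print(aligned, swapped)  # prints: 0.0 2.0
```

In this sketch the loss is zero when each reference feature matches only its own subject's VFM feature, and it grows when features of different subjects become similar, which is the multi-subject confusion the abstract describes. Because such a term would enter only the training objective, it is consistent with the claim of no inference-time overhead.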
