Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.