Unified Multimodal Models (UMMs) integrate multimodal understanding and generation, yet they remain limited in maintaining visual consistency and disambiguating visual cues when referencing details across multiple input images. In this work, we propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to a variable number of inputs. Algorithmically, we introduce two innovations: 1) learnable latent separators explicitly differentiate each reference image in the latent space, enabling accurate and disentangled conditioning; 2) a sinusoidal index encoding assigns visual tokens from the same image a continuous sinusoidal index embedding, which provides explicit image identity while allowing generalization and extrapolation to a variable number of inputs. To facilitate training and evaluation, we establish a high-fidelity benchmark built with an inverse dataset-construction methodology that guarantees artifact-free, achievable outputs. Experiments on diverse multi-image editing tasks show clear improvements in semantic consistency, visual fidelity, and cross-image integration over prior baselines, validating our advantages in consistency and generalization.
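The following is a minimal sketch, not the paper's implementation, of how the two ideas could be wired together in a PyTorch-style token pipeline: a learnable separator token is appended after each image's visual tokens, and every token of image k receives the same sinusoidal embedding of the index k. The module name `MultiImageConditioner`, the helper `sinusoidal_index_embedding`, and all shapes are illustrative assumptions.

```python
# Hypothetical sketch of latent separators + sinusoidal index encoding (not the authors' code).
import math
import torch
import torch.nn as nn


def sinusoidal_index_embedding(index: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of a scalar image index (assumed formula,
    analogous to positional encodings but applied once per image)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = index * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)])  # (dim,)


class MultiImageConditioner(nn.Module):
    """Concatenates per-image visual tokens, inserting a learnable latent
    separator after each image and adding a shared index embedding to every
    token of that image (illustrative module, names are assumptions)."""

    def __init__(self, dim: int):
        super().__init__()
        # One learnable separator token, shared across all image boundaries.
        self.separator = nn.Parameter(torch.randn(1, dim) * 0.02)

    def forward(self, image_tokens: list[torch.Tensor]) -> torch.Tensor:
        # image_tokens: list of (num_tokens_i, dim) tensors, one per reference image.
        dim = image_tokens[0].shape[-1]
        pieces = []
        for k, tokens in enumerate(image_tokens):
            idx_emb = sinusoidal_index_embedding(k, dim).to(tokens)
            pieces.append(tokens + idx_emb)   # same index embedding for the whole image
            pieces.append(self.separator)     # learnable separator marks the image boundary
        return torch.cat(pieces, dim=0)       # (sum_i num_tokens_i + num_images, dim)


# Usage: three reference images with different token counts, 64-dim tokens.
cond = MultiImageConditioner(dim=64)
seq = cond([torch.randn(16, 64), torch.randn(16, 64), torch.randn(9, 64)])
print(seq.shape)  # torch.Size([44, 64])
```

Because the index embedding is a fixed continuous function of k rather than a learned table, the same formula can in principle be evaluated for image counts not seen during training, which is the extrapolation property the abstract refers to.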