We present UniRef-Image-Edit, a high-performance multi-modal generation system that unifies single-image editing and multi-image composition within a single framework. Existing diffusion-based editing methods often struggle to maintain consistency across multiple conditions due to limited interaction between reference inputs. To address this, we introduce Sequence-Extended Latent Fusion (SELF), a unified input representation that dynamically serializes multiple reference images into a coherent latent sequence. During a dedicated training stage, all reference images are jointly resized to fit within a fixed-length sequence under a global pixel budget. Building upon SELF, we propose a two-stage training framework comprising supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we jointly train on single-image editing and multi-image composition tasks to establish a robust generative prior. We adopt a progressive sequence-length training strategy, in which all input images are initially resized to a total pixel budget of $1024^2$, which is then gradually increased to $1536^2$ and $2048^2$ to improve visual fidelity and cross-reference consistency. This gradual relaxation of compression enables the model to incrementally capture finer visual details while maintaining stable alignment across references. For the RL stage, we introduce Multi-Source GRPO (MSGRPO), to our knowledge the first reinforcement learning framework tailored for multi-reference image generation. MSGRPO optimizes the model to reconcile conflicting visual constraints, significantly enhancing compositional consistency. We will open-source the code, models, training data, and reward data for community research purposes.
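The global pixel-budget constraint can be sketched as follows. This is a minimal illustrative helper, not the paper's actual implementation: it assumes that when the combined pixel count of all reference images exceeds the budget, every image is downscaled by a single uniform factor so that the total fits the budget while each aspect ratio is preserved. The function name `fit_to_pixel_budget` and the example sizes are hypothetical.

```python
import math

def fit_to_pixel_budget(sizes, budget):
    """Scale (width, height) pairs so their total pixel count fits a shared
    budget, preserving each image's aspect ratio.

    Illustrative sketch of a global pixel-budget constraint; not the
    paper's actual implementation.
    """
    total = sum(w * h for w, h in sizes)
    if total <= budget:
        return list(sizes)  # already within budget; leave sizes unchanged
    # A uniform linear scale s shrinks the total pixel count by s**2,
    # so s = sqrt(budget / total) brings the total down to the budget.
    s = math.sqrt(budget / total)
    return [(max(1, round(w * s)), max(1, round(h * s))) for w, h in sizes]

# Progressive schedule from the SFT stage: the budget is relaxed from
# 1024**2 to 1536**2 and then 2048**2, so later stages compress less.
for side in (1024, 1536, 2048):
    resized = fit_to_pixel_budget([(2048, 1536), (1024, 1024)], side * side)
    print(side, resized)
```

Under this sketch, relaxing the budget directly reduces the downscaling factor applied to every reference, which matches the abstract's description of gradually capturing finer visual detail as the sequence length grows.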