In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding. It generates images that follow the input layout accurately and remain faithful to the text description (e.g., spatial relationships), while also grounding images robustly. Our premise is that image grounding requires strong text and layout understanding, which can compensate for the corresponding weaknesses of layout-to-image generation. Conversely, images generated from layouts are highly diverse in content, which improves the robustness of image grounding. Jointly training both tasks within a unified model can therefore improve each. However, we find that this joint training paradigm faces several optimization challenges that limit performance. To address them, we propose a progressive three-stage training strategy. First, the Parallel Multi-Task Pre-training (PMTP) stage equips the model with basic abilities for both tasks, leveraging shared tokens to accelerate training. Next, the Dual Joint Optimization (DJO) stage exploits task duality to sequentially integrate the two tasks, enabling unified optimization. Finally, the Cycle RL stage removes the reliance on visual supervision by using consistency constraints as rewards, and significantly enhances the model's unified capabilities via Group Relative Policy Optimization (GRPO). Extensive experiments demonstrate state-of-the-art results on both layout-to-image generation and image grounding benchmarks, and reveal clear synergistic gains from optimizing the two tasks together.
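To make the Cycle RL idea concrete, the following is a minimal sketch of how a consistency constraint could serve as a reward and feed a GRPO-style group-relative advantage. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the reward, so the mean-IoU reward between the input layout boxes and the boxes grounded back from the generated image is assumed here, and the names `iou`, `cycle_reward`, and `group_relative_advantages` are hypothetical.

```python
from statistics import mean, pstdev

# A box is (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cycle_reward(layout: list[Box], grounded: list[Box]) -> float:
    """ASSUMED reward: mean IoU between each input layout box and the box
    grounded back from the generated image (layout -> image -> layout).
    No ground-truth image is needed, only the layout itself."""
    if not layout or len(layout) != len(grounded):
        return 0.0
    return mean(iou(a, b) for a, b in zip(layout, grounded))


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage: each sample's reward is normalized by the
    mean and standard deviation of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    layout = [(0.0, 0.0, 2.0, 2.0)]
    # Boxes grounded from three hypothetical samples of the same prompt.
    samples = [[(0.0, 0.0, 2.0, 2.0)], [(1.0, 1.0, 3.0, 3.0)], [(5.0, 5.0, 6.0, 6.0)]]
    rewards = [cycle_reward(layout, g) for g in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch, samples whose grounded boxes agree better with the input layout receive positive advantages relative to the rest of their group, which is the sense in which a consistency signal can replace visual supervision.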