We present CyCLeGen, a unified vision-language foundation model capable of both image understanding and image generation within a single autoregressive framework. Unlike existing vision models that depend on separate modules for perception and synthesis, CyCLeGen adopts a fully integrated architecture that enforces cycle-consistent learning through image->layout->image and layout->image->layout generation loops. This unified formulation introduces two key advantages: introspection, enabling the model to reason about its own generations, and data efficiency, allowing self-improvement via synthetic supervision under a reinforcement learning objective guided by cycle consistency. Extensive experiments show that CyCLeGen achieves significant gains across diverse image understanding and generation benchmarks, highlighting the potential of unified vision-language foundation models.
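To make the layout->image->layout loop concrete, the following is a minimal sketch of how a cycle-consistency score could serve as a scalar reward. The abstract does not specify the reward, so everything here is an assumption for illustration: the layout representation (a list of labeled boxes), the greedy same-label matching, the use of mean IoU, and the hypothetical names `iou` and `layout_cycle_reward`. The image generation and layout prediction steps are not modeled; the function only compares a source layout with the layout recovered from the generated image.

```python
from typing import List, Tuple

# Assumed representation: a box is (x0, y0, x1, y1); a layout is a list of
# (label, box) pairs. Neither comes from the paper.
Box = Tuple[float, float, float, float]
Layout = List[Tuple[str, Box]]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def layout_cycle_reward(source: Layout, recovered: Layout) -> float:
    """Hypothetical reward in [0, 1] for a layout->image->layout cycle.

    Each source box is greedily matched to the unused recovered box with the
    same label and the highest IoU. Dividing by the larger layout size
    penalizes both missing and spurious objects.
    """
    if not source and not recovered:
        return 1.0
    unused = list(recovered)
    total = 0.0
    for label, box in source:
        candidates = [
            (iou(box, other_box), i)
            for i, (other_label, other_box) in enumerate(unused)
            if other_label == label
        ]
        if not candidates:
            continue  # object lost in the cycle contributes zero
        best_iou, best_index = max(candidates)
        total += best_iou
        unused.pop(best_index)
    return total / max(len(source), len(recovered))
```

A perfectly reconstructed layout scores 1.0 and a layout with no recovered objects scores 0.0. In a reinforcement learning setup this scalar would play the role of the cycle-consistency signal; an analogous image-level similarity would be needed for the image->layout->image direction.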