UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage that internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon in which a model accurately interprets multimodal inputs but struggles to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple self-improvement framework that requires neither external data nor teacher supervision. UniCorn partitions a single UMM into three collaborative roles (Proposer, Solver, and Judge), generates high-quality interactions via self-play, and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text-to-Image-to-Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves state-of-the-art (SOTA) performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle, while delivering further gains of +5.0 on WISE and +6.5 on OneIG. These results show that our method significantly enhances text-to-image (T2I) generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
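As a rough illustration of the two mechanisms named above, the following Python sketch shows one self-play round in which a single model plays all three roles, followed by a Text-to-Image-to-Text cycle check. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the best-of-N selection rule, the token-level F1 used as the consistency score, and the toy string-based stand-ins for the model (a real UMM would return images and learned judgments).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    prompt: str
    image: str    # stand-in for a generated image (hypothetical representation)
    score: float  # Judge's rating of how well the image matches the prompt


def self_play_round(
    propose: Callable[[int], str],
    solve: Callable[[str, int], str],
    judge: Callable[[str, str], float],
    num_prompts: int,
    candidates_per_prompt: int,
) -> List[Sample]:
    """Assumed scheme: for each proposed prompt, keep the best-rated candidate."""
    kept = []
    for i in range(num_prompts):
        prompt = propose(i)  # Proposer role
        candidates = [solve(prompt, k) for k in range(candidates_per_prompt)]  # Solver role
        scored = [Sample(prompt, img, judge(prompt, img)) for img in candidates]  # Judge role
        kept.append(max(scored, key=lambda s: s.score))
    return kept


def token_f1(reference: str, reconstruction: str) -> float:
    """Token-set F1 between two texts (an assumed proxy, not the UniCycle metric)."""
    ref = set(reference.lower().split())
    rec = set(reconstruction.lower().split())
    overlap = len(ref & rec)
    if overlap == 0:
        return 0.0
    precision = overlap / len(rec)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def cycle_consistency(
    prompt: str,
    generate: Callable[[str], str],
    caption: Callable[[str], str],
) -> float:
    """Text -> Image -> Text: compare the original prompt with the re-captioned text."""
    return token_f1(prompt, caption(generate(prompt)))


# Toy stand-ins so the sketch runs without a model (all hypothetical).
PROMPTS = ["a red cube", "a blue ball"]
toy_propose = lambda i: PROMPTS[i % len(PROMPTS)]
toy_solve = lambda prompt, k: prompt if k % 2 == 0 else "noise"
toy_judge = lambda prompt, image: 1.0 if image == prompt else 0.0

best = self_play_round(toy_propose, toy_solve, toy_judge, 2, 3)
```

In a real setting the retained samples would serve as self-generated training data, and the cycle score would be aggregated over a benchmark prompt set; both steps are omitted because the abstract does not specify them.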
