Text-to-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation, a hallmark of human creativity. Current reasoning-augmented paradigms mostly rely on explicit thought processes, in which intermediate reasoning is decoded into discrete text at fixed steps with frequent image decoding and re-encoding, leading to inefficiency, information loss, and cognitive mismatch. To bridge this gap, we introduce LatentMorph, a novel framework that seamlessly integrates implicit latent reasoning into the T2I generation process. At its core, LatentMorph comprises four lightweight components: (i) a condenser that summarizes intermediate generation states into compact visual memory, (ii) a translator that converts latent thoughts into actionable guidance, (iii) a shaper that dynamically steers next-image-token predictions, and (iv) an RL-trained invoker that adaptively decides when to invoke reasoning. By performing reasoning entirely in continuous latent spaces, LatentMorph avoids the bottlenecks of explicit reasoning and enables more adaptive self-refinement. Extensive experiments demonstrate that LatentMorph (I) improves the base model Janus-Pro by $16\%$ on GenEval and $25\%$ on T2I-CompBench; (II) outperforms explicit paradigms (e.g., TwiG) by $15\%$ and $11\%$ on abstract reasoning tasks such as WISE and IPV-Txt; (III) reduces inference time by $44\%$ and token consumption by $51\%$; and (IV) exhibits $71\%$ cognitive alignment with human intuition on when to invoke reasoning.
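To make the four-component design concrete, the sketch below shows one generation step with optional latent reasoning. This is a minimal illustration, not the paper's implementation: all dimensions, weight matrices, function names, and the mean-pooling/sigmoid choices are hypothetical stand-ins for the learned condenser, translator, shaper, and invoker modules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only, not from the paper)
HIDDEN, MEMORY, VOCAB = 64, 16, 100

# (i) Condenser: compress intermediate generation states into compact visual memory
def condense(hidden_states, W_c):
    pooled = hidden_states.mean(axis=0)      # pool over generated image tokens: (HIDDEN,)
    return np.tanh(W_c @ pooled)             # compact visual memory: (MEMORY,)

# (ii) Translator: convert latent thoughts into actionable guidance
def translate(memory, W_t):
    return W_t @ memory                      # guidance vector in hidden space: (HIDDEN,)

# (iii) Shaper: steer the next-image-token prediction with the guidance vector
def shape_logits(logits, guidance, W_s, alpha=0.5):
    return logits + alpha * (W_s @ guidance)  # additively adjusted logits: (VOCAB,)

# (iv) Invoker: adaptively decide whether to trigger latent reasoning at this step
def should_invoke(hidden_states, w_i, threshold=0.5):
    score = 1.0 / (1.0 + np.exp(-w_i @ hidden_states.mean(axis=0)))  # sigmoid gate
    return score > threshold

# One decoding step: hypothetical random weights stand in for trained parameters
W_c = rng.normal(size=(MEMORY, HIDDEN)) * 0.1
W_t = rng.normal(size=(HIDDEN, MEMORY)) * 0.1
W_s = rng.normal(size=(VOCAB, HIDDEN)) * 0.1
w_i = rng.normal(size=HIDDEN)

hidden_states = rng.normal(size=(10, HIDDEN))  # states of 10 already-generated tokens
logits = rng.normal(size=VOCAB)                # raw next-token logits

# Reasoning happens entirely in continuous latent space: no text decoding,
# no image decode/re-encode round trip.
if should_invoke(hidden_states, w_i):
    memory = condense(hidden_states, W_c)
    guidance = translate(memory, W_t)
    logits = shape_logits(logits, guidance, W_s)

next_token = int(np.argmax(logits))
print(next_token)
```

Because the invoker gates the whole latent-reasoning path, the condenser/translator/shaper cost is paid only at steps the gate selects, which is the mechanism behind the reported inference-time and token savings over always-on explicit reasoning.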