Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that test-time compute in modern T2I systems can be productively spent on repeated MLLM conditioning over evolving noisy representations.
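The conditioning loop described above can be illustrated with a toy sketch. It is not the RepFusion implementation: the functions `project`, `mllm_encode`, and `dit_velocity` are hypothetical stand-ins for the MLP projector, the MLLM, and the diffusion transformer, and the Euler-style update, the time convention, and all numeric choices are assumptions made only for illustration. The one point it is meant to show is that the MLLM is called again at every denoising step on the current noisy representation, rather than once on the text prompt.

```python
import random


def project(z_t):
    """Hypothetical stand-in for the MLP projector (noisy representation -> MLLM input)."""
    return [0.5 * v for v in z_t]


def mllm_encode(prompt_tokens, projected):
    """Hypothetical stand-in for the MLLM: mixes prompt and noisy visual inputs."""
    bias = sum(prompt_tokens) / len(prompt_tokens)
    return [bias + p for p in projected]


def dit_velocity(z_t, cond, t):
    """Hypothetical stand-in for the diffusion transformer: predicts an update direction."""
    # t is accepted for interface completeness; this toy stub ignores it.
    return [c - z for z, c in zip(z_t, cond)]


def sample(prompt_tokens, dim=4, steps=8, seed=0):
    """Toy sampler: re-runs the MLLM stub on the evolving noisy representation each step."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / steps
    mllm_calls = 0
    for i in range(steps):
        t = i * dt  # assumed convention: t = 0 is pure noise
        cond = mllm_encode(prompt_tokens, project(z))  # conditioning from the current noisy z
        mllm_calls += 1
        v = dit_velocity(z, cond, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]  # assumed Euler-style update
    return z, mllm_calls


if __name__ == "__main__":
    z_final, calls = sample([1, 2, 3])
    print(len(z_final), calls)
```

In this sketch the number of MLLM calls equals the number of denoising steps, which is one simple way to read "test-time compute spent on repeated MLLM conditioning"; how the actual system schedules those calls is not specified in the abstract.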