Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation, driving advances in text-to-image generation. However, achieving accurate text-image alignment for LMMs, particularly in compositional scenarios, remains challenging. Existing approaches, such as layout planning for multi-step generation and learning from human or AI feedback, depend heavily on prompt engineering, costly human annotations, and continual model upgrades, limiting flexibility and scalability. In this work, we introduce SILMM, a model-agnostic iterative self-improvement framework that enables LMMs to provide helpful and scalable self-feedback and to optimize text-image alignment via Direct Preference Optimization (DPO). DPO can be readily applied to LMMs that use discrete visual tokens as intermediate image representations, but it is less suitable for LMMs with continuous visual features, because the required generation probabilities are difficult to obtain. To adapt SILMM to LMMs with continuous features, we propose a diversity mechanism to obtain diverse representations and a kernel-based continuous DPO for alignment. Extensive experiments on three compositional text-to-image generation benchmarks validate the effectiveness and superiority of SILMM, showing improvements exceeding 30% on T2I-CompBench++ and around 20% on DPG-Bench.
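For reference, the standard DPO objective (Rafailov et al., 2023) that the abstract builds on is shown below; the notation ($y_w$, $y_l$, $\beta$, $\pi_{\mathrm{ref}}$) follows the original DPO formulation rather than this paper, and the kernel-based continuous variant proposed here is not reproduced:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

This makes the abstract's distinction concrete: with discrete visual tokens, $\pi_\theta(y\mid x)$ factorizes into per-token likelihoods and is directly computable, whereas continuous visual features admit no such likelihood, motivating the kernel-based continuous DPO.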