Recent advances in Multimodal Large Language Models (MLLMs) have enabled unified multimodal understanding and generation. However, they still struggle with fine-grained text-image alignment, often failing to faithfully depict objects with correct attributes such as color, shape, and spatial relations. To mitigate this issue, previous studies have explored preference optimization methods such as DPO and GRPO, but these approaches incur substantial computational cost, both in constructing preference data and in performing optimization. This has motivated self-improving preference optimization approaches, in which the MLLM autonomously generates its own training data, self-estimates preference feedback, and self-optimizes using the resulting preference pairs. However, existing self-improving methods still overlook fine-grained, object-level semantics, allowing object hallucination to persist. To tackle this problem, we propose Object-centric Self-improving Preference Optimization (OSPO), a self-improving framework designed to enhance object-level text-image alignment. OSPO explicitly constructs object-centric preference data without relying on any external data or external models. We also introduce a new approach that leverages attention-based object masks together with an object-weighted SimPO loss to enhance object-specific fidelity. Extensive experiments on three compositional image generation benchmarks demonstrate that OSPO significantly improves fine-grained alignment and reduces object hallucination, outperforming prior self-improving methods and even specialized diffusion-based text-to-image models.
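To make the object-weighted SimPO idea concrete, here is a minimal sketch of how per-token object masks could reweight SimPO's length-normalized log-probability margin. This is an illustrative assumption, not the paper's exact formulation: the function name, the linear up-weighting scheme, and the hyperparameter values are all hypothetical.

```python
import torch
import torch.nn.functional as F

def object_weighted_simpo_loss(logp_w, logp_l, mask_w, mask_l,
                               beta=2.0, gamma=0.5, obj_weight=2.0):
    """Hypothetical object-weighted SimPO loss.

    logp_w, logp_l: per-token log-probs of the chosen / rejected
                    generations, shape (batch, seq_len).
    mask_w, mask_l: binary masks (1 = token/patch inside an
                    attention-derived object region), same shape.
    beta, gamma:    SimPO's scaling factor and target reward margin.
    obj_weight:     extra weight given to object-region tokens (assumed).
    """
    # Up-weight tokens that fall inside object regions; background keeps weight 1.
    w_w = 1.0 + (obj_weight - 1.0) * mask_w
    w_l = 1.0 + (obj_weight - 1.0) * mask_l

    # Weighted average log-prob: SimPO's length normalization, with
    # object tokens contributing more to the sequence-level reward.
    avg_w = (w_w * logp_w).sum(-1) / w_w.sum(-1)
    avg_l = (w_l * logp_l).sum(-1) / w_l.sum(-1)

    # SimPO objective: reference-free margin loss on the weighted rewards.
    return -F.logsigmoid(beta * (avg_w - avg_l) - gamma).mean()
```

The key property of this sketch is that SimPO's reference-free, length-normalized form is preserved (weights reduce to plain averaging when `obj_weight = 1`), while mismatched object regions contribute disproportionately to the preference margin.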