We proposed Lavida-O, a unified multimodal Masked Diffusion Model (MDM) capable of both image understanding and generation. Unlike existing multimodal diffusion language models such as MMaDa and Muddit, which support only simple image-level understanding tasks and low-resolution image generation, Lavida-O offers new capabilities including object grounding, image editing, and high-resolution (1024px) image synthesis. It is also the first unified MDM that leverages its understanding capabilities to improve image generation and editing through planning and iterative self-reflection. To enable effective and efficient training and sampling, Lavida-O introduces several novel techniques, including the Elastic Mixture-of-Transformer architecture, universal text conditioning, and stratified sampling. \ours~achieves state-of-the-art performance on a wide range of benchmarks, including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev while offering considerable speedups at inference.