We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs such as MMaDa and Muddit, which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O provides a single framework that enables image-level understanding, object grounding, image editing, and high-resolution (1024px) text-to-image synthesis. Lavida-O incorporates a novel Elastic Mixture-of-Transformers (Elastic-MoT) architecture that couples a lightweight generation branch with a larger understanding branch, supported by token compression, universal text conditioning, and stratified sampling for efficient, high-quality generation. Lavida-O further introduces planning and iterative self-reflection in image generation and editing tasks, seamlessly boosting generation quality with its understanding capabilities. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks, including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive models and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering a considerable speedup at inference. These advances establish Lavida-O as a new paradigm for scalable multimodal reasoning and generation.
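To make the Elastic-MoT coupling concrete, the following is a minimal sketch (not the paper's implementation) of the general mixture-of-transformers idea the abstract describes: a lightweight generation branch and a larger understanding branch keep separate weights but attend jointly over a shared token sequence, so understanding features can condition generation without duplicating the full model. All module names, dimensions, and the shared attention width below are illustrative assumptions, not Lavida-O's actual API.

```python
# Illustrative sketch of a two-branch mixture-of-transformers block with joint
# attention. Sizes and names are assumptions, not Lavida-O's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ElasticMoTBlock(nn.Module):
    def __init__(self, d_und=2048, d_gen=1024, d_shared=1024, n_heads=8):
        super().__init__()
        # Branch-specific projections into a shared attention space.
        self.qkv_und = nn.Linear(d_und, 3 * d_shared)
        self.qkv_gen = nn.Linear(d_gen, 3 * d_shared)
        self.out_und = nn.Linear(d_shared, d_und)
        self.out_gen = nn.Linear(d_shared, d_gen)
        # Branch-specific feed-forward networks: the "larger understanding"
        # vs. "lightweight generation" split lives in these separate weights.
        self.ffn_und = nn.Sequential(
            nn.Linear(d_und, 4 * d_und), nn.GELU(), nn.Linear(4 * d_und, d_und))
        self.ffn_gen = nn.Sequential(
            nn.Linear(d_gen, 2 * d_gen), nn.GELU(), nn.Linear(2 * d_gen, d_gen))
        self.n_heads = n_heads
        self.d_shared = d_shared

    def forward(self, x_und, x_gen):
        # x_und: (B, N_u, d_und) understanding tokens; x_gen: (B, N_g, d_gen) image tokens.
        B = x_und.size(0)
        q_u, k_u, v_u = self.qkv_und(x_und).chunk(3, dim=-1)
        q_g, k_g, v_g = self.qkv_gen(x_gen).chunk(3, dim=-1)
        # Concatenate both branches so every token attends over the joint sequence.
        q = torch.cat([q_u, q_g], dim=1)
        k = torch.cat([k_u, k_g], dim=1)
        v = torch.cat([v_u, v_g], dim=1)

        def split_heads(t):
            return t.view(B, -1, self.n_heads, self.d_shared // self.n_heads).transpose(1, 2)

        attn = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        attn = attn.transpose(1, 2).reshape(B, -1, self.d_shared)
        a_u, a_g = attn.split([x_und.size(1), x_gen.size(1)], dim=1)
        # Route each branch's tokens back through its own (differently sized) weights.
        x_und = x_und + self.out_und(a_u)
        x_gen = x_gen + self.out_gen(a_g)
        return x_und + self.ffn_und(x_und), x_gen + self.ffn_gen(x_gen)
```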