Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to mobile devices. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning at minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Generating a 512x512 image in only ~3s on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will facilitate future research on real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
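The abstract does not detail the MCP's internals, but the efficiency argument behind depthwise-separable convolutions is standard: factoring a convolution into a per-channel spatial filter plus a 1x1 channel mixer sharply reduces parameters and FLOPs. The sketch below (hypothetical, not the Mobile-O code; layer sizes are illustrative) compares parameter counts of the two forms.

```python
# Hypothetical illustration (not the actual Mobile-O implementation):
# why depthwise-separable convolutions, as used in the Mobile
# Conditioning Projector (MCP), are cheap. Channel/kernel sizes below
# are made-up examples, not values from the paper.

def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    # A standard conv mixes channels and space jointly:
    # one k x k filter per (input channel, output channel) pair.
    return c_in * c_out * k * k

def separable_conv_params(c_in: int, c_out: int, k: int) -> int:
    # Depthwise step: one k x k spatial filter per input channel.
    # Pointwise step: a 1x1 conv that mixes channels (c_in x c_out).
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 512, 512, 3
std = standard_conv_params(c_in, c_out, k)
sep = separable_conv_params(c_in, c_out, k)
print(f"standard: {std:,}  separable: {sep:,}  ratio: {std / sep:.1f}x")
```

For these example sizes the separable form uses roughly 9x fewer parameters, which is the kind of saving that makes on-device conditioning layers practical.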