Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to mobile devices. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning at minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Generating a 512x512 image in only ~3s on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will facilitate future research on real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
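The abstract does not detail the MCP's internals, but the efficiency argument behind depthwise-separable convolutions is standard: factoring a convolution into a per-channel spatial filter plus a 1x1 channel mixer sharply reduces parameters and FLOPs. The sketch below (hypothetical, not the Mobile-O code; layer sizes are illustrative) compares parameter counts of the two forms.

```python
# Hypothetical illustration (not the actual Mobile-O implementation):
# why depthwise-separable convolutions, as used in the Mobile
# Conditioning Projector (MCP), are cheap. Channel/kernel sizes below
# are made-up examples, not values from the paper.

def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    # A standard conv mixes channels and space jointly:
    # one k x k filter per (input channel, output channel) pair.
    return c_in * c_out * k * k

def separable_conv_params(c_in: int, c_out: int, k: int) -> int:
    # Depthwise step: one k x k spatial filter per input channel.
    # Pointwise step: a 1x1 conv that mixes channels (c_in x c_out).
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 512, 512, 3
std = standard_conv_params(c_in, c_out, k)
sep = separable_conv_params(c_in, c_out, k)
print(f"standard: {std:,}  separable: {sep:,}  ratio: {std / sep:.1f}x")
```

For these example sizes the separable form uses roughly 9x fewer parameters, which is the kind of saving that makes on-device conditioning layers practical.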