We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.