Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
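To make the shared sequence space concrete, the following is a minimal sketch of how text, image, and 3D tokens might be packed into one autoregressive sequence, and how a sample that has only two of the three modalities can still yield a training sequence. It is an illustration under stated assumptions, not the Omni123 implementation: the codebook sizes, the boundary tokens, and the helper names `to_shared`, `build_sequence`, and `sample_task` are all hypothetical, and the abstract specifies neither the tokenizers nor the sequence layout.

```python
import random

# Hypothetical codebook sizes; the abstract does not state any of these.
VOCAB_SIZES = {"text": 1000, "image": 512, "3d": 256}
MODALITIES = list(VOCAB_SIZES)

# Assumed layout: two boundary tokens per modality at the start of the vocabulary.
SPECIAL = {}
for i, m in enumerate(MODALITIES):
    SPECIAL[f"<{m}>"] = 2 * i
    SPECIAL[f"</{m}>"] = 2 * i + 1

# Shift each modality's local ids into its own disjoint range of the shared vocabulary.
OFFSETS = {}
_next = len(SPECIAL)
for m in MODALITIES:
    OFFSETS[m] = _next
    _next += VOCAB_SIZES[m]
SHARED_VOCAB_SIZE = _next


def to_shared(modality, local_ids):
    """Shift one modality's local token ids and wrap them in boundary tokens."""
    size = VOCAB_SIZES[modality]
    if any(not 0 <= t < size for t in local_ids):
        raise ValueError(f"token id out of range for {modality}")
    body = [OFFSETS[modality] + t for t in local_ids]
    return [SPECIAL[f"<{modality}>"], *body, SPECIAL[f"</{modality}>"]]


def build_sequence(sample, order):
    """Concatenate the segments of `sample` in `order` into one token sequence."""
    seq = []
    for m in order:
        seq.extend(to_shared(m, sample[m]))
    return seq


def sample_task(sample, rng):
    """Pick a random ordering over whichever modalities this sample contains."""
    order = list(sample)
    rng.shuffle(order)
    return order


# A text-image pair and an image-3D pair; no aligned triplet is needed.
text_image = {"text": [1, 2], "image": [5, 6, 7]}
image_3d = {"image": [9], "3d": [0, 3]}

rng = random.Random(0)
for pair in (text_image, image_3d):
    order = sample_task(pair, rng)
    print(order, build_sequence(pair, order))
```

Because every sequence is built from whatever modalities a sample happens to contain, heterogeneous paired datasets can be mixed in one training stream. Longer cycles such as text to image to 3D to image would need additional segments (for example, a second rendered view), which this sketch does not model.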