We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting flexible input-output configurations, Cosmos 3 unifies the modalities critical to Physical AI, subsuming vision-language models, video generators, world simulators, and world-action models in a single framework. Our evaluation shows that Cosmos 3 sets a new state of the art across a diverse suite of understanding and generation tasks, which supports the use of omnimodal world models as scalable, general-purpose backbones for embodied agents. At the time this technical report was written, our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis and as the best policy model by RoboArena. To accelerate open research and deployment in Physical AI, we release our code, model checkpoints, curated synthetic datasets, and evaluation benchmark under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.