Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the SenseNova-SI family, which is built upon established multimodal foundations, including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel). We take a principled approach to building high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M, a collection of eight million diverse data samples organized under a rigorous taxonomy of spatial capabilities. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.8% on VSI-Bench, 43.3% on MMSI, 85.7% on MindCube, 54.7% on ViewSpatial, 47.7% on SITE, 63.9% on BLINK, 55.5% on 3DSR, and 72.0% on EmbSpatial, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En). More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization enabled by training on diverse data, examine the risks of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate a potential downstream application. All newly trained multimodal foundation models are publicly released.