Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-scale, statically curated datasets, where all training samples are treated uniformly regardless of the model's evolving capabilities. This static paradigm is inherently data-inefficient: training capacity is often spent on samples that are either trivial or overly difficult for the model at its current stage. To address this limitation, we propose Ouroboros-Spatial, a self-evolving training framework in which the model plays dual roles as a proposer and a solver. In each iteration, a frozen proposer generates spatial question-answer (QA) pairs from 3D scene metadata and raw video frames, together with executable code for deriving reliable ground truth. A learnable solver is then fine-tuned on the accepted samples, and its per-sample prediction confidence is used as a difficulty signal. This signal is fed back to the proposer in the next iteration, guiding it to generate questions better matched to the solver's current capabilities. Through this closed-loop design, the training distribution co-evolves with model ability, reducing redundant trivial examples while filtering out ambiguous or uninformative samples with limited learning value. Across six spatial reasoning benchmarks, Ouroboros-Spatial substantially improves Qwen3-VL-4B and Qwen3-VL-8B while using an order of magnitude fewer training examples than recent large-scale curated datasets. On VSI-Bench, it yields absolute gains of 9.9 and 6.8 points for the 4B and 8B models, respectively, enabling both to outperform a wide range of strong open-source and proprietary baselines.
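As a rough illustration of two steps in the loop described above, the following Python sketch shows (1) accepting a proposed QA pair only when its accompanying code executes against scene metadata and yields an answer, and (2) summarizing per-sample solver confidence into a difficulty signal for the next proposer round. This is a toy under stated assumptions, not the authors' implementation: the names `QASample`, `derive_ground_truth`, and `difficulty_feedback`, the dictionary-based scene format, and the confidence thresholds 0.3 and 0.9 are all hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Scene = Dict[str, Tuple[float, float, float]]  # hypothetical: object name -> 3D position (m)


@dataclass
class QASample:
    question: str
    gt_code: str                    # code emitted by the proposer to derive the answer
    answer: Optional[float] = None  # filled in only if the code runs successfully


def derive_ground_truth(sample: QASample, scene: Scene) -> bool:
    """Run the proposer's code on the scene metadata; reject the sample on any failure."""
    env = {"scene": scene, "math": math}
    try:
        # Toy only: a real pipeline would execute untrusted code in a sandbox.
        exec(sample.gt_code, env)
        sample.answer = float(env["answer"])
        return True
    except Exception:
        return False


def difficulty_feedback(confidences: List[float],
                        low: float = 0.3, high: float = 0.9) -> Dict[str, int]:
    """Bucket solver confidences; the thresholds are assumed values for illustration."""
    buckets = {"too_hard": 0, "informative": 0, "too_easy": 0}
    for c in confidences:
        if c < low:
            buckets["too_hard"] += 1
        elif c > high:
            buckets["too_easy"] += 1
        else:
            buckets["informative"] += 1
    return buckets


scene = {"chair": (0.0, 0.0, 0.0), "table": (3.0, 4.0, 0.0)}
good = QASample("How far is the chair from the table, in meters?",
                "answer = math.dist(scene['chair'], scene['table'])")
bad = QASample("How far is the chair from the lamp, in meters?",
               "answer = math.dist(scene['chair'], scene['lamp'])")  # no lamp in the scene

accepted = [s for s in (good, bad) if derive_ground_truth(s, scene)]
feedback = difficulty_feedback([0.95, 0.5, 0.1, 0.97])
```

In this sketch, the `feedback` counts stand in for the difficulty signal that would be included in the proposer's next prompt, for example to request fewer questions of the kind the solver already answers with high confidence.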
