Enhancing the spatial reasoning capabilities of Multi-modal Large Language Models (MLLMs) for video understanding is crucial yet challenging. We present Spatial-R1, a targeted approach with two key contributions: the curation of SR, a new video spatial reasoning dataset built from ScanNet with automatically generated QA pairs spanning seven task types, and the application of Task-Specific Group Relative Policy Optimization (GRPO) for fine-tuning. Training the Qwen2.5-VL-7B-Instruct model on SR with GRPO, Spatial-R1 substantially advances performance on the VSI-Bench benchmark, achieving a 7.4\% gain over the baseline and outperforming strong contemporary models. This work validates the effectiveness of specialized data curation and optimization techniques for improving complex spatial reasoning in video MLLMs.
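To make the optimization step concrete, the following is a minimal sketch of the group-relative advantage computation at the heart of GRPO: each prompt is answered by a group of sampled rollouts, and each rollout's reward is normalized against its own group's statistics, which removes the need for a learned value critic. The function name and the example rewards are illustrative, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO-style advantages for one group of rollouts.

    Each advantage is the rollout's reward standardized within its group:
    A_i = (r_i - mean(r)) / (std(r) + eps).
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical example: 4 rollouts for one spatial-reasoning question,
# with a binary correctness reward per sampled answer.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts receive positive advantages and incorrect ones negative advantages, so the subsequent (clipped) policy-gradient update pushes probability mass toward answers that beat their group's average.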