3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts only as an indirect proxy for the task objective, leading to a misalignment between the training objective and task performance. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. It first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, then applies reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics such as 3D IoU and F1-Score to provide more effective signals for guiding model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further show that 3D-RFT is robustly effective, and we report insights into training strategies and the impact of data. We hope 3D-RFT can serve as a robust and promising paradigm for the future development of 3D scene understanding.
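To make the idea of a metric-derived, verifiable reward concrete, the following is a minimal sketch rather than the reward used in 3D-RFT. It assumes axis-aligned boxes in the form (xmin, ymin, zmin, xmax, ymax, zmax), uses the raw 3D IoU as the reward, and applies the group-relative normalization commonly associated with GRPO (reward minus group mean, divided by group standard deviation). The function names `iou_3d_axis_aligned` and `group_relative_advantages`, the box format, and the `eps` constant are all illustrative assumptions.

```python
from statistics import mean, pstdev


def iou_3d_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax).

    Assumption: boxes are axis-aligned; oriented boxes need a different routine.
    """
    inter = 1.0
    for i in range(3):
        lo = max(box_a[i], box_b[i])
        hi = min(box_a[i + 3], box_b[i + 3])
        if hi <= lo:  # no overlap along this axis, so no intersection volume
            return 0.0
        inter *= hi - lo
    vol_a = (box_a[3] - box_a[0]) * (box_a[4] - box_a[1]) * (box_a[5] - box_a[2])
    vol_b = (box_b[3] - box_b[0]) * (box_b[4] - box_b[1]) * (box_b[5] - box_b[2])
    return inter / (vol_a + vol_b - inter)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of three sampled box predictions for one ground-truth box.
ground_truth = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
samples = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 2.0),  # exact match
    (1.0, 1.0, 1.0, 3.0, 3.0, 3.0),  # partial overlap
    (5.0, 5.0, 5.0, 6.0, 6.0, 6.0),  # no overlap
]
rewards = [iou_3d_axis_aligned(s, ground_truth) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed from the evaluation metric itself, a sample that scores a higher IoU than its group receives a positive advantage, which is the sense in which such training optimizes the model towards the metric rather than towards token-level likelihood.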
