Vision-Language-Action (VLA) models leverage Multimodal Large Language Models (MLLMs) for robotic control, but recent studies show that MLLMs have limited spatial intelligence because they are trained predominantly on 2D data, which leaves them with inadequate 3D perception for manipulation tasks. Recent approaches incorporate specialized 3D vision models such as VGGT to improve spatial understanding, but they use diverse integration mechanisms that have not been systematically compared, so the optimal fusion strategy remains unclear. We conduct a comprehensive pilot study comparing nine VGGT integration schemes on standardized benchmarks and find that semantic-conditioned gated fusion, which adaptively balances 2D semantic and 3D geometric features based on task context, achieves the strongest performance of the nine. We present 3D-Mix, a plug-and-play module that integrates into diverse VLA architectures (GR00T-style and $\pi$-style) without modifying the existing MLLM or action expert components. Experiments across six MLLM series (nine model variants, 2B--8B parameters) on SIMPLER and LIBERO show that 3D-Mix delivers consistent performance gains, averaging +7.0% on the out-of-domain (OOD) SIMPLER benchmark across all nine GR00T-style variants. These results establish a principled approach to enhancing spatial intelligence in VLA systems.
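To make the idea of semantic-conditioned gated fusion concrete, the following is a minimal sketch and not the 3D-Mix implementation. It assumes one plausible form of the mechanism: a per-channel gate is computed from the 2D semantic feature alone through a sigmoid, then used to blend the semantic and 3D geometric features elementwise. The function name `gated_fusion` and the parameters `gate_weights` and `gate_bias` are hypothetical and chosen only for illustration; the abstract does not specify the actual formulation.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(semantic, geometric, gate_weights, gate_bias):
    """Blend one token's 2D semantic and 3D geometric feature vectors.

    Assumed form (illustrative only, not taken from the paper):
        gate  = sigmoid(gate_weights @ semantic + gate_bias)
        fused = gate * semantic + (1 - gate) * geometric
    The gate depends only on the semantic feature, which is what makes
    this sketch "semantic-conditioned".
    """
    if len(semantic) != len(geometric):
        raise ValueError("semantic and geometric features must have equal length")
    if len(gate_weights) != len(semantic) or len(gate_bias) != len(semantic):
        raise ValueError("gate parameters must have one row/bias per channel")

    fused = []
    for i, (sem_i, geo_i) in enumerate(zip(semantic, geometric)):
        # Gate logit for channel i is a linear function of the semantic vector.
        logit = sum(w * x for w, x in zip(gate_weights[i], semantic)) + gate_bias[i]
        gate = sigmoid(logit)
        # gate near 1 keeps the semantic channel; gate near 0 keeps the geometric one.
        fused.append(gate * sem_i + (1.0 - gate) * geo_i)
    return fused


if __name__ == "__main__":
    sem = [1.0, 2.0]
    geo = [3.0, 6.0]
    zero_w = [[0.0, 0.0], [0.0, 0.0]]
    # With zero weights and zero bias every gate is 0.5, so the result
    # is the plain average of the two feature vectors.
    print(gated_fusion(sem, geo, zero_w, [0.0, 0.0]))
```

In a trained model the gate parameters would be learned and the operation applied to batches of token features; the scalar loops here only expose the arithmetic.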