Vision-Language-Action (VLA) models represent a paradigm shift in embodied AI, yet existing frameworks often struggle with imprecise spatial perception, suboptimal multimodal fusion, and unstable reinforcement learning. To bridge these gaps, we propose OmniVLA-RL, a novel architecture that uses a Mix-of-Transformers (MoT) design to integrate reasoning, spatial, and action experts. We further introduce Flow-GSPO, which reformulates flow matching as a Stochastic Differential Equation (SDE) process and combines it with Group Segmented Policy Optimization (GSPO) to improve action precision and training robustness. Extensive evaluations on the LIBERO and LIBERO-Plus benchmarks show that OmniVLA-RL achieves strong overall performance and surpasses mainstream existing methods, addressing the spatial perception, multimodal fusion, and training stability limitations of current VLA models.