While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this trade-off, we propose HiVLA, a hierarchical framework centered on visual grounding that explicitly decouples high-level semantic planning from low-level motor control. At the high level, a VLM planner first performs task decomposition and visual grounding to generate a structured plan comprising a subtask instruction and a precise target bounding box. To translate this plan into physical actions, we introduce at the low level a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing each component to be improved independently. Extensive experiments in simulation and in the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, excelling in particular at long-horizon skill composition and at fine-grained manipulation of small objects in cluttered scenes.
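To make the data flow described above more concrete, the following is a minimal sketch of how a structured plan (subtask instruction plus bounding box) might feed a cascaded cross-attention stack. It is an illustration under stated assumptions, not the HiVLA implementation: the names `SubtaskPlan`, `crop_to_bbox`, `cross_attention`, and `cascaded_cross_attention` are hypothetical, the attention is single-head with no learned projections, the token embeddings are random stand-ins for real encoder outputs, and the flow-matching training objective is omitted entirely. The fusion order (global context, then object-centric crop, then skill semantics) follows the order in which the abstract lists the three inputs.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SubtaskPlan:
    """Hypothetical container for one step of the high-level plan."""
    instruction: str  # subtask instruction, e.g. "pick up the red block"
    bbox: tuple       # assumed pixel format: (x_min, y_min, x_max, y_max)


def crop_to_bbox(image, bbox):
    """Cut the object-centric region out of an H x W x C image."""
    x0, y0, x1, y1 = bbox
    return image[y0:y1, x0:x1]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention with a residual.

    queries: (n_q, d), context: (n_c, d). Learned projections are omitted
    to keep the sketch short; a real model would include them.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return queries + weights @ context


def cascaded_cross_attention(action_tokens, global_tokens, crop_tokens, skill_tokens):
    """Attend to the three conditioning sources one after another."""
    x = cross_attention(action_tokens, global_tokens)  # 1. global scene context
    x = cross_attention(x, crop_tokens)                # 2. object-centric crop
    x = cross_attention(x, skill_tokens)               # 3. skill semantics
    return x


# Toy usage with random stand-ins for the image and all token embeddings.
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
plan = SubtaskPlan("pick up the red block", (10, 20, 30, 44))
crop = crop_to_bbox(image, plan.bbox)  # 24 rows x 20 columns x 3 channels

d = 16
fused = cascaded_cross_attention(
    action_tokens=rng.normal(size=(8, d)),   # noisy action tokens (assumed shape)
    global_tokens=rng.normal(size=(32, d)),  # would come from a full-image encoder
    crop_tokens=rng.normal(size=(12, d)),    # would come from encoding `crop`
    skill_tokens=rng.normal(size=(4, d)),    # would come from encoding the instruction
)
```

In this sketch, each stage refines the same set of action tokens rather than concatenating all conditioning tokens into one context, which is one plausible reading of "cascaded". The actual layer layout, head count, and normalization in HiVLA are not specified in the abstract.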