Long-horizon robotic manipulation requires bridging the gap between high-level planning (System 2) and low-level control (System 1). Current Vision-Language-Action (VLA) models often entangle these processes, performing redundant multimodal reasoning at every timestep, which leads to high latency and goal instability. To address this, we present StreamVLA, a dual-system architecture that unifies textual task decomposition, visual goal imagination, and continuous action generation within a single parameter-efficient backbone. We introduce a "Lock-and-Gated" mechanism to modulate computation intelligently: the model triggers slow (System 2) thinking only when a sub-task transition is detected, generating a textual instruction and imagining the specific visual completion state rather than generic future frames. Crucially, this completion state serves as a time-invariant goal anchor, making the policy robust to variations in execution speed. During steady execution, these high-level intents are locked and condition a Flow Matching action head, allowing the model to bypass expensive autoregressive decoding for 72% of timesteps. This hierarchical abstraction maintains sub-goal focus while significantly reducing inference latency. Extensive evaluations demonstrate that StreamVLA achieves state-of-the-art performance, with a 98.5% success rate on the LIBERO benchmark and robust recovery in real-world interference scenarios, while reducing latency by 48% compared to full-reasoning baselines.
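The Lock-and-Gated control flow described above can be sketched as a simple inference loop. This is a minimal illustrative sketch, not the paper's implementation: the class name, the callables `slow_think`, `fast_act`, and `is_transition`, and the toy transition logic in the usage example are all hypothetical placeholders for the learned components.

```python
class LockAndGatedController:
    """Illustrative sketch of the Lock-and-Gated inference loop:
    expensive System-2 reasoning (textual sub-task instruction plus an
    imagined visual completion state) runs only at sub-task transitions;
    on all other timesteps the locked sub-goal conditions a fast action
    head (a Flow Matching policy in the paper)."""

    def __init__(self, slow_think, fast_act, is_transition):
        self.slow_think = slow_think        # obs -> (instruction, goal_state); expensive
        self.fast_act = fast_act            # (obs, locked_goal) -> action; cheap
        self.is_transition = is_transition  # (obs, locked_goal) -> bool
        self.locked_goal = None             # time-invariant goal anchor
        self.slow_calls = 0                 # counts autoregressive (slow) passes

    def step(self, obs):
        # Gate: re-run slow thinking only at the start or at a detected
        # sub-task transition; otherwise keep the locked goal anchor.
        if self.locked_goal is None or self.is_transition(obs, self.locked_goal):
            self.locked_goal = self.slow_think(obs)
            self.slow_calls += 1
        # Steady execution: the locked intent conditions the fast action head.
        return self.fast_act(obs, self.locked_goal)


# Toy usage with stub components: a transition fires every 5th timestep,
# so most steps bypass the slow path entirely.
def slow_think(obs):
    return ("hypothetical sub-task", obs)       # placeholder goal anchor

def fast_act(obs, locked_goal):
    return obs                                  # placeholder action

def is_transition(obs, locked_goal):
    return obs % 5 == 0                         # toy transition detector

ctrl = LockAndGatedController(slow_think, fast_act, is_transition)
for t in range(1, 21):
    ctrl.step(t)
print(ctrl.slow_calls)  # slow path taken on 5 of 20 timesteps
```

Under these toy settings the slow path runs on only 5 of 20 timesteps (the initial step plus four transitions), mirroring how gating lets most timesteps skip autoregressive decoding.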