Vision-language-action (VLA) models have achieved strong results on general robotic tasks, but they still struggle with fine-grained spatiotemporal manipulation. Existing methods typically embed spatiotemporal knowledge into visual and action representations and directly perform a cross-modal mapping for step-level action prediction. However, such spatiotemporal reasoning remains largely implicit, making it difficult to handle multiple sequential behaviors with explicit spatiotemporal boundaries. In this work, we propose ST-$\pi$, a structured spatiotemporal VLA model for robotic manipulation. Our model rests on two key designs: 1) Spatiotemporal VLM. We encode 4D observations and task instructions into latent spaces and feed them into the LLM to generate a sequence of causally ordered chunk-level action prompts consisting of sub-tasks, spatial grounding, and temporal grounding. 2) Spatiotemporal action expert. Conditioned on the chunk-level action prompts, we design a structured dual-generator guidance that jointly models spatial dependencies and temporal causality to predict step-level action parameters. Within this structured framework, the VLM explicitly plans global spatiotemporal behavior, and the action expert refines local spatiotemporal control. In addition, we introduce a real-world robotic dataset with structured spatiotemporal annotations for fine-tuning. Extensive experiments demonstrate the effectiveness of our model. Code is available at https://github.com/chuanhaoma/ST-pi.
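To make the idea of causally ordered chunk-level action prompts more concrete, the following is a minimal sketch of how such prompts might be represented and consumed. It is not the ST-$\pi$ implementation: the class `ChunkPrompt`, its field names, the choice of a 3D point for spatial grounding, a half-open step interval for temporal grounding, and the helpers `validate_causal_order` and `prompt_for_step` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ChunkPrompt:
    """Hypothetical container for one chunk-level action prompt."""
    sub_task: str                                   # natural-language sub-task
    spatial_grounding: Tuple[float, float, float]   # assumed: a 3D target point
    temporal_grounding: Tuple[int, int]             # assumed: [start_step, end_step)


def validate_causal_order(prompts: List[ChunkPrompt]) -> None:
    """Raise ValueError unless chunks are non-empty and non-overlapping in order."""
    prev_end = 0
    for p in prompts:
        start, end = p.temporal_grounding
        if end <= start:
            raise ValueError(f"empty or inverted interval in {p.sub_task!r}")
        if start < prev_end:
            raise ValueError(f"{p.sub_task!r} starts before the previous chunk ends")
        prev_end = end


def prompt_for_step(prompts: List[ChunkPrompt], step: int) -> Optional[ChunkPrompt]:
    """Return the chunk whose temporal interval contains `step`, if any."""
    for p in prompts:
        start, end = p.temporal_grounding
        if start <= step < end:
            return p
    return None


# Illustrative plan with made-up values (not taken from the paper's dataset).
plan = [
    ChunkPrompt("grasp cup", (0.40, 0.10, 0.05), (0, 10)),
    ChunkPrompt("place cup on shelf", (0.20, -0.30, 0.45), (10, 20)),
]
validate_causal_order(plan)
active = prompt_for_step(plan, 12)
print(active.sub_task if active else "no active chunk")
```

In this sketch, a step-level policy would look up the active chunk for the current step and condition its action prediction on that chunk's sub-task and groundings. How ST-$\pi$ actually encodes these prompts and passes them to the action expert is not specified in the abstract.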