Robotic manipulation critically requires reasoning about future spatial-temporal interactions, yet existing VLA policies and world-model-enhanced policies do not fully model action-relevant spatial-temporal interaction structure. We propose STARRY, a world-model-enhanced action-generation policy that aligns spatial-temporal prediction with action generation. STARRY jointly denoises future spatial-temporal latents and action sequences, and introduces Geometry-Aware Selective Attention Modulation, which converts predicted depth and end-effector geometry into token-aligned weights that selectively modulate action attention. On RoboTwin 2.0, STARRY achieves average success rates of 93.82% and 93.30% under the Clean and Randomized settings, respectively. In real-world experiments, STARRY further improves average success from 42.5% (for $π_{0.5}$) to 70.8%, demonstrating the effectiveness of action-centric spatial-temporal world modeling for spatial-temporally demanding robotic action generation.
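The abstract describes converting predicted geometry into token-aligned weights that selectively modulate action attention. A minimal sketch of one way such modulation could work, assuming per-token scalar weights derived from geometry are folded into the attention scores before the softmax (the function name, weight source, and additive-log formulation are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def geometry_modulated_attention(q, k, v, geo_weights):
    """Single-head attention with per-key-token geometric modulation.

    q, k, v: (num_tokens, dim) arrays.
    geo_weights: (num_tokens,) array in [0, 1], hypothetically derived
    from predicted depth / end-effector geometry; a weight near 0
    suppresses attention to that token, near 1 leaves it unchanged.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Adding log-weights to pre-softmax scores multiplies the
    # post-softmax attention by geo_weights (up to renormalization).
    scores = scores + np.log(geo_weights + 1e-8)[None, :]
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v, attn

# Demo: suppress the third token via its geometric weight.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
w = np.array([1.0, 1.0, 0.0, 1.0])  # token 2 geometrically masked out
out, attn = geometry_modulated_attention(q, k, v, w)
```

The additive-log form keeps the modulation differentiable and composable with ordinary attention; tokens with small geometric weight receive proportionally less attention mass.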