World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle to produce visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose \textbf{S}cale-wise \textbf{A}utoregression with \textbf{M}otion \textbf{P}r\textbf{O}mpt (\textbf{SAMPO}), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality while delivering 4.4$\times$ faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.
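To make the hybrid attention pattern concrete, the sketch below illustrates one plausible way to build the block-causal attention mask implied by the abstract: tokens attend causally across frames and across scales within a frame, but bidirectionally among tokens of the same scale (enabling parallel decoding within a scale). This is a minimal illustration, not the paper's implementation; the function name, token counts, and block layout are assumptions made for the example.

\begin{verbatim}
import torch

def scalewise_causal_mask(frame_scales):
    """Hypothetical block-causal mask for scale-wise autoregression.

    frame_scales: list of lists; frame_scales[f][s] is the number of
    tokens at scale s of frame f (e.g. [[1, 4, 16], [1, 4, 16]]).
    Returns a boolean (T, T) mask; True means attention is allowed.
    """
    # Tag every token with its (frame, scale) block in generation order.
    block_ids = []
    for f, scales in enumerate(frame_scales):
        for s, n_tokens in enumerate(scales):
            block_ids.extend([(f, s)] * n_tokens)
    T = len(block_ids)
    mask = torch.zeros(T, T, dtype=torch.bool)
    for i, (fi, si) in enumerate(block_ids):
        for j, (fj, sj) in enumerate(block_ids):
            earlier_frame = fj < fi                # causal across frames
            earlier_scale = fj == fi and sj < si   # coarse-to-fine within a frame
            same_block = fj == fi and sj == si     # bidirectional within a scale
            mask[i, j] = earlier_frame or earlier_scale or same_block
    return mask

# Example: two frames, three scales each (1, 4, and 16 tokens per scale).
mask = scalewise_causal_mask([[1, 4, 16], [1, 4, 16]])
print(mask.shape)  # torch.Size([42, 42])
\end{verbatim}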