SAMPO: Scale-wise Autoregression with Motion PrOmpt for Generative World Models

World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose \textbf{S}cale-wise \textbf{A}utoregression with \textbf{M}otion \textbf{P}r\textbf{O}mpt (\textbf{SAMPO}), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4$\times$ faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.
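The combination of temporal causal decoding and bidirectional spatial attention can be pictured as an attention mask. The sketch below is a minimal illustration, not the authors' implementation: it assumes each frame is tokenized into coarse-to-fine square scales, that a token may attend to every token of earlier frames, and that within its own frame it may attend to tokens of the same or coarser scales. The function names `build_token_index` and `block_causal_mask` and the scale sizes are hypothetical.

```python
from typing import List, Tuple


def build_token_index(num_frames: int, scale_sizes: List[int]) -> List[Tuple[int, int]]:
    """Return a (frame, scale) pair for every token.

    Tokens are laid out frame by frame; inside a frame, scales go from
    coarse to fine, and scale k contributes side*side tokens.
    (Layout is an assumption made for this illustration.)
    """
    index: List[Tuple[int, int]] = []
    for frame in range(num_frames):
        for scale, side in enumerate(scale_sizes):
            index.extend([(frame, scale)] * (side * side))
    return index


def block_causal_mask(num_frames: int, scale_sizes: List[int]) -> List[List[bool]]:
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Rules assumed here:
      * causal across time: keys from earlier frames are always visible;
      * within the same frame: keys from the same or a coarser scale are
        visible, so attention is bidirectional inside each scale and all
        tokens of a scale can be decoded in parallel.
    """
    index = build_token_index(num_frames, scale_sizes)
    mask: List[List[bool]] = []
    for q_frame, q_scale in index:
        row = [
            (k_frame < q_frame) or (k_frame == q_frame and k_scale <= q_scale)
            for k_frame, k_scale in index
        ]
        mask.append(row)
    return mask


# Two frames, each with a 1x1 scale followed by a 2x2 scale (5 tokens per frame).
mask = block_causal_mask(num_frames=2, scale_sizes=[1, 2])
```

With this layout, the four tokens of the 2x2 scale in a frame can all see one another, which is what permits parallel decoding within a scale, while no token in the first frame can see any token of the second.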

