Learning world models can teach an agent how the world works in an unsupervised manner. Although world modeling can be viewed as a special case of sequence modeling, progress in scaling world models for robotic applications such as autonomous driving has been slower than progress in scaling language models with Generative Pre-trained Transformers (GPT). We identify two major bottlenecks: handling complex, unstructured observation spaces, and having a scalable generative model. We therefore propose a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. To efficiently decode and denoise tokens in parallel, we recast Masked Generative Image Transformer into the discrete diffusion framework with a few simple changes, resulting in notable improvement. When applied to learning world models on point cloud observations, our model reduces the Chamfer distance of the prior state of the art (SOTA) by more than 65% for 1s prediction and by more than 50% for 3s prediction, across the NuScenes, KITTI Odometry, and Argoverse2 datasets. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotic agents.
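For readers unfamiliar with the evaluation metric, the following is a minimal sketch of a symmetric Chamfer distance between two point clouds. It is an illustration under stated assumptions, not the evaluation code behind the reported numbers: the function names are hypothetical, and the choice of unsquared Euclidean distance averaged over both directions is one common convention that may differ from the variant used in the benchmarks.

```python
import math
from typing import Sequence

Point = Sequence[float]


def _nearest_distance(p: Point, cloud: Sequence[Point]) -> float:
    """Euclidean distance from point p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between point clouds a and b.

    Assumption: unsquared Euclidean distances, averaged within each
    direction and then across the two directions. Other conventions
    (squared distances, sums instead of means) are also in use.
    """
    if not a or not b:
        raise ValueError("both point clouds must be non-empty")
    # Mean nearest-neighbour distance from each point in a to cloud b.
    a_to_b = sum(_nearest_distance(p, b) for p in a) / len(a)
    # Mean nearest-neighbour distance from each point in b to cloud a.
    b_to_a = sum(_nearest_distance(q, a) for q in b) / len(b)
    return 0.5 * (a_to_b + b_to_a)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    observed = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
    print(chamfer_distance(predicted, observed))
```

This brute-force version costs O(N·M) distance evaluations, which is fine for illustration; for real LiDAR sweeps a spatial index such as a k-d tree would typically replace the inner nearest-neighbour search.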