Recent successes in autoregressive (AR) generation models, such as the GPT series in natural language processing, have motivated efforts to replicate this success in visual tasks. Some works attempt to extend this approach to autonomous driving by building video-based world models capable of generating realistic future video sequences and predicting ego states. However, prior works tend to produce unsatisfactory results, as the classic GPT framework is designed to handle 1D contextual information, such as text, and lacks the inherent ability to model the spatial and temporal dynamics essential for video generation. In this paper, we present DrivingWorld, a GPT-style world model for autonomous driving, featuring several spatial-temporal fusion mechanisms. This design enables effective modeling of both spatial and temporal dynamics, facilitating high-fidelity, long-duration video generation. Specifically, we propose a next-state prediction strategy to model temporal coherence between consecutive frames and apply a next-token prediction strategy to capture spatial information within each frame. To further enhance generalization ability, we propose novel masking and reweighting strategies for token prediction that mitigate long-term drifting and enable precise control. Our method produces high-fidelity, consistent video clips of over 40 seconds, more than twice as long as those of state-of-the-art driving world models. Experiments show that, in contrast to prior works, our method achieves superior visual quality and significantly more accurate controllable future video generation. Our code is available at https://github.com/YvanYin/DrivingWorld.
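The two-level autoregressive scheme in the abstract (a next-state step across frames and a next-token loop within each frame, plus training-time masking of conditioning tokens) can be sketched as a toy generation loop. This is a minimal illustration under assumed names: `predict_token`, `generate`, `mask_tokens`, and the deterministic dummy "model" are all hypothetical stand-ins, not the DrivingWorld implementation.

```python
# Toy sketch of two-level autoregressive video-token generation:
# an outer next-state loop over frames, an inner next-token loop
# within each frame. All names here are illustrative assumptions.
import random

FRAME_TOKENS = 4   # tokens per frame (real models use thousands)
VOCAB = 8          # toy codebook size
random.seed(0)

def predict_token(prev_frame, partial_frame):
    """Stand-in for a transformer forward pass: next spatial token,
    conditioned on the previous frame (temporal context) and the tokens
    generated so far in the current frame (spatial context)."""
    context = sum(prev_frame) + sum(partial_frame)
    return (context + len(partial_frame)) % VOCAB  # deterministic dummy

def generate(first_frame, num_future_frames):
    frames = [list(first_frame)]
    for _ in range(num_future_frames):
        prev = frames[-1]                # next-state: condition on last frame
        cur = []
        for _ in range(FRAME_TOKENS):    # next-token: fill frame token by token
            cur.append(predict_token(prev, cur))
        frames.append(cur)
    return frames

def mask_tokens(frame, mask_ratio=0.5, mask_id=VOCAB):
    """Toy version of a training-time masking strategy: randomly replace a
    fraction of conditioning tokens with a mask id, so the model does not
    over-rely on its own (possibly drifted) past predictions."""
    return [mask_id if random.random() < mask_ratio else t for t in frame]

frames = generate([1, 2, 3, 4], num_future_frames=2)
print(frames)
```

The outer/inner loop split mirrors the paper's separation of temporal coherence (next-state) from intra-frame spatial structure (next-token); the masking helper hints at why corrupting the conditioning sequence during training can reduce long-term drift at inference time.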