With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization has become crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular, 2D scenes, which leads to inefficiency and inter-view inconsistency when they are applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens to predict 3D semantic occupancy and thereby improve spatial awareness. Trained with these multiple objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi-view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.
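The encode/decode data flow described above can be illustrated with a toy sketch. This is not the DriveTok implementation: it substitutes plain dense cross-attention for the 3D deformable cross-attention, uses random vectors in place of foundation-model features and learned queries, and omits the task heads. All names and sizes (`cross_attend`, `NUM_SCENE_TOKENS`, and so on) are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, features):
    """Dense scaled dot-product cross-attention (no learned projections).

    ASSUMPTION: stands in for the paper's 3D deformable cross-attention,
    which samples a sparse set of locations instead of attending densely.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, f) / math.sqrt(dim) for f in features])
        out.append(
            [sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)]
        )
    return out


def random_vectors(rng, count, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


# Hypothetical toy sizes; real driving scenes are far larger.
NUM_VIEWS, PATCHES_PER_VIEW, DIM, NUM_SCENE_TOKENS = 6, 4, 8, 5
rng = random.Random(0)

# Stand-in for per-patch features from a vision foundation model,
# flattened across all camera views.
image_features = random_vectors(rng, NUM_VIEWS * PATCHES_PER_VIEW, DIM)

# Encode: a fixed set of scene queries gathers information from every view.
scene_queries = random_vectors(rng, NUM_SCENE_TOKENS, DIM)
scene_tokens = cross_attend(scene_queries, image_features)

# Decode: per-view patch queries read back from the shared scene tokens.
view_queries = random_vectors(rng, NUM_VIEWS * PATCHES_PER_VIEW, DIM)
decoded_features = cross_attend(view_queries, scene_tokens)

print(len(scene_tokens), len(decoded_features))
```

The point of the sketch is the bottleneck: all views are compressed into one small, shared set of scene tokens, and every view is reconstructed from that same set, which is what makes inter-view consistency possible in principle.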