Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks, which leaves them sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Beyond these baselines, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head producing voxel-level forecasts across multiple horizons. Extensive experiments on both datasets show that ForecastOcc consistently outperforms the baselines, yielding semantically rich, future-aware predictions that capture the scene dynamics and semantics critical for autonomous driving.
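To make the described pipeline concrete, the following is a minimal, shape-level sketch of the four stages named above (temporal cross-attention over past frames, 2D-to-3D lifting, and a semantic occupancy head for multiple horizons). All module names, tensor sizes, and the single-head attention / LSS-style lifting simplifications are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical, shape-level sketch of a ForecastOcc-style pipeline.
# Sizes and operations are illustrative assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

T_past, H, W, C = 3, 8, 8, 16        # past frames, feature-map size, channels
D, n_classes, n_horizons = 4, 10, 2  # depth bins, semantic classes, future steps

def temporal_cross_attention(past_feats):
    """Fuse past frame features into one future-aware feature map.
    Queries come from the most recent frame; keys/values from all past frames."""
    q = past_feats[-1].reshape(H * W, C)                  # (HW, C)
    kv = past_feats.reshape(T_past * H * W, C)            # (T*HW, C)
    scores = q @ kv.T / np.sqrt(C)                        # scaled dot-product
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    scores /= scores.sum(axis=1, keepdims=True)           # softmax over past tokens
    return (scores @ kv).reshape(H, W, C)                 # (H, W, C)

def lift_2d_to_3d(feat_2d, W_depth):
    """2D-to-3D view transform: predict a per-pixel depth distribution and
    outer-product it with the 2D feature (LSS-style lifting, simplified)."""
    depth_logits = feat_2d @ W_depth                      # (H, W, D)
    depth = np.exp(depth_logits - depth_logits.max(-1, keepdims=True))
    depth /= depth.sum(-1, keepdims=True)
    return depth[..., None] * feat_2d[..., None, :]       # (H, W, D, C) voxel grid

def semantic_head(vox, W_cls):
    """Voxel-level semantic logits for each future horizon."""
    return np.stack([vox @ W_cls[h] for h in range(n_horizons)])

past_feats = rng.standard_normal((T_past, H, W, C))
W_depth = rng.standard_normal((C, D))
W_cls = rng.standard_normal((n_horizons, C, n_classes))

fused = temporal_cross_attention(past_feats)
vox = lift_2d_to_3d(fused, W_depth)       # stands in for the 3D encoder stage
logits = semantic_head(vox, W_cls)
print(logits.shape)  # (2, 8, 8, 4, 10): horizons x H x W x depth x classes
```

In the real system each stage would be a learned network (e.g. a transformer decoder for the forecasting module and a 3D convolutional encoder on the lifted voxels); this sketch only traces how the tensor shapes flow from past 2D features to multi-horizon voxel-level semantic forecasts.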