Accurate perception of dynamic environments is a fundamental task for autonomous driving and robotic systems. This paper introduces Let Occ Flow, the first self-supervised method for joint 3D occupancy and occupancy flow prediction using only camera inputs, eliminating the need for 3D annotations. Our approach adopts TPV for unified scene representation and deformable attention layers for feature aggregation, incorporating a backward-forward temporal attention module to capture dynamic object dependencies, followed by a 3D refinement module for fine-grained volumetric representation. In addition, our method extends differentiable rendering to 3D volumetric flow fields, leveraging zero-shot 2D segmentation and optical flow cues for dynamic decomposition and motion optimization. Extensive experiments on the nuScenes and KITTI datasets demonstrate the competitive performance of our approach over prior state-of-the-art methods.
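To make the backward-forward temporal aggregation idea concrete, the following is a minimal NumPy sketch of scaled dot-product attention in which current-frame features attend jointly over features pooled from past (backward) and future (forward) frames. All names, shapes, and the toy data are illustrative assumptions, not the paper's actual implementation, which operates on TPV-plane features with deformable attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(query_feats, temporal_feats, d):
    """Scaled dot-product attention: current-frame tokens (queries)
    attend over a stack of past/future frame tokens (keys = values).
    A simplified stand-in for the paper's temporal attention module."""
    scores = query_feats @ temporal_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)          # (N, M), rows sum to 1
    return weights @ temporal_feats             # (N, d) fused features

# Toy data: N current-frame tokens of dim d, and M tokens pooled
# from backward (past) and forward (future) frames.
rng = np.random.default_rng(0)
N, M, d = 4, 6, 8
cur = rng.standard_normal((N, d))
past_future = rng.standard_normal((M, d))

fused = temporal_attention(cur, past_future, d)
print(fused.shape)  # (4, 8)
```

In the actual method, this dense attention would be replaced by deformable attention over sparse sampling locations for efficiency, and the fused features would feed the 3D refinement module.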