Accurate perception of dynamic traffic scenes is crucial for high-level autonomous driving systems, which require robust object motion estimation and instance segmentation. However, traditional methods often treat these as separate tasks, and the resulting lack of information sharing leads to suboptimal performance, spatio-temporal inconsistencies, and inefficiency in complex scenarios. This paper proposes SemanticFlow, a multi-task framework that simultaneously predicts scene flow and instance segmentation for full-resolution point clouds. The novelty of this work is threefold: 1) a coarse-to-fine multi-task prediction scheme, in which an initial coarse segmentation of static backgrounds and dynamic objects provides contextual information for refining motion and semantic estimates through a shared feature processing module; 2) a set of loss functions that enhance scene flow estimation and instance segmentation while helping to ensure the spatial and temporal consistency of both static and dynamic objects in traffic scenes; 3) a self-supervised learning scheme that uses the coarse segmentation to detect rigid objects and compute their transformation matrices between sequential frames, enabling the generation of self-supervised labels. The proposed framework is validated on the Argoverse and Waymo datasets, demonstrating superior instance segmentation accuracy, scene flow estimation, and computational efficiency, and establishing a new benchmark for self-supervised methods in dynamic scene understanding.
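The self-supervised labeling step in contribution 3) hinges on estimating a per-object rigid transformation between sequential frames. A minimal sketch of one standard way to do this is the Kabsch/SVD solver over corresponding points of a coarsely segmented rigid object; the function names and the correspondence assumption are illustrative, not taken from the paper:

```python
import numpy as np

def kabsch_rigid_transform(P, Q):
    """Estimate rotation R and translation t aligning P onto Q (Kabsch/SVD).

    P, Q: (N, 3) arrays of corresponding points sampled from one rigid
    object in frames t and t+1 (correspondences assumed known here).
    """
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

def pseudo_flow_labels(P, R, t):
    """Self-supervised scene-flow labels: per-point displacement
    implied by the estimated rigid motion of the object."""
    return (R @ P.T).T + t - P
```

In this sketch, the recovered `(R, t)` plays the role of the paper's inter-frame transformation matrix, and `pseudo_flow_labels` produces the flow supervision signal for points on that object; static background points would use the ego-motion transform in the same way.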