Multi-task visual perception has a wide range of applications in scene understanding, for example in autonomous driving. In this work, we devise an efficient unified framework to solve multiple common perception tasks, including instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation. Simply sharing the same visual feature representations across these tasks impairs task performance, while independent task-specific feature extractors lead to parameter redundancy and added latency. Thus, we design two feature-merge branches to learn feature bases that are useful to, and therefore shared by, multiple perception tasks. Each task then takes its corresponding feature basis as the input to its prediction head. In particular, one feature-merge branch is designed for instance-level recognition and the other for dense predictions. To enhance inter-branch communication, the instance branch passes pixel-wise spatial information about each instance to the dense branch using efficient dynamic convolution weighting. Moreover, a simple but effective dynamic routing mechanism is proposed to isolate task-specific features and leverage common properties among tasks. Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception. In addition, because the tasks benefit from co-training with each other, our solution achieves on-par results in partially labeled settings on nuScenes and outperforms previous work on 3D detection and depth estimation on the Cityscapes dataset with full supervision.
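To make the inter-branch communication more concrete, the following is a minimal sketch of one common form of dynamic convolution weighting: a per-instance 1x1 convolution whose weights are predicted by the instance branch and applied to the dense branch's feature map. The abstract does not specify the kernel size, the activation, or how the weights are produced, so the function name `dynamic_conv1x1`, the sigmoid output, and the toy inputs are all assumptions made for illustration and are not taken from D2BNet itself.

```python
import math


def dynamic_conv1x1(features, weights, bias):
    """Apply a per-instance 1x1 convolution to a C x H x W feature map.

    Hypothetical helper for illustration only (not the D2BNet implementation).

    features: list of C channels, each an H x W nested list of floats
              (stands in for the dense branch's feature basis).
    weights:  list of C floats predicted for ONE instance
              (stands in for the instance branch's dynamic weights).
    bias:     float predicted for the same instance.
    Returns an H x W nested list with values in (0, 1).
    """
    channels = len(features)
    if len(weights) != channels:
        raise ValueError("need exactly one weight per feature channel")
    height, width = len(features[0]), len(features[0][0])

    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # A 1x1 convolution is a weighted sum over channels at each pixel.
            z = bias + sum(weights[c] * features[c][y][x] for c in range(channels))
            # Assumed activation: squash the response to a soft spatial map.
            row.append(1.0 / (1.0 + math.exp(-z)))
        out.append(row)
    return out


# Toy input: 2 channels on a 2 x 2 grid.
toy_features = [
    [[1.0, 0.0], [0.0, 1.0]],  # channel 0
    [[0.0, 1.0], [1.0, 0.0]],  # channel 1
]
# Weights that favour channel 0 and suppress channel 1 for this instance.
instance_map = dynamic_conv1x1(toy_features, weights=[2.0, -2.0], bias=0.0)
```

Because the weights differ per instance while the feature map is computed once, the cost of adding an instance in this sketch is a single weighted sum per pixel, which is the usual motivation for this kind of weighting.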