Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving

Multi-task learning has emerged as a powerful paradigm for solving a range of tasks simultaneously, with good efficiency in both computational resources and inference time. However, most existing algorithms are designed for tasks outside the scope of autonomous driving, which makes it hard to compare multi-task methods in that setting. To enable a comprehensive evaluation of current multi-task learning methods in autonomous driving, we extensively investigate the performance of popular multi-task methods on a large-scale driving dataset that covers four common perception tasks: object detection, semantic segmentation, drivable area segmentation, and lane detection. We provide an in-depth analysis of current multi-task learning methods under different common settings and find that existing methods make progress, but a large performance gap remains relative to single-task baselines. To narrow this gap, we present an effective multi-task framework, VE-Prompt, which introduces visual exemplars via task-specific prompting to guide the model toward learning high-quality task-specific representations. Specifically, we generate visual exemplars based on bounding boxes and color-based markers, which provide accurate visual appearances of target categories and further reduce the performance gap. Furthermore, we bridge transformer-based encoders and convolutional layers for efficient and accurate unified perception in autonomous driving. Comprehensive experimental results on the diverse self-driving dataset BDD100K show that VE-Prompt improves the multi-task baseline and further surpasses single-task models.
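
As a rough illustration of what generating visual exemplars from bounding boxes and color-based markers could look like, the sketch below crops a box-shaped region from an image and paints a masked region in a marker color. This is one possible reading of the abstract, not the VE-Prompt implementation: the function names `crop_exemplar` and `mark_region`, the nested-list image representation, and the `(x1, y1, x2, y2)` box convention are all assumptions made for this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B)
Image = List[List[Pixel]]      # rows of pixels; hypothetical toy representation


def crop_exemplar(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Crop the region (x1, y1, x2, y2) from the image.

    x2 and y2 are exclusive, and the box is clamped to the image bounds.
    Assumed convention for this sketch only.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    x1, y1, x2, y2 = box
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]


def mark_region(image: Image, mask: List[List[bool]], color: Pixel) -> Image:
    """Return a copy of the image with masked pixels painted in `color`.

    The input image is left unchanged.
    """
    return [
        [color if flagged else pixel for pixel, flagged in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


if __name__ == "__main__":
    # A 3x4 toy image whose pixel values encode their own coordinates.
    img = [[(x, y, 0) for x in range(4)] for y in range(3)]

    # Bounding-box exemplar, e.g. for an object category.
    exemplar = crop_exemplar(img, (1, 0, 3, 2))
    print(len(exemplar), len(exemplar[0]))

    # Color-marker exemplar, e.g. highlighting a lane-like column in red.
    lane_mask = [[x == 0 for x in range(4)] for _ in range(3)]
    marked = mark_region(img, lane_mask, (255, 0, 0))
    print(marked[0][:2])
```

In a real pipeline the image would be an array or tensor, and the boxes and masks would presumably come from dataset annotations; the nested lists here only keep the sketch free of dependencies.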
