This report is a supplementary document for TaskPrompter, detailing its implementation on a new joint 2D-3D multi-task learning benchmark based on Cityscapes-3D. TaskPrompter is an innovative multi-task prompting framework that unifies the learning of (i) task-generic representations, (ii) task-specific representations, and (iii) cross-task interactions, whereas previous approaches separate these learning objectives into different network modules. This unified design not only reduces the need for meticulous empirical structure design but also significantly enhances the multi-task network's representation learning capability, because the entire model capacity is devoted to optimizing the three objectives simultaneously. The new benchmark, built on the Cityscapes-3D dataset, requires the multi-task model to concurrently generate predictions for monocular 3D vehicle detection, semantic segmentation, and monocular depth estimation. These tasks are essential for joint 2D-3D understanding of visual scenes, particularly in the development of autonomous driving systems. On this benchmark, our multi-task model performs strongly compared with single-task state-of-the-art methods and establishes new state-of-the-art results on the challenging 3D detection and depth estimation tasks.