In this report, we present our solution to the multi-task robustness track of the 1st Visual Continual Learning (VCL) Challenge at the ICCV 2023 Workshop. We propose a vanilla framework named UniNet that seamlessly combines various visual perception algorithms into a single multi-task model. Specifically, we choose DETR3D, Mask2Former, and BinsFormer for the 3D object detection, instance segmentation, and depth estimation tasks, respectively. Our final submission is a single model with an InternImage-L backbone, and it achieves an overall score of 49.6 (29.5 Det mAP, 80.3 mTPS, 46.4 Seg mAP, and 7.93 SILog) on the SHIFT validation set. In addition, we report several observations from our experiments that may facilitate the development of multi-task learning in dense visual prediction.