OmniDet: Surround View Cameras based Multi-task Visual Perception Network for Autonomous Driving

Surround-view fisheye cameras are commonly deployed in automated driving for 360° near-field sensing around the vehicle. This work presents a multi-task visual perception network that operates on unrectified fisheye images to enable the vehicle to sense its surrounding environment. It covers six primary tasks necessary for an autonomous driving system: depth estimation, visual odometry, semantic segmentation, motion segmentation, object detection, and lens soiling detection. We demonstrate that the jointly trained model performs better than the respective single-task versions. Our multi-task model has a shared encoder, which provides a significant computational advantage, and synergized decoders in which tasks support each other. We propose a novel camera-geometry-based adaptation mechanism to encode the fisheye distortion model at both training and inference time. This was crucial to enable training on the WoodScape dataset, which comprises data from different parts of the world collected by 12 different cameras mounted on three different cars with different intrinsics and viewpoints. Given that bounding boxes are not a good representation for distorted fisheye images, we also extend object detection to use a polygon with non-uniformly sampled vertices. We additionally evaluate our model on standard automotive datasets, namely KITTI and Cityscapes. We obtain state-of-the-art results on KITTI for the depth estimation and pose estimation tasks and competitive performance on the other tasks. We perform extensive ablation studies on various architecture choices and task weighting methodologies. A short video at https://youtu.be/xbSjZ5OfPes provides qualitative results.
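The claim that bounding boxes represent distorted fisheye objects poorly can be illustrated with a small geometric sketch. The code below is not the paper's detection method. It assumes a hypothetical curved outline (`arc_band`, with made-up radii and angles) standing in for an object bent by fisheye distortion, and compares the polygon's area, computed with the shoelace formula, against the area of its axis-aligned bounding box.

```python
import math


def shoelace_area(vertices):
    """Area of a simple polygon given as ordered (x, y) vertices."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def bounding_box_area(vertices):
    """Area of the tightest axis-aligned box around the vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def arc_band(radius_inner, radius_outer, start_deg, end_deg, steps=16):
    """Hypothetical object outline: a curved band between two arcs.

    This is an illustrative stand-in for an elongated object whose
    straight edges appear curved in a fisheye image.
    """
    angles = [
        math.radians(start_deg + (end_deg - start_deg) * i / steps)
        for i in range(steps + 1)
    ]
    outer = [(radius_outer * math.cos(a), radius_outer * math.sin(a)) for a in angles]
    inner = [
        (radius_inner * math.cos(a), radius_inner * math.sin(a))
        for a in reversed(angles)
    ]
    return outer + inner


# Assumed values, chosen only for illustration.
outline = arc_band(80.0, 100.0, 20.0, 110.0)
polygon_area = shoelace_area(outline)
box_area = bounding_box_area(outline)
fill_ratio = polygon_area / box_area

print(f"polygon area: {polygon_area:.1f}")
print(f"box area:     {box_area:.1f}")
print(f"fill ratio:   {fill_ratio:.2f}")
```

In this toy case the object fills well under half of its bounding box, so most of the box is background. A polygon that follows the outline avoids that, which is the kind of motivation the abstract gives for the polygon representation.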
