Video instance segmentation, also known as multi-object tracking and segmentation, is an emerging computer vision research area introduced in 2019 that aims to detect, segment, and track instances in videos simultaneously. Solving video instance segmentation tasks through effective analysis and use of the visual information in videos enables a range of computer-vision applications, such as human action recognition, medical image processing, autonomous vehicle navigation, and surveillance. As deep-learning techniques have come to dominate many areas of computer vision, a plethora of deep-learning-based video instance segmentation schemes have been proposed. This survey offers a multifaceted view of deep-learning schemes for video instance segmentation, covering various architectural paradigms and comparing their functional performance, model complexity, and computational overhead. In addition to the common architectural designs, auxiliary techniques for improving the performance of deep-learning models for video instance segmentation are compiled and discussed. Finally, we discuss a range of major challenges and directions for further investigation to help advance this promising research field.