Although recent video instance segmentation approaches have achieved promising results, they remain difficult to deploy in real-world applications on mobile devices, mainly because of (1) heavy computation and memory costs and (2) complicated heuristics for tracking objects. To address these issues, we present MobileInst, a lightweight, mobile-friendly framework for video instance segmentation. First, MobileInst adopts a mobile vision transformer to extract multi-level semantic features, and introduces an efficient query-based dual-transformer instance decoder that produces mask kernels, together with a semantic-enhanced mask decoder, to generate instance segmentation per frame. Second, MobileInst exploits simple yet effective kernel reuse and kernel association to track objects across frames. In addition, we propose temporal query passing to enhance the tracking ability of the kernels. We conduct experiments on the COCO and YouTube-VIS datasets to demonstrate the superiority of MobileInst, and we evaluate inference latency on a mobile CPU core of the Qualcomm Snapdragon-778G without any other acceleration methods. On COCO, MobileInst achieves 30.5 mask AP at 176 ms on the mobile CPU, reducing latency by 50% compared to the previous state of the art. For video instance segmentation, MobileInst achieves 35.0 AP on YouTube-VIS 2019 and 30.1 AP on YouTube-VIS 2021. Code will be made available to facilitate real-world applications and future research.
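To make the tracking idea more concrete, the following is a minimal sketch of what kernel reuse and kernel association could look like. The abstract does not specify the mechanics, so everything here is an assumption made for illustration: the function names (`apply_kernel`, `associate_kernels`), the use of a dot product between a kernel and per-pixel features to form a mask, the cosine-similarity score, the greedy one-to-one matching, and the `min_sim` threshold are all hypothetical and are not taken from the MobileInst code.

```python
import math


def apply_kernel(kernel, feature_map):
    """Kernel reuse (assumed form): apply a mask kernel predicted on an
    earlier frame to the mask features of a new frame.

    feature_map is an H x W grid of C-dimensional feature vectors.
    Returns an H x W binary mask (1 where the kernel response is positive).
    """
    return [
        [1 if sum(k * f for k, f in zip(kernel, pixel)) > 0 else 0 for pixel in row]
        for row in feature_map
    ]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def associate_kernels(prev_kernels, curr_kernels, min_sim=0.5):
    """Kernel association (assumed form): greedily match current kernels to
    previous kernels by cosine similarity, one-to-one.

    Returns a dict mapping each current index to a previous index, or to
    None when no previous kernel is similar enough (a new object).
    """
    scored = sorted(
        (
            (cosine(p, c), i, j)
            for i, p in enumerate(prev_kernels)
            for j, c in enumerate(curr_kernels)
        ),
        reverse=True,
    )
    assignment = {j: None for j in range(len(curr_kernels))}
    used_prev = set()
    for sim, i, j in scored:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if i in used_prev or assignment[j] is not None:
            continue
        assignment[j] = i
        used_prev.add(i)
    return assignment


if __name__ == "__main__":
    prev = [[1.0, 0.0], [0.0, 1.0]]
    curr = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
    print(associate_kernels(prev, curr))
    print(apply_kernel([1.0, -1.0], [[[2.0, 1.0], [0.0, 3.0]]]))
```

In this toy example the first two current kernels are matched to the previous kernels they most resemble, and the third, which is dissimilar to both, is left unmatched and would start a new track. A real system would more likely operate on batched tensors and might use a different matching rule; the sketch is only meant to show why matching kernels can replace heavier tracking heuristics.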