Mask Propagation for Efficient Video Semantic Segmentation

Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence. Prior work in this field has demonstrated promising results by extending image semantic segmentation models to exploit temporal relationships across video frames; however, these approaches often incur significant computational costs. In this paper, we propose an efficient mask propagation framework for VSS, called MPVSS. Our approach first employs a strong query-based image segmentor on sparse key frames to generate accurate binary masks and class predictions. We then design a flow estimation module that uses the learned queries to generate a set of segment-aware flow maps, each associated with a mask prediction from the key frame. Finally, each mask is warped with its associated flow map, and the warped masks serve as the mask predictions for the non-key frames. By reusing predictions from key frames, we avoid processing every video frame individually with a resource-intensive segmentor, which alleviates temporal redundancy and significantly reduces computational costs. Extensive experiments on VSPW and Cityscapes demonstrate that our mask propagation framework achieves state-of-the-art (SOTA) accuracy-efficiency trade-offs. For instance, on the VSPW dataset, our best model with a Swin-L backbone outperforms the SOTA MRCFA with MiT-B5 by 4.0% mIoU while requiring only 26% of the FLOPs. Moreover, on the Cityscapes validation set, our framework reduces FLOPs by up to 4x compared to the per-frame Mask2Former baseline, with at most 2% mIoU degradation. Code is available at https://github.com/ziplab/MPVSS.
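
The propagation step described above can be illustrated with a minimal NumPy sketch. It is not the released MPVSS implementation: the function names `warp_bilinear` and `propagate`, the backward-flow convention (each non-key-frame pixel stores an offset to its source location in the key frame), the bilinear sampling, and the Mask2Former-style combination of class probabilities with masks are all assumptions made for illustration. The flow maps, which the paper predicts from the learned queries, are simply taken as inputs here.

```python
import numpy as np


def warp_bilinear(mask, flow):
    """Backward-warp one (H, W) mask with one (H, W, 2) flow map.

    Assumption: flow[y, x] = (dx, dy) points from pixel (x, y) of the
    non-key frame to its source location in the key frame.
    """
    h, w = mask.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Source coordinates in the key frame, clamped to the image border.
    src_x = np.clip(xs + flow[..., 0], 0, w - 1)
    src_y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = src_x - x0
    wy = src_y - y0
    # Bilinear interpolation between the four neighbouring key-frame pixels.
    top = (1 - wx) * mask[y0, x0] + wx * mask[y0, x1]
    bottom = (1 - wx) * mask[y1, x0] + wx * mask[y1, x1]
    return (1 - wy) * top + wy * bottom


def propagate(key_masks, class_probs, flows):
    """Turn key-frame predictions into a label map for a non-key frame.

    key_masks:   (N, H, W) mask probabilities, one per query/segment
    class_probs: (N, K) class probabilities, one row per query/segment
    flows:       (N, H, W, 2) one flow map per segment (assumed given)
    """
    # Each mask is warped with its own flow map (the mask-flow pair).
    warped = np.stack([warp_bilinear(m, f) for m, f in zip(key_masks, flows)])
    # Assumed combination rule: weight each warped mask by its class scores.
    semantic = np.einsum("nk,nhw->khw", class_probs, warped)
    return semantic.argmax(axis=0)


# Toy example: two segments on a 4x4 key frame, content shifted right by 1 px.
key_masks = np.zeros((2, 4, 4))
key_masks[0, :, :2] = 1.0   # segment 0 covers the left half
key_masks[1, :, 2:] = 1.0   # segment 1 covers the right half
class_probs = np.eye(2)     # segment 0 -> class 0, segment 1 -> class 1
flows = np.zeros((2, 4, 4, 2))
flows[..., 0] = -1.0        # every pixel samples from one pixel to its left
labels = propagate(key_masks, class_probs, flows)
```

In this toy case the boundary between the two classes moves one pixel to the right in the non-key frame, while the segmentor itself runs only on the key frame. The cost saving reported in the abstract comes from replacing the segmentor with a lighter flow module on non-key frames; that module is outside the scope of this sketch.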
