SpVOS: Efficient Video Object Segmentation with Triple Sparse Convolution

Semi-supervised video object segmentation (Semi-VOS), which requires annotating only the first frame of a video in order to segment the subsequent frames, has received increasing attention recently. Among existing pipelines, memory-matching-based methods have become the main line of research, because they can fully exploit temporal sequence information to obtain high-quality segmentation results. Although these methods achieve promising performance, the overall framework still suffers from heavy computational overhead, caused mainly by the per-frame dense convolutions between high-resolution feature maps and every kernel filter. We therefore propose SpVOS, a sparse baseline for VOS, which introduces a novel triple sparse convolution to reduce the computational cost of the overall VOS framework. The designed triple gate takes both the spatial and the temporal redundancy between adjacent video frames into account and adaptively makes a three-way decision on how to apply the sparse convolution at each pixel. This controls the computational overhead of each layer while maintaining sufficient discriminative capability to distinguish similar objects and avoid error accumulation. A mixed sparse training strategy, coupled with an objective that incorporates a sparsity constraint, is also developed to balance VOS segmentation performance against computational cost. Experiments are conducted on two mainstream VOS datasets, DAVIS and Youtube-VOS. The results show that SpVOS outperforms other state-of-the-art sparse methods and even remains comparable to a typical non-sparse VOS baseline: it achieves an overall score of 83.04% on the DAVIS-2017 validation set and 79.29% on the Youtube-VOS validation set, versus 82.88% and 80.36% for the baseline, while saving up to 42% of FLOPs, which shows its potential for resource-constrained scenarios.

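To make the idea of a per-pixel three-way gate more concrete, the following is a minimal NumPy sketch. The abstract does not specify what the three decisions are, so the branches used here (reuse the previous frame's output, compute only a subset of output channels, compute all output channels) are an assumption for illustration, as are the 1x1 convolution, the squared-gap sparsity penalty, and every function and parameter name. It is not the authors' implementation.

```python
import numpy as np


def triple_gate(logits):
    """Pick one of three branches per pixel from gate logits of shape (3, H, W).

    Hypothetical hard decision via argmax; a trainable gate would need a
    differentiable relaxation, which is out of scope for this sketch.
    """
    return np.argmax(logits, axis=0)


def gated_pointwise_conv(x, prev_out, weight, decision, light_channels):
    """Apply a 1x1 convolution only where the gate asks for it.

    x:              (C_in, H, W) features of the current frame
    prev_out:       (C_out, H, W) output of this layer on the previous frame
    weight:         (C_out, C_in) 1x1 convolution kernel
    decision:       (H, W) integers in {0, 1, 2}
    light_channels: number of output channels computed in the partial branch

    Assumed branches (not taken from the paper):
      0 = skip, reuse the previous frame's output at this pixel
      1 = partial, recompute only the first `light_channels` output channels
      2 = full, recompute every output channel
    """
    out = prev_out.copy()
    light = decision == 1
    full = decision == 2
    # Partial branch: only a slice of the kernel touches these pixels.
    out[:light_channels, light] = weight[:light_channels] @ x[:, light]
    # Full branch: the whole kernel is applied at these pixels.
    out[:, full] = weight @ x[:, full]
    return out


def flops_ratio(decision, c_out, light_channels):
    """Fraction of the dense layer's multiply-adds that the gate keeps."""
    light_frac = np.mean(decision == 1)
    full_frac = np.mean(decision == 2)
    return full_frac + light_frac * light_channels / c_out


def sparsity_penalty(ratio, target):
    """Hypothetical sparsity term: squared gap between kept and target FLOPs."""
    return (ratio - target) ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8, 8))
    prev_out = rng.standard_normal((6, 8, 8))
    weight = rng.standard_normal((6, 4))
    decision = triple_gate(rng.standard_normal((3, 8, 8)))
    out = gated_pointwise_conv(x, prev_out, weight, decision, light_channels=2)
    ratio = flops_ratio(decision, c_out=6, light_channels=2)
    print(out.shape, ratio, sparsity_penalty(ratio, target=0.58))
```

The boolean indexing here only shows which computations a gate would skip; it does not by itself make NumPy faster than a dense matrix product. In a training objective, a term such as `sparsity_penalty` would be added to the segmentation loss with a weighting factor, and the target of 0.58 in the usage example simply mirrors the "up to 42% FLOPs" saving quoted in the abstract.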