Video semantic segmentation (VSS) is computationally expensive because it requires per-frame prediction on high-frame-rate videos. Recent work has proposed compact models or adaptive network strategies for efficient VSS. However, these approaches overlook a crucial factor that affects computational cost from the input side: the input resolution. In this paper, we propose an altering resolution framework for compressed videos, called AR-Seg, to achieve efficient VSS. AR-Seg reduces computational cost by processing non-keyframes at low resolution. To prevent the performance degradation caused by downsampling, we design a Cross Resolution Feature Fusion (CReFF) module and supervise it with a novel Feature Similarity Training (FST) strategy. Specifically, CReFF first uses the motion vectors stored in a compressed video to warp features from high-resolution keyframes to low-resolution non-keyframes for better spatial alignment, and then selectively aggregates the warped features with a local attention mechanism. Furthermore, FST supervises the aggregated features with high-resolution features through an explicit similarity loss and an implicit constraint from the shared decoding layer. Extensive experiments on CamVid and Cityscapes show that AR-Seg achieves state-of-the-art performance and is compatible with different segmentation backbones. On CamVid, AR-Seg saves 67% of the computational cost (measured in GFLOPs) with the PSPNet18 backbone while maintaining high segmentation accuracy. Code: https://github.com/THU-LYJ-Lab/AR-Seg.
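To make the CReFF and FST description more concrete, the following is a minimal NumPy sketch of the three steps the abstract names: warping keyframe features with motion vectors, aggregating the warped features with local attention, and measuring feature similarity. It is an illustration under stated assumptions, not the authors' implementation. The function names, the nearest-neighbour sampling, the 3x3 attention window, the residual addition, and the use of mean squared error as the similarity loss are all choices made here for brevity. The sketch also assumes that the non-keyframe features have already been resized to the keyframe feature grid and that the motion vectors are expressed in feature-grid units.

```python
import numpy as np


def warp_with_motion_vectors(key_feat, mv):
    """Backward-warp keyframe features onto a non-keyframe grid.

    key_feat: (C, H, W) keyframe features.
    mv: (2, H, W) displacements (dx, dy) from each non-keyframe location
        to its source location in the keyframe (assumed convention).
    Uses nearest-neighbour sampling with border clamping (a simplification).
    """
    _, h, w = key_feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + mv[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + mv[1]).astype(int), 0, h - 1)
    return key_feat[:, src_y, src_x]


def local_attention_fuse(lr_feat, warped, k=3):
    """Fuse warped keyframe features into non-keyframe features.

    Each location of lr_feat acts as a query over the k x k neighbourhood
    of warped features around it; the attended result is added residually.
    """
    c, h, w = lr_feat.shape
    pad = k // 2
    padded = np.pad(warped, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    # Stack the k*k shifted views: shape (k*k, C, H, W).
    neigh = np.stack(
        [padded[:, dy:dy + h, dx:dx + w] for dy in range(k) for dx in range(k)]
    )
    # Scaled dot-product scores between query and each neighbour: (k*k, H, W).
    scores = (neigh * lr_feat[None]).sum(axis=1) / np.sqrt(c)
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)
    attended = (weights[:, None] * neigh).sum(axis=0)  # (C, H, W)
    return lr_feat + attended


def similarity_loss(fused, hr_feat):
    """Explicit similarity term, taken here to be mean squared error."""
    return float(np.mean((fused - hr_feat) ** 2))
```

In a training loop built on this sketch, `similarity_loss` would compare the fused non-keyframe features against features extracted from the same frame at high resolution, while the implicit constraint mentioned in the abstract would come from passing both feature sets through one shared decoding layer.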