Recent approaches to fast semantic video segmentation reduce redundancy by warping feature maps across adjacent frames, which greatly speeds up inference. However, accuracy drops considerably owing to the errors introduced by warping. In this paper, we propose a novel framework that adds a simple and effective correction stage after warping. Specifically, we build a non-key-frame CNN that fuses warped context features with current spatial details. Based on this feature fusion, our Context Feature Rectification~(CFR) module learns how the warping-based model differs from a per-frame model and uses this difference to correct the warped features. Furthermore, our Residual-Guided Attention~(RGA) module uses the residual maps in the compressed domain to help CFR focus on error-prone regions. Results on Cityscapes show that accuracy increases significantly from $67.3\%$ to $71.6\%$, while speed drops only slightly from $65.5$ FPS to $61.8$ FPS at a resolution of $1024\times 2048$. For non-rigid categories, e.g., ``human'' and ``object'', the improvements exceed 18 percentage points.
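The warp-then-correct idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`warp_features`, `residual_attention`, `rectify`), the nearest-neighbour warp, the max-normalised attention, and the additive form of the correction are all assumptions made for illustration, and the `correction` array merely stands in for whatever a learned CFR-style module would predict.

```python
import numpy as np


def warp_features(key_feat, motion):
    """Nearest-neighbour backward warp of key-frame features (illustrative only).

    key_feat: array of shape (C, H, W).
    motion:   array of shape (2, H, W); motion[0] is the row offset and
              motion[1] the column offset from each target pixel to its
              source location in the key frame (an assumed convention).
    """
    _, height, width = key_feat.shape
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    # Round to the nearest source pixel and clamp to the image borders.
    src_y = np.clip(np.rint(ys + motion[0]).astype(int), 0, height - 1)
    src_x = np.clip(np.rint(xs + motion[1]).astype(int), 0, width - 1)
    return key_feat[:, src_y, src_x]


def residual_attention(residual):
    """Map a residual map of shape (H, W) to weights in [0, 1].

    Assumption: a larger residual magnitude marks a region where warping
    is more likely to be wrong, so it receives a larger weight.
    """
    magnitude = np.abs(residual).astype(float)
    peak = magnitude.max()
    if peak == 0:
        return np.zeros_like(magnitude)
    return magnitude / peak


def rectify(warped, correction, attention):
    """Add an attention-weighted correction to the warped features.

    `correction` is a placeholder for the output of a learned module;
    here it is simply an array of shape (C, H, W).
    """
    return warped + attention[None, :, :] * correction
```

In this sketch, regions with zero residual keep their warped features unchanged, while regions with a large residual receive most of the correction, which is one plausible reading of how residual maps could steer the correction toward error-prone regions.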