Previous methods based on 3D CNNs, ConvLSTM, or optical flow have achieved great success in video salient object detection (VSOD). However, they still suffer from high computational cost or low-quality saliency maps. To address these problems, we design a space-time memory (STM)-based network that extracts temporal information useful for the current frame from its adjacent frames and serves as the temporal branch for VSOD. Furthermore, previous methods consider only single-frame prediction without temporal association, so the model may not attend sufficiently to temporal information. We therefore introduce inter-frame object motion prediction into VSOD for the first time. Our model follows a standard encoder-decoder architecture. In the encoding stage, we generate high-level temporal features from the high-level features of the current frame and its adjacent frames, which is more efficient than optical flow-based methods. In the decoding stage, we propose an effective fusion strategy for the spatial and temporal branches: the semantic information in the high-level features guides the fusion of object details from the low-level features, and the spatiotemporal features are then obtained step by step to reconstruct the saliency maps. Moreover, inspired by the boundary supervision commonly used in image salient object detection (ISOD), we design a motion-aware loss for predicting object boundary motion and perform multitask learning for VSOD and object motion prediction simultaneously, which further helps the model extract spatiotemporal features accurately and preserve object integrity. Extensive experiments on several datasets demonstrate the effectiveness of our method, which achieves state-of-the-art results on some of them. The proposed model requires neither optical flow nor other preprocessing and reaches nearly 100 FPS during inference.
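The abstract does not spell out how the temporal branch reads from adjacent frames, so the following is only a minimal sketch of a generic space-time memory read, assuming dot-product similarity between key vectors and a softmax over all memory locations. The function name `memory_read`, the flattened list-of-vectors layout, and the key/value split are illustrative assumptions, not the authors' implementation.

```python
import math


def memory_read(query_keys, memory_keys, memory_values):
    """Hypothetical space-time memory read (illustrative, not the paper's code).

    query_keys:    list of Q key vectors from the current frame, each of length C
    memory_keys:   list of M key vectors from adjacent frames, each of length C
    memory_values: list of M value vectors from adjacent frames, each of length D
    Returns a list of Q vectors of length D, one temporal feature per query location.
    """
    value_dim = len(memory_values[0])
    read_out = []
    for q in query_keys:
        # Dot-product similarity between this query location and every memory location.
        scores = [sum(a * b for a, b in zip(q, k)) for k in memory_keys]
        # Numerically stable softmax over the memory locations.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Attention-weighted sum of the memory values.
        read_out.append(
            [sum(w * v[d] for w, v in zip(weights, memory_values)) for d in range(value_dim)]
        )
    return read_out
```

In this sketch each location in the current frame retrieves features from the adjacent-frame locations it most resembles, so no explicit optical flow is computed. A real network would run this on learned feature maps with batched tensor operations rather than Python lists.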