Salient object detection (SOD) in panoramic video is still at an early stage of exploration. Indirectly applying 2D video SOD methods to panoramic video leaves several challenges unresolved, including low detection accuracy, high model complexity, and poor generalization. To address these challenges, we design an Inter-Layer Attention (ILA) module, an Inter-Layer Weight (ILW) module, and a Bi-Modal Attention (BMA) module. Based on these modules, we propose a Spatial-Temporal Dual-Mode Mixed Flow Network (STDMMF-Net) that exploits the spatial flow of panoramic video and the corresponding optical flow for SOD. First, the ILA module computes attention between adjacent-level features of consecutive panoramic video frames to improve the accuracy of salient object feature extraction from the spatial flow. Then, the ILW module quantifies the salient object information contained in the features of each level to improve the efficiency of fusing those features in the mixed flow. Finally, the BMA module improves the detection accuracy of STDMMF-Net. Extensive subjective and objective experiments show that the proposed method achieves better detection accuracy than state-of-the-art (SOTA) methods. Moreover, it offers better overall performance in terms of inference memory, testing time, model complexity, and generalization.
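The abstract does not specify how the ILW module computes its weights, so the following is only an illustrative sketch of the general idea of weighting per-level features before fusing them. Everything in it is an assumption rather than the authors' method: the function names `level_scores` and `fuse_levels` are hypothetical, the per-level score is taken to be the mean absolute activation, the weights are obtained with a softmax, and all levels are assumed to have already been resized to a common shape.

```python
import numpy as np


def level_scores(features):
    """Hypothetical saliency score per level: mean absolute activation.

    This scoring rule is an assumption made for illustration only.
    """
    return [float(np.mean(np.abs(f))) for f in features]


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - np.max(x))
    return e / e.sum()


def fuse_levels(features, scores):
    """Fuse same-shaped feature maps as a weighted sum.

    features: list of arrays, each of shape (C, H, W)
    scores:   one scalar per level; higher means more salient information
    Returns the fused (C, H, W) map and the normalized weights.
    """
    weights = softmax(scores)
    stacked = np.stack(features, axis=0)             # (L, C, H, W)
    fused = np.tensordot(weights, stacked, axes=1)   # contract over L
    return fused, weights


# Three toy "levels" with increasing activation magnitude.
features = [np.ones((2, 3, 3)) * k for k in (1.0, 2.0, 3.0)]
fused, weights = fuse_levels(features, level_scores(features))
```

In this toy setup, levels with larger activations receive larger weights and the weights sum to one, so the fused map is a convex combination of the per-level features. A learned weighting would replace `level_scores` in practice.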