In recent years, finding an effective and efficient strategy for exploiting spatial and temporal information has been an active research topic in video saliency prediction (VSP). The emergence of spatio-temporal transformers has effectively compensated for the weakness of prior strategies, e.g., 3D convolutional networks and LSTM-based networks, in capturing long-range dependencies. While VSP has benefited from spatio-temporal transformers, finding the most effective way of aggregating temporal features remains challenging. To address this concern, we propose a transformer-based video saliency prediction approach with a high temporal dimension decoding network (THTD-Net). This strategy accounts for the lack of complex hierarchical interactions between the features extracted from the transformer-based spatio-temporal encoder: in particular, it does not require multiple decoders and instead gradually reduces the temporal dimension of the features in the decoder. This decoder-based architecture yields performance comparable to multi-branch and overly complicated models on common benchmarks such as DHF1K, UCF-sports and Hollywood-2.
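The idea of gradually reducing the temporal dimension in a single decoder can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the clip length of 8 frames, the factor-of-two reduction per stage, the flattened per-frame feature layout, and the use of plain pairwise averaging as a stand-in for whatever learned temporal operator the decoder applies. The names `halve_temporal` and `decode` are hypothetical.

```python
from typing import List, Tuple

Frame = List[float]  # flattened per-frame feature vector (hypothetical layout)


def halve_temporal(frames: List[Frame]) -> List[Frame]:
    """Merge adjacent frame pairs by averaging.

    Stand-in for a learned stride-2 temporal operator; the fixed averaging
    weights are an assumption, not the paper's method.
    """
    return [
        [(a + b) / 2.0 for a, b in zip(frames[i], frames[i + 1])]
        for i in range(0, len(frames), 2)
    ]


def decode(frames: List[Frame]) -> Tuple[Frame, List[int]]:
    """Shrink the temporal axis stage by stage until one frame is left."""
    n = len(frames)
    if n == 0 or n & (n - 1) != 0:
        # This sketch only handles power-of-two clip lengths.
        raise ValueError("temporal length must be a power of two")
    schedule = [n]  # temporal length seen at each decoder stage
    while len(frames) > 1:
        frames = halve_temporal(frames)
        schedule.append(len(frames))
    return frames[0], schedule


# Toy clip: 8 frames, each with a 2-element feature vector.
clip = [[float(t), float(2 * t)] for t in range(8)]
saliency_features, schedule = decode(clip)
print(schedule)           # temporal length per stage
print(saliency_features)  # features of the single remaining frame
```

The point of the sketch is only the shape schedule: one decoder path collapses the temporal axis in small steps rather than in a single projection or through several parallel decoder branches.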