We introduce ViDaS, a two-stream, fully convolutional Video, Depth-Aware Saliency network to address the problem of attention modeling ``in-the-wild", via saliency prediction in videos. Contrary to existing visual saliency approaches using only RGB frames as input, our network employs also depth as an additional modality. The network consists of two visual streams, one for the RGB frames, and one for the depth frames. Both streams follow an encoder-decoder approach and are fused to obtain a final saliency map. The network is trained end-to-end and is evaluated in a variety of different databases with eye-tracking data, containing a wide range of video content. Although the publicly available datasets do not contain depth, we estimate it using three different state-of-the-art methods, to enable comparisons and a deeper insight. Our method outperforms in most cases state-of-the-art models and our RGB-only variant, which indicates that depth can be beneficial to accurately estimating saliency in videos displayed on a 2D screen. Depth has been widely used to assist salient object detection problems, where it has been proven to be very beneficial. Our problem though differs significantly from salient object detection, since it is not restricted to specific salient objects, but predicts human attention in a more general aspect. These two problems not only have different objectives, but also different ground truth data and evaluation metrics. To our best knowledge, this is the first competitive deep learning video saliency estimation approach that combines both RGB and Depth features to address the general problem of saliency estimation ``in-the-wild". The code will be publicly released.
翻译:我们提出ViDaS,一种双流全卷积视频深度感知显著性网络,旨在通过视频中的显著性预测解决“野外场景”注意力建模问题。与仅使用RGB帧作为输入的现有视觉显著性方法不同,我们的网络还将深度作为附加模态。该网络包含两个视觉流,分别处理RGB帧和深度帧。两个流均采用编码器-解码器架构,并通过融合获得最终显著性图。网络以端到端方式训练,并在包含广泛视频内容的多个眼动数据数据库上进行了评估。尽管公开数据集不包含深度信息,但我们采用三种不同先进方法对其进行估计,以便进行对比和深入分析。我们的方法在大多数情况下优于现有最先进模型及仅使用RGB输入的变体,这表明深度信息有助于准确估计在2D屏幕上显示的视频中的显著性。深度信息已被广泛用于辅助显著目标检测问题,并已被证明非常有效。然而,我们的问题与显著目标检测存在显著差异:它不局限于特定显著目标,而是从更广泛的角度预测人类注意力。这两个问题不仅目标不同,而且真实标记数据和评估指标也不同。据我们所知,这是首个结合RGB和深度特征以应对“野外场景”通用显著性估计问题的具有竞争力的深度学习视频显著性估计方法。代码将公开发布。