Object segmentation in videos is usually accomplished by processing appearance and motion information separately with standard 2D convolutional networks, followed by a learned fusion of the two sources of information. 3D convolutional networks, on the other hand, have been applied successfully to video classification, but they have not been used as effectively for problems that require dense per-pixel interpretation of videos, where they lag behind their 2D convolutional counterparts in performance. In this work, we show that 3D CNNs can be effectively applied to dense video prediction tasks such as salient object segmentation. We propose a simple yet effective encoder-decoder network architecture consisting entirely of 3D convolutions that can be trained end-to-end using a standard cross-entropy loss. To this end, we leverage an efficient 3D encoder and propose a 3D decoder architecture that comprises novel 3D Global Convolution layers and 3D Refinement modules. Our approach outperforms existing state-of-the-art methods by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal benchmarks while also being faster, showing that our architecture can efficiently learn expressive spatio-temporal features and produce high-quality video segmentation masks. We have made our code and trained models publicly available at https://github.com/sabarim/3DC-Seg.
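To make the training objective concrete, the following is a minimal sketch of a per-pixel cross-entropy loss averaged over every frame of a clip. It is an illustration only: the abstract states that a standard cross-entropy loss is used, while the binary (foreground/background) formulation, the function name `clip_cross_entropy`, the nested-list `[T][H][W]` layout, and the toy values are all assumptions made here and are not taken from the released code.

```python
import math


def clip_cross_entropy(probs, masks, eps=1e-7):
    """Mean binary cross-entropy over every pixel of every frame.

    probs: nested list [T][H][W] of predicted foreground probabilities.
    masks: nested list [T][H][W] of ground-truth labels (0 or 1).
    Both names and the layout are hypothetical, chosen for illustration.
    """
    total, count = 0.0, 0
    for frame_p, frame_m in zip(probs, masks):
        for row_p, row_m in zip(frame_p, frame_m):
            for p, y in zip(row_p, row_m):
                p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
                total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
                count += 1
    return total / count


# Hypothetical clip: 2 frames of 2x2 pixels.
masks = [[[1, 0], [0, 0]], [[1, 1], [0, 0]]]
confident = [[[0.9, 0.1], [0.1, 0.1]], [[0.9, 0.9], [0.1, 0.1]]]
uninformed = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]

loss_confident = clip_cross_entropy(confident, masks)
loss_uninformed = clip_cross_entropy(uninformed, masks)
```

In this toy case the confident, correct prediction yields a lower loss than the uninformed one, which is the signal that drives end-to-end training. A practical implementation would compute the same quantity on tensors with a deep learning framework rather than nested lists.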