Multiscale video transformers have been explored in a wide variety of vision tasks. To date, however, multiscale processing has been confined to either the encoder or the decoder alone. We present a unified multiscale encoder-decoder transformer focused on dense prediction tasks in videos. Multiscale representation at both the encoder and the decoder yields two key benefits: at encoding, implicit extraction of spatiotemporal features (i.e., without reliance on input optical flow) together with temporal consistency; and at decoding, coarse-to-fine detection in which high-level (e.g., object) semantics guide precise localization. Moreover, we propose a transductive learning scheme based on many-to-many label propagation to provide temporally consistent predictions. We showcase our Multiscale Encoder-Decoder Video Transformer (MED-VT) on Automatic Video Object Segmentation (AVOS) and actor/action segmentation, where we outperform state-of-the-art approaches on multiple benchmarks using only raw images, without optical flow.