Existing online video segmentation models typically combine a per-frame segmenter with complex, specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, given sufficient capacity and large-scale pre-training, can perform accurate image segmentation without specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this propagation with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x-10x faster, running at up to 160 FPS with a ViT-L backbone. Code: https://www.tue-mps.org/videomt/
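The query propagation and fusion loop described above can be illustrated with a minimal sketch. It is not the released implementation: the fusion operator is assumed here to be an element-wise sum, and `toy_encoder`, `fuse_queries`, and `run_online` are hypothetical names, with `toy_encoder` standing in for the ViT encoder that would jointly process frame tokens and queries.

```python
from typing import Callable, List, Optional

Vector = List[float]
Queries = List[Vector]


def fuse_queries(propagated: Optional[Queries], learned: Queries) -> Queries:
    """Fuse propagated and learned queries.

    Assumption: fusion is an element-wise sum. The abstract only says the
    two query sets are "combined", so the real operator may differ.
    """
    if propagated is None:
        # First frame: nothing to propagate, so start from the learned queries.
        return [list(q) for q in learned]
    return [
        [p + l for p, l in zip(prop_q, learned_q)]
        for prop_q, learned_q in zip(propagated, learned)
    ]


def run_online(
    frames: List[Vector],
    learned: Queries,
    encode: Callable[[Vector, Queries], Queries],
) -> List[Queries]:
    """Process frames one at a time, reusing the previous frame's output queries."""
    previous: Optional[Queries] = None
    outputs: List[Queries] = []
    for frame in frames:
        queries_in = fuse_queries(previous, learned)
        queries_out = encode(frame, queries_in)
        outputs.append(queries_out)
        previous = queries_out  # propagate to the next frame
    return outputs


def toy_encoder(frame: Vector, queries: Queries) -> Queries:
    """Hypothetical stand-in for the encoder: shift each query by the frame mean."""
    mean = sum(frame) / len(frame)
    return [[x + mean for x in q] for q in queries]


# Usage: two learned queries of dimension 2, two toy "frames".
learned_queries = [[1.0, 0.0], [0.0, 1.0]]
video = [[2.0, 2.0], [4.0, 4.0]]
per_frame_queries = run_online(video, learned_queries, toy_encoder)
```

In this sketch, query slot i keeps the same index in every frame, which is what would let a slot follow one object over time without a separate tracking module, while the learned queries re-enter at each frame so that newly appearing content can still be picked up.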