Prevailing Video Object Segmentation (VOS) methods usually extract features from the current and reference frames and then perform dense matching between them. On one hand, this decoupled modeling restricts the propagation of target information to the high-level feature space. On the other hand, pixel-wise matching leads to a lack of holistic understanding of the targets. To overcome these issues, we propose a unified VOS framework, named JointFormer, that jointly models three elements: features, correspondence, and a compressed memory. The core design is the Joint Block, which uses the flexibility of attention to simultaneously extract features and propagate target information to the current tokens and the compressed memory token. This scheme enables extensive information propagation and discriminative feature learning. To incorporate long-term temporal target information, we also devise a customized online updating mechanism for the compressed memory token, which promotes information flow along the temporal dimension and thus improves the global modeling capability. With this design, our method achieves new state-of-the-art performance on the DAVIS 2017 val/test-dev (89.7% and 87.6%) and YouTube-VOS 2018/2019 val (87.0% and 87.0%) benchmarks, outperforming existing works by a large margin.
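To make the joint-modeling idea concrete, the following is a minimal sketch and not the actual JointFormer implementation. It assumes a single attention head with no learned projections, lets every token attend to every other token, and carries the updated memory token forward unchanged; the abstract does not specify any of these details. The names `joint_attention` and `joint_block_step` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(tokens):
    """Single-head self-attention over one concatenated token sequence.

    Assumption: queries, keys and values are the tokens themselves
    (no learned projections), purely to keep the sketch short.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for query in tokens:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        mixed.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return mixed


def joint_block_step(reference_tokens, current_tokens, memory_token):
    """Hypothetical joint step: one attention pass over all three token groups.

    Reference, current, and memory tokens share a single sequence, so target
    information reaches the current tokens and the memory token in the same pass.
    """
    tokens = reference_tokens + current_tokens + [memory_token]
    mixed = joint_attention(tokens)
    n_ref, n_cur = len(reference_tokens), len(current_tokens)
    new_current = mixed[n_ref:n_ref + n_cur]
    new_memory = mixed[-1]  # assumed to be carried to the next frame as-is
    return new_current, new_memory


# Toy usage: two reference tokens, one current token, a zero-initialised memory token.
reference = [[1.0, 0.0], [0.0, 1.0]]
current = [[0.5, 0.5]]
memory = [0.0, 0.0]
new_current, new_memory = joint_block_step(reference, current, memory)
```

In this toy setting the zero-valued memory token attends uniformly to all tokens, so its update is simply their mean. Feeding `new_memory` into the next frame's call is one simple way to picture information flowing along the temporal dimension.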