Video segmentation aims to accurately segment and track every pixel across diverse scenarios. In this paper, we present Tube-Link, a versatile framework that addresses multiple core tasks of video segmentation with a unified architecture. Our framework is a near-online approach that takes a short subclip as input and outputs the corresponding spatial-temporal tube masks. To enhance the modeling of cross-tube relationships, we propose an effective way to perform tube-level linking via attention along the queries. In addition, we introduce temporal contrastive learning to learn instance-wise discriminative features for tube-level association. Our approach offers flexibility and efficiency for both short and long video inputs, as the length of each subclip can be varied according to the needs of the dataset or scenario. Tube-Link outperforms existing specialized architectures by a significant margin on five video segmentation datasets. Specifically, it achieves a relative improvement of almost 13% on VIPSeg and a 4% improvement on KITTI-STEP over the strong baseline Video K-Net. With a ResNet-50 backbone, Tube-Link improves over IDOL by 3% on YouTube-VIS-2019 and by 4% on YouTube-VIS-2021.
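To make the near-online setting concrete, the following is a minimal sketch, not the Tube-Link implementation. It assumes that each subclip yields one embedding vector per predicted tube, and that tubes in adjacent subclips are associated by greedy cosine-similarity matching of those embeddings. The function names `split_into_subclips`, `cosine`, and `link_tubes` are hypothetical, and the matching rule is an illustrative stand-in: the abstract states only that linking is learned via attention along the queries and that temporal contrastive learning makes the features discriminative for association.

```python
import math
from typing import List, Sequence


def split_into_subclips(num_frames: int, clip_len: int) -> List[range]:
    """Partition frame indices into consecutive, non-overlapping subclips.

    clip_len is the adjustable subclip length; the last subclip may be shorter.
    """
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return [
        range(start, min(start + clip_len, num_frames))
        for start in range(0, num_frames, clip_len)
    ]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_tubes(
    prev: Sequence[Sequence[float]], curr: Sequence[Sequence[float]]
) -> List[int]:
    """Greedily match tube embeddings of the current subclip to the previous one.

    Returns, for each current tube, the index of its matched previous tube,
    or -1 if no previous tube is left to match (assumed here to be a new instance).
    """
    # Score every (previous, current) pair and visit the most similar pairs first.
    pairs = sorted(
        (
            (cosine(p, c), i, j)
            for i, p in enumerate(prev)
            for j, c in enumerate(curr)
        ),
        reverse=True,
    )
    assignment = [-1] * len(curr)
    used_prev = set()
    for _, i, j in pairs:
        # Enforce a one-to-one matching between previous and current tubes.
        if assignment[j] == -1 and i not in used_prev:
            assignment[j] = i
            used_prev.add(i)
    return assignment


if __name__ == "__main__":
    # A 7-frame video with subclips of length 3 gives subclips of 3, 3, and 1 frames.
    print([list(clip) for clip in split_into_subclips(7, 3)])
    # Two tubes whose embeddings swap order between adjacent subclips.
    print(link_tubes([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]))
```

Greedy matching keeps the sketch dependency-free. An optimal one-to-one assignment, for example with `scipy.optimize.linear_sum_assignment` on the negated similarity matrix, could be substituted without changing the interface.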