Moving object segmentation is a crucial task for achieving a high-level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often results in imperfect predictions due to challenges such as partial motion, complex deformations, motion blur, and background distractions. We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive experiments on diverse datasets demonstrate state-of-the-art performance, excelling in challenging scenarios and in fine-grained segmentation of multiple objects. Our code is available at https://motion-seg.github.io/.
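To make the two named components concrete, the following is a minimal PyTorch sketch of one plausible reading of Spatio-Temporal Trajectory Attention (attention along each trajectory's time axis, then across trajectories per frame) and Motion-Semantic Decoupled Embedding (separate projections of trajectory motion features and DINO features before fusion). It is not the authors' implementation; all module names, feature dimensions, and the final SAM2 prompting step are illustrative assumptions.

```python
# Illustrative sketch only; shapes and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn


class SpatioTemporalTrajectoryAttention(nn.Module):
    """Attend over long-range point trajectories: first along time, then across trajectories."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_trajectories, num_frames, dim)
        b, n, t, d = x.shape
        # Temporal attention: each trajectory attends over its own frames.
        h = x.reshape(b * n, t, d)
        h = h + self.temporal_attn(self.norm1(h), self.norm1(h), self.norm1(h))[0]
        # Spatial attention: within each frame, trajectories attend to one another.
        h = h.reshape(b, n, t, d).permute(0, 2, 1, 3).reshape(b * t, n, d)
        h = h + self.spatial_attn(self.norm2(h), self.norm2(h), self.norm2(h))[0]
        return h.reshape(b, t, n, d).permute(0, 2, 1, 3)


class MotionSemanticDecoupledEmbedding(nn.Module):
    """Keep motion and DINO semantic cues in separate embeddings before fusing them,
    so motion can act as the primary signal (an assumption about the design intent)."""

    def __init__(self, motion_dim: int = 128, dino_dim: int = 768, dim: int = 256):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, dim)
        self.semantic_proj = nn.Linear(dino_dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, motion_feats: torch.Tensor, dino_feats: torch.Tensor) -> torch.Tensor:
        m = self.motion_proj(motion_feats)   # e.g. trajectory displacement features
        s = self.semantic_proj(dino_feats)   # DINO features sampled at trajectory points
        return self.fuse(torch.cat([m, s], dim=-1))


if __name__ == "__main__":
    b, n, t = 2, 64, 16
    emb = MotionSemanticDecoupledEmbedding()
    attn = SpatioTemporalTrajectoryAttention(dim=256)
    tokens = emb(torch.randn(b, n, t, 128), torch.randn(b, n, t, 768))
    out = attn(tokens)                             # (b, n, t, 256) per-trajectory features
    logits = nn.Linear(256, 1)(out.mean(dim=2))    # per-trajectory moving/static score
    # Trajectories predicted as dynamic would then serve as point prompts for SAM2,
    # iteratively refined into dense pixel-level masks (SAM2 call omitted here).
    print(logits.shape)  # torch.Size([2, 64, 1])
```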