The Segment Anything Model 2 (SAM 2) has emerged as a powerful foundation model for object segmentation in both images and videos, paving the way for various downstream video applications. The crucial design of SAM 2 for video segmentation is its memory module, which retrieves object-aware memories from previous frames to condition the current frame's prediction. However, its greedy memory selection suffers from error accumulation: an erroneous or missing mask cascades into the segmentation of subsequent frames, limiting SAM 2's performance on complex long-term videos. To this end, we introduce SAM2Long, an improved training-free video object segmentation strategy that accounts for the segmentation uncertainty within each frame and selects the video-level optimal result from multiple segmentation pathways via a constrained tree search. In practice, we maintain a fixed number of segmentation pathways throughout the video. For each frame, multiple candidate masks are proposed from the existing pathways, spawning a set of candidate branches; we then retain the same fixed number of branches with the highest cumulative scores as the pathways for the next frame. After the final frame is processed, the pathway with the highest cumulative score is chosen as the final segmentation result. Owing to this heuristic search design, SAM2Long is robust to occlusions and object reappearances, and can effectively segment and track objects in complex long-term videos. Notably, SAM2Long achieves an average improvement of 3.0 points across all 24 head-to-head comparisons, with gains of up to 5.3 points in J&F on long-term video object segmentation benchmarks such as SA-V and LVOS. The code is released at https://github.com/Mark12Ding/SAM2Long.
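The constrained tree search described above can be sketched as a beam-search loop over segmentation pathways. This is a minimal illustrative sketch, not the released implementation: `propose_masks` is a hypothetical stand-in for SAM 2's memory-conditioned mask decoder, returning candidate masks with confidence scores for each frame, and `num_pathways` plays the role of the fixed pathway count the abstract mentions.

```python
import heapq
from typing import Callable, List, Tuple

def sam2long_beam_search(
    num_frames: int,
    propose_masks: Callable[[int, int], List[Tuple[int, float]]],
    num_pathways: int = 3,
) -> Tuple[List[int], float]:
    """Constrained tree search over segmentation pathways.

    `propose_masks(frame_idx, latest_mask_id)` is a hypothetical stand-in
    for SAM 2's mask decoder: it returns candidate (mask_id, confidence)
    pairs for the given frame, conditioned on the pathway's latest mask.
    """
    # Each pathway: (cumulative_score, mask_id_per_frame, latest_mask_id).
    pathways: List[Tuple[float, List[int], int]] = [(0.0, [], 0)]
    for t in range(num_frames):
        # Expand every surviving pathway with every candidate mask.
        branches = []
        for score, history, latest in pathways:
            for mask_id, conf in propose_masks(t, latest):
                branches.append((score + conf, history + [mask_id], mask_id))
        # Keep only the fixed number of highest-scoring branches.
        pathways = heapq.nlargest(num_pathways, branches, key=lambda b: b[0])
    # The pathway with the highest cumulative score is the final result.
    best_score, best_history, _ = max(pathways, key=lambda p: p[0])
    return best_history, best_score
```

With a toy two-candidate proposer, the search keeps both hypotheses alive per frame and recovers the consistently high-confidence track, illustrating how a single low-confidence frame need not doom the whole video as it would under greedy selection.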