Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval

Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks; however, its heavy computational burden hinders its application in real-time video processing. Although there have been efforts to improve the efficiency of SAM2, most of them focus on retraining a lightweight backbone, with little exploration of post-training acceleration. In this paper, we observe that SAM2 exhibits a sparse perception pattern similar to that of biological vision, which provides opportunities to eliminate redundant computation and accelerate inference: i) In the mask decoder, attention primarily focuses on foreground objects, whereas the image encoder in the earlier stage exhibits a broad attention span, which results in unnecessary computation on background regions. ii) In the memory bank, only a small subset of tokens in each frame contributes significantly to memory attention, and the salient regions exhibit temporal consistency, making full-token computation redundant. With these insights, we propose Efficient-SAM2, which enables SAM2 to adaptively focus on object regions while eliminating task-irrelevant computation, thereby significantly improving inference efficiency. Specifically, for the image encoder, we propose object-aware Sparse Window Routing (SWR), a window-level computation allocation mechanism that leverages the consistency and saliency cues from the previous-frame decoder to route background regions into a lightweight shortcut branch. Moreover, for memory attention, we propose object-aware Sparse Memory Retrieval (SMR), which allows only the salient memory tokens in each frame to participate in computation, with the saliency pattern reused from their first recollection. With negligible additional parameters and minimal training overhead, Efficient-SAM2 delivers a 1.68x speedup on the SAM2.1-L model with only a 1.0% accuracy drop on the SA-V test set.
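To make the two mechanisms concrete, the following NumPy sketch illustrates the general idea under simplifying assumptions; it is not the authors' implementation. All names (`route_windows`, `encode_sparse`, `SparseMemory`, `keep_ratio`, `threshold`) are hypothetical. The sketch assumes that the previous-frame mask stands in for the decoder's consistency and saliency cues, that a window is "foreground" when its mask coverage exceeds a threshold, and that token saliency is the attention mass a memory token receives the first time its frame is recalled.

```python
import numpy as np

def route_windows(prev_mask, window, threshold=0.0):
    """Mark windows that overlap the previous-frame mask (assumed routing cue)."""
    h, w = prev_mask.shape
    assert h % window == 0 and w % window == 0
    # Foreground coverage of each window, as a fraction in [0, 1].
    coverage = prev_mask.reshape(h // window, window, w // window, window).mean(axis=(1, 3))
    return coverage > threshold

def encode_sparse(windows, active, heavy_fn, light_fn):
    """Send active windows through heavy_fn and the rest through light_fn.

    windows: (num_windows, tokens, dim); active: (num_windows,) boolean.
    """
    out = np.empty_like(windows)
    out[active] = heavy_fn(windows[active])    # full computation for object windows
    out[~active] = light_fn(windows[~active])  # lightweight shortcut for background
    return out

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    """Plain scaled dot-product attention; returns output and attention weights."""
    scores = softmax(query @ keys.T / np.sqrt(query.shape[-1]))
    return scores @ values, scores

class SparseMemory:
    """Keep only salient memory tokens per frame, reusing the first-recall pattern."""

    def __init__(self, keep_ratio=0.25):
        self.keep_ratio = keep_ratio
        self.salient = {}  # frame id -> sorted indices of salient tokens

    def retrieve(self, frame_id, query, keys, values):
        if frame_id not in self.salient:
            # First recall: attend to every token and record which ones matter.
            out, scores = attend(query, keys, values)
            k = max(1, int(round(self.keep_ratio * keys.shape[0])))
            top = np.argsort(scores.sum(axis=0))[-k:]
            self.salient[frame_id] = np.sort(top)
            return out
        # Later recalls: reuse the cached pattern and attend to salient tokens only.
        idx = self.salient[frame_id]
        out, _ = attend(query, keys[idx], values[idx])
        return out
```

In this sketch the cost of the encoder scales with the number of active windows, and the cost of memory attention on later recalls scales with `keep_ratio`; how Efficient-SAM2 actually derives its routing cues and saliency scores is described only at a high level in the abstract.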
