ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation

Segment Anything Model 3 (SAM 3) provides a strong frozen backbone for concept-prompted segmentation, but applying it directly to open-vocabulary semantic segmentation (OVSS) is inefficient: full-resolution decoding is typically run over the entire dataset vocabulary, whereas each image contains only a small active subset of classes. We introduce ActiveSAM, a training-free, zero-shot inference framework that turns SAM 3 into an active-vocabulary segmenter. ActiveSAM first canonicalizes and expands class prompts, then estimates an image-conditioned active set from a low-resolution presence preview. Only the retained classes are decoded at full resolution, using bucketed prompt multiplexing with the frozen SAM 3 decoder. The preview stage uses only class-presence evidence and skips unnecessary segmentation-head computation, while the final stage applies margin-aware background calibration to suppress low-confidence pixels. ActiveSAM requires no target-dataset training, no weight updates, and no oracle class-presence labels. Across eight OVSS benchmarks, ActiveSAM improves the speed-accuracy tradeoff of training-free open-vocabulary semantic segmentation, outperforming the current state-of-the-art SegEarth-OV3 by approximately +1.4 mIoU on average while running up to 5.5x faster on large-vocabulary datasets. ActiveSAM also demonstrates the strongest robustness under image corruption that simulates real-world distribution shift, making it well-suited for deployment in noisy-input domains such as autonomous driving and embodied AI. Code is available at https://github.com/VILA-Lab/ActiveSAM.
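The control flow described above (presence preview, active-set selection, bucketed decoding, background calibration) can be illustrated with a minimal sketch. This is not the ActiveSAM implementation: the abstract does not specify the selection rule, the bucket size, or the calibration formula, so the function names (`select_active_classes`, `make_buckets`, `calibrate_pixel`), the thresholds, and the `BACKGROUND` label below are all assumptions chosen for illustration. The SAM 3 model calls are omitted entirely; the sketch operates on scores that such a model would be assumed to produce.

```python
from typing import Dict, List, Sequence

BACKGROUND = -1  # hypothetical label meaning "no confident class"


def select_active_classes(presence: Dict[str, float],
                          threshold: float = 0.3,
                          min_keep: int = 1) -> List[str]:
    """Keep classes whose (assumed) low-resolution presence score clears a threshold."""
    ranked = sorted(presence, key=presence.get, reverse=True)
    active = [c for c in ranked if presence[c] >= threshold]
    # Fall back to the top-scoring classes so the active set is never empty.
    return active if len(active) >= min_keep else ranked[:min_keep]


def make_buckets(classes: Sequence[str], bucket_size: int) -> List[List[str]]:
    """Split retained classes into fixed-size groups, one decoder pass per group."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    return [list(classes[i:i + bucket_size])
            for i in range(0, len(classes), bucket_size)]


def calibrate_pixel(scores: Sequence[float],
                    min_conf: float = 0.5,
                    min_margin: float = 0.1) -> int:
    """Return the best class index, or BACKGROUND if the pixel is weak or ambiguous.

    Assumed rule: a pixel is background when its top score is below min_conf
    or when the gap between the top two scores is below min_margin.
    """
    if not scores:
        return BACKGROUND
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best = order[0]
    runner_up = scores[order[1]] if len(order) > 1 else 0.0
    if scores[best] < min_conf or scores[best] - runner_up < min_margin:
        return BACKGROUND
    return best


# Usage with made-up preview scores for a four-class vocabulary.
preview = {"road": 0.9, "car": 0.6, "zebra": 0.05, "boat": 0.1}
active = select_active_classes(preview)
buckets = make_buckets(active, bucket_size=2)
```

In this sketch only the classes in `active` would reach full-resolution decoding, which is where the claimed savings on large vocabularies would come from; the margin test in `calibrate_pixel` is one simple way to read "margin-aware", not necessarily the one used in the paper.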
