Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive training datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated by a pretrained monocular depth estimator and fused with mid-level RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1\% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.
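A minimal sketch of the mid-level RGB-D fusion idea described above is given below, assuming a PyTorch-style implementation. The depth branch and fusion module are illustrative: module names, channel sizes, and the concatenate-then-project fusion operator are assumptions for exposition, not the actual architecture, and the RGB features stand in for EfficientViT-SAM's image-encoder activations.

\begin{verbatim}
# Hypothetical sketch of mid-level RGB-D fusion with a dedicated depth
# encoder; names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class DepthEncoder(nn.Module):
    """Small convolutional encoder for a single-channel depth map."""

    def __init__(self, out_channels: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(128, out_channels, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, depth: torch.Tensor) -> torch.Tensor:
        return self.net(depth)


class MidLevelFusion(nn.Module):
    """Fuses mid-level RGB features with depth features (concat + 1x1 proj)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        self.depth_encoder = DepthEncoder(channels)
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        d = self.depth_encoder(depth)
        # Match the spatial resolution of the mid-level RGB features.
        d = nn.functional.interpolate(
            d, size=rgb_feat.shape[-2:], mode="bilinear", align_corners=False
        )
        return self.proj(torch.cat([rgb_feat, d], dim=1))


if __name__ == "__main__":
    fusion = MidLevelFusion(channels=256)
    rgb_feat = torch.randn(1, 256, 64, 64)  # mid-level RGB features (placeholder)
    depth = torch.randn(1, 1, 512, 512)     # depth map from a pretrained estimator
    print(fusion(rgb_feat, depth).shape)    # torch.Size([1, 256, 64, 64])
\end{verbatim}

In this sketch the fused features would replace the original RGB features fed to the downstream mask decoder; the pretrained depth estimator itself is treated as a frozen, off-the-shelf component.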