Image segmentation foundation models (SFMs) like the Segment Anything Model (SAM) have achieved impressive zero-shot and interactive segmentation across diverse domains. However, they struggle to segment objects with certain structures, particularly those with dense, tree-like morphology and low textural contrast with their surroundings. These failure modes are crucial for understanding the limitations of SFMs in real-world applications. To systematically study this issue, we introduce interpretable metrics that quantify object tree-likeness and textural separability. On carefully controlled synthetic experiments and real-world datasets, we show that SFM performance (e.g., SAM, SAM 2, HQ-SAM) correlates noticeably with these factors. We link these failures to "textural confusion", where models misinterpret local structure as global texture, causing over-segmentation or difficulty distinguishing objects from similar backgrounds. Notably, targeted fine-tuning fails to resolve this issue, indicating a fundamental limitation. Our study provides the first quantitative framework for modeling the behavior of SFMs on challenging structures, offering interpretable insights into their segmentation capabilities.