Segment Anything Model for automated image data annotation: empirical studies using text prompts from Grounding DINO

Grounding DINO and the Segment Anything Model (SAM) have achieved impressive performance in zero-shot object detection and image segmentation, respectively. Together, they have great potential to revolutionize applications in zero-shot semantic segmentation or data annotation. Yet, in specialized domains like medical image segmentation, objects of interest (e.g., organs, tissues, and tumors) may not fall within existing class names. To address this problem, the referring expression comprehension (REC) ability of Grounding DINO is leveraged to detect arbitrary targets by their language descriptions. However, recent studies have highlighted a severe limitation of the REC framework in this application setting: it tends to make false positive predictions when the target is absent from the given image. While this bottleneck is central to the prospect of open-set semantic segmentation, it is still largely unknown how much improvement can be achieved by studying the prediction errors. To this end, we perform empirical studies on six publicly available datasets across different domains and reveal that these errors consistently follow a predictable pattern and can thus be mitigated by a simple strategy. Specifically, we show that false positive detections with appreciable confidence scores generally occupy large image areas and can usually be filtered by their relative sizes. More importantly, we expect these observations to inspire future research on improving REC-based detection and automated segmentation. We also evaluate the performance of SAM on multiple datasets from various specialized domains and report significant improvements in segmentation performance and annotation time savings over manual approaches.
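
The size-based filtering idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `(x_min, y_min, x_max, y_max)` pixel box format, and the threshold values `max_relative_area=0.5` and `min_score=0.3` are all assumptions chosen for illustration, and the abstract does not state the thresholds actually used.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def relative_area(box: Box, image_width: int, image_height: int) -> float:
    """Return the fraction of the image area covered by the box."""
    x_min, y_min, x_max, y_max = box
    width = max(0.0, x_max - x_min)
    height = max(0.0, y_max - y_min)
    return (width * height) / float(image_width * image_height)


def filter_by_relative_size(
    boxes: List[Box],
    scores: List[float],
    image_width: int,
    image_height: int,
    max_relative_area: float = 0.5,  # hypothetical threshold, not from the paper
    min_score: float = 0.3,          # hypothetical threshold, not from the paper
) -> List[Tuple[Box, float]]:
    """Keep confident detections whose boxes do not cover too much of the image."""
    kept = []
    for box, score in zip(boxes, scores):
        if score < min_score:
            continue  # low-confidence detection, discard
        if relative_area(box, image_width, image_height) > max_relative_area:
            continue  # suspiciously large box, treated as a likely false positive
        kept.append((box, score))
    return kept
```

For example, on a 100 x 100 image, a confident box covering the whole image would be dropped under these assumed thresholds, while a 20 x 20 box with the same confidence would be kept and could then be passed to SAM as a box prompt.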
