The Segment Anything Model (SAM) has achieved great success in natural image segmentation. However, SAM tends to treat shadows as background and therefore does not segment them. In this paper, we propose ShadowSAM, a simple yet effective framework for fine-tuning SAM to detect shadows. By combining it with a long short-term attention mechanism, we further extend it to efficient video shadow detection. Specifically, we first fine-tune SAM on the ViSha training set using bounding boxes obtained from the ground-truth shadow masks. At inference time, we simulate user interaction by providing bounding boxes to detect shadows in a specific frame (e.g., the first frame). Then, using the detected shadow mask as a prior, we employ a long short-term network to learn spatial correlations between distant frames and temporal consistency between adjacent frames, thereby propagating shadow information precisely across video frames. Extensive experiments demonstrate the effectiveness of our method, which outperforms state-of-the-art approaches by a notable margin in terms of MAE and IoU. Moreover, our method runs inference faster than previous video shadow detection approaches, confirming its efficiency. The source code is publicly available at https://github.com/harrytea/Detect-AnyShadow.
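To make two ingredients of the abstract concrete, the following is a minimal sketch of how a bounding box might be derived from a ground-truth shadow mask, and how MAE and IoU are commonly computed for a predicted mask. It is an illustration under stated assumptions, not the released implementation: the function names (`mask_to_bbox`, `mean_absolute_error`, `intersection_over_union`), the tight XYXY box convention, and the 0.5 binarization threshold are all hypothetical choices, and the paper's exact box generation and metric definitions may differ.

```python
import numpy as np


def mask_to_bbox(mask):
    """Tight (x_min, y_min, x_max, y_max) box around nonzero pixels.

    Assumption: an inclusive XYXY box with no padding or jitter.
    Returns None when the mask contains no shadow pixels.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def mean_absolute_error(pred, gt):
    """Mean absolute per-pixel difference; inputs assumed to lie in [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.mean(np.abs(pred - gt)))


def intersection_over_union(pred, gt, threshold=0.5):
    """IoU of the two masks after binarizing at `threshold` (an assumed value)."""
    p = np.asarray(pred) > threshold
    g = np.asarray(gt) > threshold
    union = np.logical_or(p, g).sum()
    if union == 0:
        # Both masks empty: treat as perfect agreement (a convention, not a rule).
        return 1.0
    return float(np.logical_and(p, g).sum() / union)


# Toy example: a 6x8 ground-truth mask and a prediction shifted right by one pixel.
gt_mask = np.zeros((6, 8), dtype=np.float64)
gt_mask[2:4, 3:6] = 1.0
pred_mask = np.zeros((6, 8), dtype=np.float64)
pred_mask[2:4, 4:7] = 1.0

box = mask_to_bbox(gt_mask)
mae = mean_absolute_error(pred_mask, gt_mask)
iou = intersection_over_union(pred_mask, gt_mask)
print(box, round(mae, 4), iou)
```

In this toy case the box covers columns 3 to 5 and rows 2 to 3, and the shifted prediction overlaps the ground truth on 4 of the 8 pixels in their union. Lower MAE and higher IoU indicate better agreement.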