This study investigates the application and performance of the Segment Anything Model 2 (SAM2) on the challenging task of video camouflaged object segmentation (VCOS). VCOS involves detecting objects in videos that blend seamlessly into their surroundings because of similar colors and textures, poor lighting conditions, and other factors. Camouflaged objects are much more difficult to detect than objects in normal scenes. SAM2, a video foundation model, has shown potential in various tasks, but its effectiveness in dynamic camouflaged scenarios remains under-explored. We present a comprehensive study of SAM2's ability in VCOS. First, we assess SAM2's performance on camouflaged video datasets using different models and prompts (click, box, and mask). Second, we explore the integration of SAM2 with existing multimodal large language models (MLLMs) and VCOS methods. Third, we adapt SAM2 specifically by fine-tuning it on a video camouflaged dataset. Our experiments demonstrate that SAM2 has excellent zero-shot ability to detect camouflaged objects in videos. We also show that this ability can be further improved by adjusting SAM2's parameters specifically for VCOS. The code will be available at https://github.com/zhoustan/SAM2-VCOS
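The abstract names three prompt types (click, box, and mask) without describing how they are produced. As a minimal sketch only, and assuming the click and box prompts are derived from a ground-truth mask of the prompted frame (an assumption not confirmed by the source), the two sparser prompts could be computed as follows. The function names `box_prompt` and `click_prompt` are hypothetical and are not part of SAM2 or the authors' code.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]  # binary mask: 1 = object pixel, 0 = background


def _foreground(mask: Mask) -> List[Tuple[int, int]]:
    """Collect (x, y) coordinates of all object pixels."""
    return [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]


def box_prompt(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Tight bounding box (x_min, y_min, x_max, y_max); None for an empty mask."""
    coords = _foreground(mask)
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def click_prompt(mask: Mask) -> Optional[Tuple[int, int]]:
    """One object pixel (x, y): the foreground pixel closest to the centroid.

    Snapping to a real foreground pixel keeps the click on the object even
    when the centroid of a concave shape falls on the background.
    """
    coords = _foreground(mask)
    if not coords:
        return None
    cx = sum(x for x, _ in coords) / len(coords)
    cy = sum(y for _, y in coords) / len(coords)
    return min(coords, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)


# Toy 4x5 mask standing in for an annotated frame (illustrative data only).
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(box_prompt(mask))    # (1, 1, 3, 2)
print(click_prompt(mask))  # (2, 2)
```

The mask prompt needs no derivation under this assumption, since it is the annotation itself. The sketch only illustrates that the three prompt types carry decreasing amounts of spatial information, from a full mask to a box to a single point.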