In this paper, we propose an Audio-Language-Referenced SAM 2 (AL-Ref-SAM 2) pipeline to explore the training-free paradigm for audio- and language-referenced video object segmentation, namely the AVS and RVOS tasks. An intuitive solution leverages GroundingDINO to identify the target object from a single frame and SAM 2 to segment the identified object throughout the video, but this approach is less robust to spatiotemporal variations because it does not exploit video context. Thus, in our AL-Ref-SAM 2 pipeline, we propose a novel GPT-assisted Pivot Selection (GPT-PS) module that instructs GPT-4 to perform two-step temporal-spatial reasoning, sequentially selecting pivot frames and pivot boxes and thereby providing SAM 2 with a high-quality initial object prompt. Within GPT-PS, two task-specific chain-of-thought prompts are designed to unleash GPT's temporal-spatial reasoning capacity by guiding it to make selections based on a comprehensive understanding of the video and reference information. Furthermore, we propose a Language-Binded Reference Unification (LBRU) module that converts audio signals into language-formatted references, unifying the formats of the AVS and RVOS tasks within the same pipeline. Extensive experiments on both tasks show that our training-free AL-Ref-SAM 2 pipeline achieves performance comparable to or even better than that of fully supervised fine-tuning methods. The code is available at: https://github.com/appletea233/AL-Ref-SAM2.
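The pivot-selection idea can be sketched abstractly. In the sketch below, per-frame candidate boxes with detector confidences are hypothetical stand-ins (the actual pipeline obtains them via GroundingDINO and selects among them with GPT-4 reasoning, which is not reproduced here); the chosen pivot box would then seed SAM 2's video mask propagation as the initial object prompt.

```python
from dataclasses import dataclass

@dataclass
class Box:
    frame_idx: int        # frame the detection came from
    xyxy: tuple           # (x1, y1, x2, y2) in pixels
    score: float          # detector confidence for the referred object

def select_pivot(candidates: list[Box]) -> Box:
    # Stand-in for GPT-PS: the real module prompts GPT-4 to reason over
    # video context and the reference; this sketch simply takes the
    # highest-confidence candidate as the pivot frame/box pair.
    return max(candidates, key=lambda b: b.score)

# Hypothetical per-frame detections for a 3-frame clip.
candidates = [
    Box(0, (10, 10, 50, 50), 0.42),
    Box(1, (12, 11, 52, 51), 0.91),
    Box(2, (30, 20, 70, 60), 0.55),
]
pivot = select_pivot(candidates)
# pivot.frame_idx and pivot.xyxy would be handed to the video
# segmenter as its initial box prompt.
```

This isolates why pivot quality matters: every downstream mask in the video is propagated from this single box, so a poor selection (e.g., frame 0 above) compounds across the whole clip.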