Audio-based Referring Video Object Segmentation (ARVOS) requires grounding audio queries in pixel-level object masks over time, which poses the challenge of bridging acoustic signals and spatio-temporal visual representations. In this report, we present VIRST-Audio, a practical framework built on a pretrained referring video object segmentation (RVOS) model integrated with a vision-language architecture. Instead of relying on audio-specific training, we convert the input audio into text with an automatic speech recognition (ASR) module and perform segmentation using text-based supervision, which transfers text-based reasoning to audio-driven scenarios. To improve robustness, we further incorporate an existence-aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when it is absent, reducing hallucinated masks and stabilizing segmentation behavior. We evaluate our approach on the MeViS-Audio track of the 5th PVUW Challenge, where VIRST-Audio achieves 3rd place, demonstrating strong generalization and reliable performance in audio-based referring video segmentation.
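The pipeline described above (ASR, then text-conditioned segmentation, then existence-aware gating) can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: `transcribe` and `segment` stand in for the ASR module and the pretrained RVOS model, and the hard threshold of 0.5 on a single video-level existence probability is an assumption, since the abstract does not specify how the gate is computed or applied.

```python
from typing import Callable, List, Sequence, Tuple

# One binary mask per frame, stored as rows of 0/1 values (illustrative format).
Mask = List[List[int]]


def gate_masks(masks: List[Mask], existence_prob: float,
               threshold: float = 0.5) -> List[Mask]:
    """Keep the masks if the target is judged present, else return empty masks.

    `threshold` is an assumed hyperparameter, not a value from the report.
    """
    if existence_prob >= threshold:
        return masks
    # Target judged absent: suppress every frame's prediction.
    return [[[0 for _ in row] for row in mask] for mask in masks]


def segment_from_audio(
    audio: bytes,
    frames: Sequence[object],
    transcribe: Callable[[bytes], str],
    segment: Callable[[Sequence[object], str], Tuple[List[Mask], float]],
    threshold: float = 0.5,
) -> Tuple[str, List[Mask]]:
    """Audio query -> text (ASR) -> masks (text-based RVOS) -> gated masks."""
    text = transcribe(audio)                        # hypothetical ASR callable
    masks, existence_prob = segment(frames, text)   # hypothetical RVOS callable
    return text, gate_masks(masks, existence_prob, threshold)
```

With stub callables, a low existence probability yields all-zero masks for every frame, while a high one passes the model's masks through unchanged. A soft gate that scales mask logits by the existence probability would be an equally plausible reading of the abstract.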