Language-referred audio-visual segmentation (Ref-AVS) aims to segment target objects described in natural language by jointly reasoning over video, audio, and text. Beyond generating segmentation masks, providing rich and interpretable diagnoses of mask quality remains largely underexplored. In this work, we introduce Mask Quality Assessment in the Ref-AVS context (MQA-RefAVS), a new task that evaluates the quality of candidate segmentation masks without access to ground-truth annotations at inference time. Given audio-visual-language inputs and a candidate segmentation mask, the task requires estimating the mask's IoU with the unobserved ground truth, identifying its error type, and recommending an actionable quality-control decision. To support this task, we construct MQ-RAVSBench, a benchmark featuring diverse and representative mask error modes that span both geometric and semantic issues. We further propose MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information to produce quantitative and qualitative mask quality assessments. Extensive experiments demonstrate that MQ-Auditor outperforms strong open-source and commercial MLLMs and can be integrated with existing Ref-AVS systems to detect segmentation failures and support downstream segmentation improvement. Data and code will be released at https://github.com/jasongief/MQA-RefAVS.
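To make the three requested outputs concrete, the following is a minimal sketch of the quantity an auditor must estimate (mask IoU) and of a record holding one assessment. It is illustrative only and is not taken from MQ-Auditor or MQ-RAVSBench: the names `mask_iou`, `MaskAudit`, and `iou_estimation_error`, the example error-type and decision strings, and the convention that two empty masks have IoU 1.0 are all assumptions.

```python
from dataclasses import dataclass
from typing import List

# A binary mask as nested lists of 0/1 values (hypothetical representation).
Mask = List[List[int]]


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks agree perfectly.
    return inter / union if union else 1.0


@dataclass
class MaskAudit:
    """One assessment of a candidate mask (field values are hypothetical)."""
    estimated_iou: float  # predicted without seeing the ground truth
    error_type: str       # e.g. a geometric or semantic error label
    decision: str         # e.g. "accept" or "re-segment"


def iou_estimation_error(audit: MaskAudit, pred: Mask, gt: Mask) -> float:
    """Absolute gap between the estimated IoU and the true IoU.

    The ground truth is only available when scoring the auditor offline,
    never to the auditor itself at inference time.
    """
    return abs(audit.estimated_iou - mask_iou(pred, gt))


candidate = [[1, 1, 0],
             [0, 0, 0]]
ground_truth = [[1, 0, 0],
                [1, 0, 0]]
audit = MaskAudit(estimated_iou=0.5, error_type="geometric", decision="re-segment")
gap = iou_estimation_error(audit, candidate, ground_truth)
```

Here the candidate and ground-truth masks share one pixel out of three in their union, so the true IoU is 1/3 and the hypothetical estimate of 0.5 is off by 1/6.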