Community researchers have developed a range of advanced audio-visual segmentation (AVS) models aimed at improving the quality of sounding objects' masks. Although the masks these models produce may initially appear plausible, they occasionally exhibit anomalies that reflect incorrect grounding logic. We attribute this to real-world inherent preferences and distributions, which offer a simpler learning signal than complex audio-visual grounding and lead the model to disregard important modality information. Such anomalous phenomena are generally complex and cannot be directly observed in a systematic way. In this study, we make a pioneering effort, using suitable synthetic data, to categorize and analyze these phenomena as two types, "audio priming bias" and "visual prior", according to the source of the anomalies. For audio priming bias, to enhance sensitivity to different audio intensities and semantics, an audio-specific perception module perceives the latent semantic information and incorporates it into a limited set of queries, namely active queries. Moreover, the interaction mechanism for these active queries in the transformer decoder is customized to regulate the interaction among audio semantics. For visual prior, we explore multiple contrastive training strategies that optimize the model by incorporating a biased branch, without changing the model's structure. Our experimental observations demonstrate the presence and impact of these biases in existing models. Finally, experimental evaluation on AVS benchmarks demonstrates the effectiveness of our methods in handling both types of bias, achieving competitive performance across all three subsets.