Audiovisual instance segmentation (AVIS) requires accurately localizing and tracking sounding objects throughout video sequences. Existing methods suffer from visual bias stemming from two fundamental issues: uniform additive fusion prevents queries from specializing to different sound sources, and visual-only training objectives allow queries to converge to arbitrary salient objects. We propose Audio-Centric Query Generation using cross-attention, enabling each query to selectively attend to a distinct sound source and carry sound-specific priors into visual decoding. Additionally, we introduce a Sound-Aware Ordinal Counting (SAOC) loss that explicitly supervises the number of sounding objects through ordinal regression with a monotonic consistency constraint, preventing visual-only convergence during training. Experiments on the AVISeg benchmark demonstrate consistent improvements: +1.64 mAP, +0.6 HOTA, and +2.06 FSLA, validating that query specialization and explicit counting supervision are crucial for accurate audiovisual instance segmentation.
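To make the query-generation idea concrete, the following is a minimal sketch (not the authors' implementation) of how learnable queries could cross-attend to audio features before visual decoding, assuming PyTorch; the class name, dimensions, and interface are all hypothetical.

```python
import torch
import torch.nn as nn


class AudioCentricQueryGenerator(nn.Module):
    """Hypothetical sketch: learnable queries cross-attend to audio tokens so
    each query can specialize to a distinct sound source, then carry that
    audio-conditioned prior into the visual decoder."""

    def __init__(self, num_queries: int = 20, d_model: int = 256, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)  # learnable query priors
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T_audio, d_model) per-frame audio embeddings
        B = audio_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)  # (B, N, d_model)
        # each query selectively attends over the audio tokens
        attended, _ = self.cross_attn(q, audio_feats, audio_feats)
        # residual + norm yields audio-conditioned queries for visual decoding
        return self.norm(q + attended)
```

In contrast to uniform additive fusion, where every query receives the same pooled audio vector, the attention weights here can differ per query, which is what allows specialization to separate sound sources.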
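The counting objective can likewise be illustrated with a small sketch, assuming a standard cumulative-link formulation of ordinal regression; the function name, the `K`-way logit layout, and the penalty weighting are hypothetical, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F


def ordinal_count_loss(logits: torch.Tensor, counts: torch.Tensor,
                       mono_weight: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch of a sound-aware ordinal counting loss.

    logits: (B, K), where sigmoid(logits[:, k]) estimates P(count > k).
    counts: (B,) ground-truth number of sounding objects, in 0..K.

    Ordinal targets encode the count as a unary code: target[b, k] = 1
    iff counts[b] > k. A penalty term discourages violations of the
    monotonic consistency P(count > 0) >= P(count > 1) >= ...
    """
    B, K = logits.shape
    ks = torch.arange(K, device=logits.device).unsqueeze(0)   # (1, K)
    targets = (counts.unsqueeze(1) > ks).float()              # (B, K)
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    # penalize any increase P(count > k+1) > P(count > k)
    mono = F.relu(probs[:, 1:] - probs[:, :-1]).mean()
    return bce + mono_weight * mono
```

Because the targets are cumulative rather than one-hot, a prediction that is off by one count is penalized less than one that is off by many, which is the usual motivation for ordinal regression over plain classification of the object count.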