When humans perceive the world, they naturally integrate multiple audio-visual tasks within dynamic, real-world scenes. However, existing tasks such as event localization, parsing, segmentation, and question answering are mostly studied in isolation, which makes it difficult to understand complex audio-visual scenes comprehensively and to explore inter-task relationships. Hence, we propose \textbf{AV-Unified}, a unified framework that enables joint learning across a wide range of audio-visual scene understanding tasks. AV-Unified standardizes the diverse input-output formats of these tasks and incorporates a multi-scale spatiotemporal perception network to capture audio-visual associations effectively. Specifically, we convert the inputs and outputs of all supported tasks into sequences of discrete tokens, establishing a shared representation that allows a single architecture to be trained jointly across heterogeneous datasets. Because audio-visual events vary in temporal granularity, we design a multi-scale temporal perception module to capture key cues. Meanwhile, to overcome the lack of auditory supervision in the visual domain, we design a cross-modal guidance-based spatial perception module that models spatial audio-visual associations. Furthermore, task-specific text prompts are employed to enhance the model's adaptability and task awareness. Extensive experiments on benchmark datasets (e.g., AVE, LLP, MUSIC-AVQA, VGG-SS and AVS) demonstrate the effectiveness of AV-Unified across temporal, spatial, and spatiotemporal tasks.
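To make the idea of a shared discrete-token interface concrete, the following is a minimal sketch of how heterogeneous targets could be serialized into one token format. It is not the tokenization used by AV-Unified, which the abstract does not specify. The bin count, the token spellings (`<t..>`, `<x..>`, `<y..>`, `<bos>`, `<sep>`, `<eos>`), the prompt wording, and all function names are assumptions introduced for illustration only.

```python
from typing import List, Tuple

NUM_BINS = 100  # assumed quantization resolution (hypothetical, not from the paper)


def quantize(value: float, max_value: float, num_bins: int = NUM_BINS) -> int:
    """Map a value in [0, max_value] to an integer bin in [0, num_bins - 1]."""
    ratio = min(max(value / max_value, 0.0), 1.0)  # clamp to [0, 1]
    return min(int(ratio * num_bins), num_bins - 1)


def segment_to_tokens(start: float, end: float, label: str, duration: float) -> List[str]:
    """Serialize a temporal event (e.g. a localization target) as time-bin tokens plus a label."""
    return [
        f"<t{quantize(start, duration)}>",
        f"<t{quantize(end, duration)}>",
        label,
    ]


def box_to_tokens(box: Tuple[float, float, float, float], width: float, height: float) -> List[str]:
    """Serialize a spatial target (x0, y0, x1, y1) as coordinate-bin tokens."""
    x0, y0, x1, y1 = box
    return [
        f"<x{quantize(x0, width)}>",
        f"<y{quantize(y0, height)}>",
        f"<x{quantize(x1, width)}>",
        f"<y{quantize(y1, height)}>",
    ]


def build_sequence(task_prompt: str, target_tokens: List[str]) -> List[str]:
    """Prefix the target with a task-specific text prompt so one model can tell tasks apart."""
    return ["<bos>", *task_prompt.split(), "<sep>", *target_tokens, "<eos>"]


# Hypothetical temporal example: an event covering the first half of a 10-second clip.
temporal_seq = build_sequence(
    "localize the audio-visual event",
    segment_to_tokens(0.0, 5.0, "dog_bark", duration=10.0),
)

# Hypothetical spatial example: a sounding-object box in a 224x224 frame.
spatial_seq = build_sequence(
    "locate the sounding object",
    box_to_tokens((0.0, 0.0, 112.0, 224.0), width=224.0, height=224.0),
)
```

Under these assumptions, a temporal target and a spatial target end up in the same vocabulary and the same sequence layout, so a single sequence model could in principle be trained on both; the text prompt in front of `<sep>` plays the role of the task-specific prompt mentioned above.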