Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED), and audio-visual event localization (AVEL). Existing methods over-specialize in each individual task, overlooking the fact that these instances often co-occur in the same video and together form its complete content. In this work, we present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of the TAL, SED, and AVEL tasks for the first time. UniAV can leverage the diverse data available in task-specific datasets, allowing the model to learn and share mutually beneficial knowledge across tasks and modalities. To tackle the challenges posed by substantial variations across datasets (size/domain/duration) and distinct task characteristics, we propose to uniformly encode the visual and audio modalities of all videos to derive generic representations, while also designing task-specific experts to capture the unique knowledge of each task. In addition, we develop a unified language-aware classifier built on a pre-trained text encoder, enabling the model to flexibly detect various types of instances, including previously unseen ones, by simply changing prompts during inference. UniAV outperforms its single-task counterparts by a large margin with fewer parameters, achieving on-par or superior performance compared to state-of-the-art task-specific methods on the ActivityNet 1.3, DESED, and UnAV-100 benchmarks.
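The language-aware classifier described above can be illustrated with a minimal sketch: per-snippet audio-visual features are compared against text embeddings of class prompts, so the label set is defined purely by the prompts supplied at inference time. The text encoder here is a hypothetical deterministic stand-in (a real system would use a pre-trained encoder such as CLIP's), and all feature values are random placeholders, not the actual UniAV pipeline.

```python
import hashlib
import numpy as np

EMBED_DIM = 512
rng = np.random.default_rng(0)

def encode_prompt(prompt: str) -> np.ndarray:
    """Hypothetical stand-in for a pre-trained text encoder: maps a prompt
    to a deterministic, unit-norm random embedding."""
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

# Class prompts can be swapped freely at inference time, so the same model
# can target new or previously unseen categories without retraining.
prompts = [
    "a video of a person playing basketball",   # visual action (TAL-style)
    "the sound of a dog barking",               # audio event (SED-style)
    "a person speaking on camera",              # audio-visual event (AVEL-style)
]
class_embeds = np.stack([encode_prompt(p) for p in prompts])    # (C, D)

# Placeholder per-snippet features from the unified audio-visual encoder.
snippet_feats = rng.standard_normal((8, EMBED_DIM))             # (T, D)
snippet_feats /= np.linalg.norm(snippet_feats, axis=1, keepdims=True)

# Language-aware classification: cosine similarity between each temporal
# snippet and each class prompt, squashed with a sigmoid so that multiple
# instance types may be active at the same time step.
logits = snippet_feats @ class_embeds.T                         # (T, C)
scores = 1.0 / (1.0 + np.exp(-logits))
print(scores.shape)
```

Because classification reduces to a similarity lookup against prompt embeddings, detecting a new instance type only requires appending one more prompt string, which is the flexibility the abstract refers to.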