Visual content and its accompanying audio signal naturally form a joint representation that improves audio-visual (AV) applications. While prior studies have developed various AV representation learning frameworks, the importance of AV data alignment for achieving high-quality representations is usually overlooked. We observe that an audio signal may contain background noise, and that audio and video streams may be out of synchronization. Such loose data alignment limits representation quality and degrades application performance. In this paper, we propose to improve AV joint representations from a data-centric perspective by aligning audio signals to visual data. The alignment is conducted in an agentic workflow controlled by an LLM-based assistant named AVAgent. For each input AV data pair, AVAgent uses a multimodal LLM to convert the audio and visual data into language descriptions separately (i.e., tool use). It then reasons about whether the pair is well aligned and, if needed, plans edits to the audio signal (i.e., planning). The audio editing is executed by predefined actions that filter noise or augment the data. Moreover, we use a VLM to evaluate how well the modified audio signal matches the visual content and provide feedback to AVAgent (i.e., reflection). The tool use, planning, and reflection steps operate cyclically, forming an agentic workflow in which audio signals are gradually aligned to the visual content. As a result, existing methods can directly leverage the AV data aligned by our agentic workflow to improve their joint representations. Experimental results comprehensively demonstrate the state-of-the-art performance of the proposed approach against previous baselines across diverse downstream tasks.
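To make the cyclic tool-use/planning/reflection control flow concrete, the following is a minimal sketch of such an alignment loop. The component interfaces (`describe`, `plan`, `edit`, `reflect`), the round limit, and the acceptance threshold are all assumptions introduced for illustration, not the paper's actual API; they are injected as callables so only the loop structure is asserted here.

```python
# Hypothetical sketch of the AVAgent-style alignment loop described above.
# All injected callables are placeholder stand-ins for the paper's components.

from typing import Callable, List, Optional, Tuple

def align_audio_to_video(
    audio,                                                   # raw audio signal
    video,                                                   # paired visual stream
    describe: Callable[[object], str],                       # multimodal LLM captioner (tool use)
    plan: Callable[[str, str, Optional[str]], List[str]],    # LLM planner -> list of edit actions
    edit: Callable[[object, str], object],                   # executes a predefined audio edit
    reflect: Callable[[object, object], Tuple[float, str]],  # VLM match score + feedback (reflection)
    max_rounds: int = 5,            # assumed cap on workflow iterations
    accept_threshold: float = 0.8,  # assumed VLM score needed to stop early
):
    """Iteratively edit the audio until the VLM judges it aligned with the video."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        # Tool use: convert each modality into a language description.
        audio_desc, video_desc = describe(audio), describe(video)

        # Planning: decide whether the pair is aligned; if not, propose
        # predefined actions such as noise filtering or data augmentation.
        actions = plan(audio_desc, video_desc, feedback)
        if not actions:  # planner judges the pair already well aligned
            break

        # Execution: apply each planned edit to the audio signal.
        for action in actions:
            audio = edit(audio, action)

        # Reflection: score how well the edited audio matches the visual
        # content; the textual feedback conditions the next planning round.
        score, feedback = reflect(audio, video)
        if score >= accept_threshold:
            break
    return audio
```

The dependency-injected design is only one plausible reading of the abstract: it keeps the loop itself runnable while leaving the LLM, VLM, and audio-editing backends as interchangeable components.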