In this paper, we address the text-to-audio grounding problem, namely, grounding the segments of the sound event described by a natural language query in untrimmed audio. This is a newly proposed and challenging audio-language task, since it requires not only precisely localizing all the onsets and offsets of the desired segments in the audio, but also comprehensively understanding both the acoustic and linguistic content and reasoning about the multimodal interactions between the audio and the query. To tackle these problems, existing methods often treat the query holistically as a single unit through a global query representation. We argue that this approach suffers from several limitations. Motivated by the above considerations, we propose a novel Cross-modal Graph Interaction (CGI) model, which comprehensively models the relations between the words in a query through a novel language graph. To capture the fine-grained interactions between the audio and the query, a cross-modal attention module is introduced to assign higher weights to the keywords with more important semantics and to generate snippet-specific query representations. Furthermore, we design a cross-gating module to emphasize the crucial parts and weaken the irrelevant ones in the audio and the query. We extensively evaluate the proposed CGI model on the public Audiogrounding dataset, where it achieves significant improvements over several state-of-the-art methods. The ablation study demonstrates the consistent effectiveness of the different modules in our model.
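The cross-modal attention and cross-gating steps described in the abstract can be illustrated with a minimal sketch. This is not the CGI implementation: the abstract gives no formulas, so the scaled dot-product scoring, the sigmoid gates without learned projections, and all function names (`cross_modal_attention`, `cross_gate`) are assumptions chosen for illustration. The language graph is left out because the abstract does not specify its construction.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_modal_attention(snippets, words):
    """For each audio snippet, weight the query words and pool them.

    Assumption: scores are scaled dot products between snippet and word
    features that already live in a shared space of the same dimension.
    Returns (per-snippet word weights, snippet-specific query vectors).
    """
    dim = len(words[0])
    weights, queries = [], []
    for snippet in snippets:
        scores = [dot(snippet, w) / math.sqrt(dim) for w in words]
        alpha = softmax(scores)  # higher weight = more relevant word
        query = [sum(a * w[k] for a, w in zip(alpha, words)) for k in range(dim)]
        weights.append(alpha)
        queries.append(query)
    return weights, queries


def cross_gate(audio_vec, query_vec):
    """Gate each modality by the other (assumed form, no learned weights)."""
    gated_audio = [a * sigmoid(q) for a, q in zip(audio_vec, query_vec)]
    gated_query = [q * sigmoid(a) for a, q in zip(audio_vec, query_vec)]
    return gated_audio, gated_query


# Toy features: 2 audio snippets and 3 query words, all 2-dimensional.
snippets = [[1.0, 0.0], [0.0, 1.0]]
words = [[2.0, 0.0], [0.0, 2.0], [0.1, 0.1]]

weights, queries = cross_modal_attention(snippets, words)
gated = [cross_gate(s, q) for s, q in zip(snippets, queries)]
```

In this toy setup each snippet puts its largest weight on the word whose feature vector points in the same direction, which is the behaviour the abstract attributes to the attention module. A trained model would presumably learn projections for both the scores and the gates.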