We propose Audio Difference Captioning (ADC), a new extension of audio captioning that describes the semantic differences between a pair of similar but slightly different audio clips. ADC addresses a limitation of conventional audio captioning, which sometimes generates similar captions for similar audio clips and thus fails to describe how their content differs. We also propose a cross-attention-concentrated transformer encoder, which extracts differences by comparing a pair of audio clips, and a similarity-discrepancy disentanglement, which emphasizes the difference in the latent space. To evaluate the proposed methods, we built the AudioDiffCaps dataset, which consists of pairs of similar but slightly different audio clips with human-annotated descriptions of their differences. Experiments on the AudioDiffCaps dataset showed that the proposed methods solve the ADC task effectively, and visualization of the attention weights in the transformer encoder showed that the methods improve the attention weights used to extract the difference.
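As a rough illustration of how an encoder might concentrate attention on comparing the two clips, the sketch below builds an attention mask for a concatenated pair of frame sequences so that each frame may attend only to frames of the other clip. This is one possible reading of "cross-attention-concentrated", not the authors' implementation; the function names `cross_only_mask` and `masked_softmax` and the plain-list representation are assumptions made for the example.

```python
import math


def cross_only_mask(len_a, len_b):
    """Boolean mask over the concatenation [clip A frames, clip B frames].

    mask[q][k] is True when query position q may attend to key position k.
    Assumption for this sketch: attention is allowed only across clips.
    """
    total = len_a + len_b
    in_a = [i < len_a for i in range(total)]
    return [[in_a[q] != in_a[k] for k in range(total)] for q in range(total)]


def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, zeroing disallowed keys.

    Requires at least one allowed key, which holds when both clips are non-empty.
    """
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Clip A has 2 frames, clip B has 3 frames (toy sizes).
mask = cross_only_mask(2, 3)
# Attention weights for the first frame of clip A: only clip B keys get weight.
weights = masked_softmax([1.0, 2.0, 3.0, 4.0, 5.0], mask[0])
```

With such a mask, no frame can attend within its own clip, so every attended value comes from the other clip of the pair.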