This study introduces audio difference learning, a novel training paradigm for improving audio captioning. The core idea is to build a feature representation space that preserves the relationships between audio clips, enabling the generation of captions that describe fine-grained audio content. The method pairs the input audio with a reference audio and transforms both into feature representations with a shared encoder. A caption describing how the two clips differ is then generated from the differential features of the two representations. Furthermore, a technique is proposed in which the input audio is mixed with additional audio and the additional audio serves as the reference. The difference between the mixed audio and the reference audio then corresponds to the original input audio, so the original input's caption can be used as the caption for the difference, eliminating the need for additional difference annotations. In experiments using the Clotho and ESC50 datasets, the proposed method improved the SPIDEr score by 7% compared to conventional methods.
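The mixing technique can be illustrated with a minimal sketch. It rests on assumptions that are not part of the abstract: the encoder is a toy random linear projection (`make_linear_encoder` is a hypothetical helper), mixing is plain sample-wise addition, and the differential feature is taken to be a subtraction of the two encoded vectors. With a linear encoder the difference equals the encoding of the original input exactly; a real nonlinear encoder would only approximate this after training.

```python
import numpy as np


def make_linear_encoder(n_samples, n_features, rng):
    """Return a toy shared encoder: a fixed random linear projection (assumption)."""
    weights = rng.standard_normal((n_features, n_samples))

    def encode(waveform):
        return weights @ waveform

    return encode


def difference_features(encode, mixed, reference):
    """Encode both clips with the same encoder and subtract the features."""
    return encode(mixed) - encode(reference)


rng = np.random.default_rng(0)
n_samples = 1600

# Stand-ins for real waveforms; random noise is used purely for illustration.
input_audio = rng.standard_normal(n_samples)
additional_audio = rng.standard_normal(n_samples)

# Mix the input with the additional audio; the additional audio is the reference.
mixed_audio = input_audio + additional_audio

encode = make_linear_encoder(n_samples, n_features=8, rng=rng)
diff = difference_features(encode, mixed_audio, additional_audio)
target = encode(input_audio)

# The caption of the original input can label the difference, so no new
# annotation is needed for this training pair.
training_pair = (diff, "caption of the original input audio")
```

In this toy setting `diff` and `target` agree up to floating-point error, which is the property that lets the existing caption be reused as the target for the difference.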