Towards Generating Diverse Audio Captions via Adversarial Training

Automated audio captioning is a cross-modal translation task that describes the content of audio clips with natural language sentences. The task has attracted increasing attention, and substantial progress has been made in recent years. Captions generated by existing models are generally faithful to the content of audio clips; however, these machine-generated captions are often deterministic (e.g., generating a fixed caption for a given audio clip), simple (e.g., using common words and simple grammar), and generic (e.g., generating the same caption for similar audio clips). When people are asked to describe the content of an audio clip, different people tend to focus on different sound events and describe the clip from different aspects, using distinct words and grammar. We believe that an audio captioning system should be able to generate diverse captions, either for a fixed audio clip or across similar audio clips. To this end, we propose an adversarial training framework based on a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems. A caption generator and two hybrid discriminators compete and are trained jointly, where the caption generator can be any standard encoder-decoder captioning model, and the hybrid discriminators assess the generated captions against different criteria, such as their naturalness and semantics. We conduct experiments on the Clotho dataset. The results show that our proposed model generates more diverse captions than state-of-the-art methods.
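To make the notion of "generic" versus "diverse" captions concrete, the sketch below computes a distinct-n ratio (unique n-grams divided by total n-grams) over a set of captions. This is only an illustration of one common way to quantify lexical diversity; the abstract does not state which diversity metrics are used, and the function name `distinct_n` and the example captions are assumptions made for this sketch.

```python
from typing import List


def distinct_n(captions: List[str], n: int = 1) -> float:
    """Ratio of unique n-grams to total n-grams across a set of captions.

    Hypothetical helper for illustration; not taken from the paper.
    Values near 1.0 mean little repetition, values near 0.0 mean the
    captions reuse the same n-grams over and over.
    """
    ngrams = []
    for caption in captions:
        # Simple whitespace tokenisation on lower-cased text (an assumption).
        tokens = caption.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Made-up captions: one set repeats itself, the other does not.
generic = ["a dog is barking", "a dog is barking", "a dog is barking"]
diverse = [
    "a dog is barking",
    "birds chirp while wind blows",
    "someone walks on gravel",
]

print(distinct_n(generic, n=1))
print(distinct_n(diverse, n=1))
```

A system that emits the same caption for similar clips scores low on this ratio, while one that varies its wording scores higher. A ratio like this captures only lexical variety, so it says nothing about whether the captions remain faithful to the audio.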
