Audio analytics is the broad area concerned with analyzing, processing, and extracting meaningful information from the sounds around us. Audio captioning, a recent addition to this domain, is a cross-modal translation task that generates natural-language descriptions of the sound events occurring in an audio stream. In this work, we identify and address three main challenges in automated audio captioning: i) data scarcity, ii) imbalance or limitations in the caption vocabulary, and iii) the choice of a performance evaluation metric that best captures both auditory and semantic characteristics. We find that commonly adopted loss functions can result in an unfair vocabulary imbalance during model training. We propose two augmentation methods for audio captioning that enrich the training dataset and increase the vocabulary size. We further underline the need for in-domain pretraining by exploring the suitability of audio encoders previously trained on different audio tasks. Finally, we systematically explore five performance metrics borrowed from the image captioning domain and highlight their limitations for the audio domain.