In this paper, we investigate the impact of data augmentation on audio-language multi-modal learning, which has remained unexplored despite its importance. We explore various augmentation methods at both train time and test time and find that proper data augmentation can lead to substantial improvements. Specifically, our proposed audio-language paired augmentation, PairMix, which is the first multi-modal audio-language augmentation method, outperforms the baselines on both automated audio captioning and audio-text retrieval. To take full advantage of data augmentation, we also present multi-level test-time augmentation (Multi-TTA). Combining the two proposed methods with uni-modal augmentations, we achieve 47.5 SPIDEr on audio captioning, an 18.2% relative increase over the baseline. The proposed methods also improve performance on audio-text retrieval.