Multi-modal learning in the audio-language domain has advanced significantly in recent years. However, audio-language learning is held back by data that is scarcer and of lower quality than what is available for image-language tasks. Existing audio-language datasets are notably smaller, and manual labeling is slow because an annotator must listen to an entire audio clip to label it accurately. Our method systematically generates audio-caption pairs by augmenting audio clips with audio signal processing operations and pairing each operation with a corresponding natural language label. Using a Large Language Model and a prompt template, we generate descriptions of the augmented clips. This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio models. Integrating our dataset improves model performance on benchmarks by providing more diverse and better-aligned examples. Notably, our dataset addresses the absence of modifiers (adjectives and adverbs) in existing datasets. By enabling models to learn these concepts and by generating hard negative examples during training, we achieve state-of-the-art performance on multiple benchmarks.
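The pipeline described above (augment a clip with a signal processing operation, attach the matching modifier, fill a prompt template for the Large Language Model, and derive a hard negative) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract does not list the actual operations, modifier vocabulary, or prompt wording, so `change_gain`, `change_speed`, `AUGMENTATIONS`, `PROMPT_TEMPLATE`, and `make_example` are hypothetical, and building the hard negative by swapping in the opposite modifier is one plausible reading rather than the confirmed procedure.

```python
import numpy as np


def change_gain(wave, factor):
    """Scale amplitude by `factor`, clipping to the [-1, 1] range."""
    return np.clip(wave * factor, -1.0, 1.0)


def change_speed(wave, rate):
    """Naive speed change by linear-interpolation resampling.

    rate > 1 shortens the clip (faster), rate < 1 lengthens it (slower).
    This also shifts pitch; a real pipeline would likely use a proper
    time-stretching routine.
    """
    n_out = int(round(len(wave) / rate))
    positions = np.linspace(0, len(wave) - 1, num=n_out)
    return np.interp(positions, np.arange(len(wave)), wave)


# Hypothetical table: modifier -> (signal operation, opposite modifier).
AUGMENTATIONS = {
    "louder": (lambda w: change_gain(w, 2.0), "quieter"),
    "quieter": (lambda w: change_gain(w, 0.5), "louder"),
    "faster": (lambda w: change_speed(w, 2.0), "slower"),
    "slower": (lambda w: change_speed(w, 0.5), "faster"),
}

# Hypothetical prompt template; the wording used for AudioSetMix is not
# given in the abstract.
PROMPT_TEMPLATE = (
    "Write one sentence describing an audio clip of '{label}' "
    "that has been made {modifier}."
)


def make_example(wave, label, modifier):
    """Return the augmented audio, an LLM prompt, and a hard-negative prompt."""
    operation, opposite = AUGMENTATIONS[modifier]
    return {
        "audio": operation(wave),
        "prompt": PROMPT_TEMPLATE.format(label=label, modifier=modifier),
        # Same label, opposite modifier: differs from the positive only in
        # the concept the model is supposed to learn.
        "hard_negative_prompt": PROMPT_TEMPLATE.format(
            label=label, modifier=opposite
        ),
    }


if __name__ == "__main__":
    # One second of a 440 Hz tone at 16 kHz stands in for a real clip.
    t = np.arange(16000) / 16000.0
    clip = 0.4 * np.sin(2 * np.pi * 440.0 * t)
    example = make_example(clip, "dog barking", "faster")
    print(len(example["audio"]))
    print(example["prompt"])
    print(example["hard_negative_prompt"])
```

In this sketch the prompt strings would be sent to a Large Language Model to obtain captions; that call is left out because the abstract does not say which model or API is used.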