AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations

Multi-modal learning in the audio-language domain has advanced significantly in recent years. However, audio-language learning is held back by data that is scarcer and of lower quality than the data available for image-language tasks. Existing audio-language datasets are notably smaller, and manual labeling is slow because an annotator must listen to an entire audio clip to label it accurately. Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and the corresponding audio signal processing operations. Using a Large Language Model and a prompt template, we generate descriptions of the augmented audio clips. This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio models. Integrating our dataset improves model performance on benchmarks by providing more diverse and better-aligned examples. Notably, our dataset addresses the absence of modifiers (adjectives and adverbs) in existing datasets. By enabling models to learn these concepts, and by generating hard negative examples during training, we achieve state-of-the-art performance on multiple benchmarks.
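The pipeline described in the abstract (apply a signal processing operation to labeled clips, then fill a prompt template that an LLM turns into a caption) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not list the operations or the template, so volume scaling, overlay, and concatenation are stand-ins, and the names `augment_pair`, `build_prompt`, and `VOLUME_WORDS` are hypothetical. The LLM call itself is left out; the sketch stops at the prompt string.

```python
import math

def tone(freq_hz, n_samples, sample_rate=16000):
    """Synthesize a test tone as a list of float samples in [-1, 1]."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]

def clip(x):
    """Clamp one sample to the valid range [-1, 1]."""
    return max(-1.0, min(1.0, x))

def scale_volume(samples, gain):
    """Multiply every sample by gain, clamping the result."""
    return [clip(s * gain) for s in samples]

def mix(a, b):
    """Overlay two clips; the shorter one is zero-padded at the end."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [clip(x + y) for x, y in zip(a, b)]

def concatenate(a, b):
    """Play clip a, then clip b."""
    return a + b

# Hypothetical mapping from an operation parameter to a modifier word.
VOLUME_WORDS = {0.5: "quietly", 2.0: "loudly"}
RELATION_WORDS = {"mix": "at the same time as", "concat": "followed by"}

def build_prompt(label_a, label_b, gain, op):
    """Fill an assumed prompt template with the labels and operations."""
    modifier = VOLUME_WORDS.get(gain, "at normal volume")
    return (
        "Write one natural sentence describing this audio clip. "
        f"Sound 1: {label_a}, heard {modifier}. Sound 2: {label_b}. "
        f"Sound 1 is {RELATION_WORDS[op]} sound 2."
    )

def augment_pair(clip_a, label_a, clip_b, label_b, gain, op):
    """Return (augmented audio, prompt text for the captioning LLM)."""
    a = scale_volume(clip_a, gain)
    audio = mix(a, clip_b) if op == "mix" else concatenate(a, clip_b)
    return audio, build_prompt(label_a, label_b, gain, op)

audio, prompt = augment_pair(
    tone(440, 100), "dog barking", tone(220, 150), "rain", 0.5, "mix"
)
print(len(audio))
print(prompt)
```

Because the modifier word is derived from the operation parameter, swapping it for its opposite (for example "loudly" for "quietly") gives a caption that no longer matches the audio, which is one plausible way to obtain the hard negative examples the abstract mentions.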

