We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy when labeled data is limited. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity of real-world audio. To address this shortcoming, we propose to augment the dataset with synthetic audio generated by text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging: the generated data must be acoustically consistent with the underlying small-scale dataset, and it must also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization, which ensures that the acoustic characteristics of the generated data remain consistent with that dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of large language models (LLMs) to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate that our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly captioned AudioSet.
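For readers unfamiliar with the traditional augmentations that the abstract contrasts against, the following is a minimal sketch of the two transformations it names: adding random noise and masking a segment. It is not taken from the Synthio paper. The function names, the SNR parameterization, and the `max_fraction` parameter are illustrative assumptions, and the code uses plain Python lists from the standard library rather than an audio library.

```python
import math
import random


def add_gaussian_noise(wave, snr_db, rng):
    """Return a copy of `wave` with white Gaussian noise at roughly `snr_db` dB SNR.

    Assumption: noise level is set from the clip's own average power.
    """
    signal_power = sum(x * x for x in wave) / len(wave)
    noise_std = math.sqrt(signal_power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, noise_std) for x in wave]


def mask_segment(wave, max_fraction, rng):
    """Return a copy of `wave` with one random contiguous segment set to zero.

    The segment covers at most `max_fraction` (between 0 and 1) of the clip.
    """
    n = len(wave)
    length = rng.randint(0, int(n * max_fraction))  # inclusive bounds
    start = rng.randint(0, n - length)
    return wave[:start] + [0.0] * length + wave[start + length:]


# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
rng = random.Random(0)
sample_rate = 16000
clip = [math.sin(2 * math.pi * 440.0 * i / sample_rate) for i in range(sample_rate)]

noisy = add_gaussian_noise(clip, snr_db=20.0, rng=rng)
masked = mask_segment(clip, max_fraction=0.2, rng=rng)
```

Both functions only perturb an existing recording, so every augmented clip contains the same underlying sound event as its source. This is the limitation the abstract points to when it motivates generating new audio instead.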