Audio-language models (ALMs) process sounds to produce linguistic descriptions of sound-producing events and scenes. Recent advances in computing power and dataset creation have led to significant progress in this domain. This paper surveys existing datasets used for training audio-language models, emphasizing the recent trend toward large, diverse datasets as a means of improving model performance. Key sources of these datasets include the Freesound platform and AudioSet, both of which have contributed to the field's rapid growth. Whereas prior surveys primarily address techniques and training details, this survey categorizes and evaluates a wide array of datasets, covering their origins, characteristics, and use cases. It also performs a data leak analysis across datasets to help ensure dataset integrity and mitigate bias between them. The survey covers research papers published up to and including December 2023 and does not include later work.