Audio event detection is a widely studied audio processing task, with applications ranging from self-driving cars to healthcare. In-the-wild datasets such as AudioSet have propelled research in this field. However, such efforts typically involve manual annotation and verification, which are expensive to perform at scale. Movies depict a variety of real-life and fictional scenarios, which makes them a rich resource for mining a wide range of audio events. In this work, we present a dataset of audio events called Subtitle-Aligned Movie Sounds (SAM-S). We use publicly available closed-caption transcripts to automatically mine over 110K audio events from 430 movies. We identify three dimensions for categorizing audio events (sound, source, and quality) and present the steps involved in producing a final taxonomy of 245 sounds. We discuss the choices involved in generating the taxonomy and highlight the human-centered nature of the sounds in our dataset. We establish a baseline performance of 34.76% mean average precision for audio-only sound classification and show that incorporating visual information can further improve performance by about 5%. Data and code are made available for research at https://github.com/usc-sail/mica-subtitle-aligned-movie-sounds
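To illustrate the idea of mining audio events from closed captions, the following is a minimal sketch and not the pipeline used to build SAM-S. It assumes captions in SubRip (.srt) format and assumes that sound descriptions appear in square brackets or parentheses (e.g. "[door slams]"); the names `SoundEvent` and `mine_sound_events` are hypothetical and introduced only for this example.

```python
import re
from dataclasses import dataclass

# Assumption: SubRip-style timestamps such as "00:00:01,000 --> 00:00:03,500".
TIMESTAMP = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*"
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})"
)
# Assumption: sound descriptions are enclosed in [...] or (...).
SOUND_TAG = re.compile(r"[\[(]([^\[\]()]+)[\])]")


@dataclass
class SoundEvent:
    start: float       # cue start time in seconds
    end: float         # cue end time in seconds
    description: str   # lowercased text found inside the brackets


def _seconds(h, m, s, ms):
    """Convert timestamp fields (strings) to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def mine_sound_events(srt_text):
    """Return one SoundEvent per bracketed description in the captions."""
    events = []
    # Caption cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        # Locate the timestamp line; skip blocks that have none.
        for i, line in enumerate(lines):
            match = TIMESTAMP.search(line)
            if match:
                break
        else:
            continue
        fields = match.groups()
        start, end = _seconds(*fields[:4]), _seconds(*fields[4:])
        # Everything after the timestamp line is caption text.
        text = " ".join(lines[i + 1:])
        for tag in SOUND_TAG.findall(text):
            events.append(SoundEvent(start, end, tag.strip().lower()))
    return events
```

Each mined description inherits the time span of its caption cue, so the labels are only weakly aligned to the audio. Mapping the free-text descriptions onto a fixed taxonomy is a separate step that this sketch does not attempt.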