AudioSet is one of the largest and most widely used datasets in audio tagging, containing about 2 million audio samples that are manually labeled with 527 event categories organized into an ontology. However, the annotations contain inconsistencies: in particular, categories that should be labeled as positive according to the ontology are frequently mislabeled as negative. To address this issue, we apply Hierarchical Label Propagation (HLP), which propagates labels up the ontology hierarchy. HLP increases the mean number of positive labels per audio clip from 1.98 to 2.39 and affects 109 of the 527 classes. Our results demonstrate that HLP provides performance benefits across various model architectures, including convolutional neural networks (PANNs' CNN6 and ConvNeXT) and transformers (PaSST), with smaller models showing larger improvements. Finally, on FSD50K, another widely used dataset, models trained on AudioSet with HLP consistently outperform those trained without HLP. Our source code will be made available on GitHub.
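As an illustration of propagating labels up an ontology hierarchy, the following is a minimal sketch and not the authors' implementation. The toy `PARENTS` mapping and the function name `propagate_labels` are hypothetical; the real AudioSet ontology is larger, and the sketch assumes every ancestor belongs to the set of trainable classes.

```python
from collections import deque

# Hypothetical toy ontology mapping each class to its direct parents.
# This is an illustrative assumption, not the actual AudioSet ontology.
PARENTS = {
    "Acoustic guitar": ["Guitar"],
    "Guitar": ["Musical instrument"],
    "Musical instrument": ["Music"],
    "Bark": ["Dog"],
    "Dog": ["Animal"],
}


def propagate_labels(positives, parents):
    """Return the given positive labels plus all of their ancestors."""
    result = set(positives)
    queue = deque(positives)
    while queue:
        label = queue.popleft()
        # Classes without an entry in `parents` are treated as roots.
        for parent in parents.get(label, []):
            if parent not in result:
                result.add(parent)
                queue.append(parent)
    return result


# A clip annotated only with two leaf classes gains every ancestor class.
clip_labels = {"Acoustic guitar", "Bark"}
propagated = propagate_labels(clip_labels, PARENTS)
print(sorted(propagated))
```

In this sketch a clip labeled only with leaf classes also becomes positive for each ancestor, which is the mechanism by which the mean number of positive labels per clip can rise after propagation.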