Robust Audio Tagging under Class-wise Supervision Unreliability

Weakly labeled datasets such as AudioSet have driven recent progress in audio tagging, but annotation quality varies across sound classes. Labels may be incomplete, ambiguous, or unreliable, which introduces class-dependent supervision bias during optimisation. The problem grows as real and generated audio are increasingly mixed in training data, and generated samples do not always match their intended semantic labels. Prior work has mainly addressed unreliable supervision caused by missing positive labels; this paper targets three other sources: spurious additions, misassignments between similar classes, and weakened label evidence. Most existing methods do not explicitly model the class-dependent optimisation bias that these effects introduce. To bridge this gap, the paper proposes a Class-wise Supervision Unreliability (CSU) framework that controls supervision strength at the class level during training. CSU learns a separate unreliability parameter for each class and down-weights less reliable supervision without changing the model architecture or the inference process. To support evaluation, the paper also introduces ESC-FreeGen50, a manually verified benchmark of 50 sound classes that combines real and generated audio. Experiments on controlled benchmarks and AudioSet show that CSU improves robustness across different architectures and different sources of supervision unreliability. The results indicate that explicit class-wise modeling of supervision unreliability is an effective and practical strategy for robust audio tagging under large-scale weakly labeled training. Code and data are available at: https://github.com/Yuanbo2020/CSU
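
The abstract does not give the CSU loss, so the following is only an illustrative sketch of what per-class down-weighting of supervision could look like, not the authors' formulation. It assumes a multi-label binary cross-entropy objective, a hypothetical raw parameter u_c per class mapped to a weight w_c = sigmoid(-u_c), and a hypothetical -log(w_c) penalty that keeps the weights from collapsing to zero. All function and parameter names are invented for this example.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability and one 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def classwise_weighted_loss(probs, labels, unreliability, penalty=0.1):
    """Class-weighted multi-label loss (illustrative assumption, not the paper's loss).

    probs, labels: one list per clip, each holding one value per class.
    unreliability: one raw parameter u_c per class; a larger u_c means the
        labels of class c are trusted less.
    penalty: strength of the -log(w_c) term that discourages every weight
        from shrinking to zero.
    """
    num_clips = len(probs)
    num_classes = len(unreliability)
    # Supervision weight per class: w_c = sigmoid(-u_c), always in (0, 1).
    weights = [sigmoid(-u) for u in unreliability]

    total = 0.0
    for clip_probs, clip_labels in zip(probs, labels):
        for c in range(num_classes):
            total += weights[c] * bce(clip_probs[c], clip_labels[c])
    data_term = total / (num_clips * num_classes)

    # Without this term, the trivial solution w_c -> 0 would minimise the loss.
    reg_term = penalty * sum(-math.log(w) for w in weights) / num_classes
    return data_term + reg_term, weights


# Two clips, two classes; class 1 is treated as less reliable than class 0.
example_probs = [[0.9, 0.6], [0.2, 0.7]]
example_labels = [[1, 1], [0, 0]]
example_loss, example_weights = classwise_weighted_loss(
    example_probs, example_labels, unreliability=[0.0, 2.0]
)
print(example_loss, example_weights)
```

In a real training loop the per-class parameters would be optimised jointly with the network by an autodiff framework. Because the weights only enter the training loss, inference is unchanged, which is the property the abstract attributes to CSU.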
