The enormous demand for annotated data created by deep learning techniques has been accompanied by the problem of annotation noise. Although this issue has been widely discussed in the machine learning literature, it remains relatively unexplored in the context of multi-label classification (MLC) tasks, which feature more complicated kinds of noise. Additionally, when the domain in question has logical constraints, noisy annotations often exacerbate violations of those constraints, making the resulting system unacceptable to a domain expert. This paper studies the effect of label noise on domain rule violations in the MLC task and incorporates domain rules into the learning algorithm to mitigate the effect of noise. We propose the Domain Obedient Self-supervised Training (DOST) paradigm, which not only aligns deep learning models more closely with domain rules but also improves learning performance on key metrics and minimizes the effect of annotation noise. This novel approach uses domain guidance to detect offending annotations and deter rule-violating predictions in a self-supervised manner, making it more data-efficient and domain-compliant. Empirical studies on two large-scale multi-label classification datasets demonstrate that our method yields improvements across the board and often entirely counteracts the effect of noise.
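To make the idea of detecting offending annotations and deterring rule-violating predictions concrete, the following is a minimal sketch and not the DOST implementation. It assumes, purely for illustration, that each domain rule is an implication between two label indices ("if label a is present, label b must be present"). The rule format, the function names `rule_violations` and `violation_penalty`, and the toy data are all hypothetical.

```python
import numpy as np

# Hypothetical rule format (an assumption, not taken from the paper):
# a pair (a, b) means "label a present implies label b present".


def rule_violations(labels, rules):
    """Return a (n_samples, n_rules) boolean array, True where a rule is violated.

    A rule (a, b) is violated by a sample when label a is 1 and label b is 0.
    Assumes `rules` is non-empty.
    """
    labels = np.asarray(labels, dtype=bool)
    cols = [labels[:, a] & ~labels[:, b] for a, b in rules]
    return np.stack(cols, axis=1)


def violation_penalty(probs, rules):
    """Soft violation score per sample for predicted probabilities.

    For each rule (a, b) the term p(a) * (1 - p(b)) is large when the model
    predicts a without b. The terms are averaged over the rules.
    """
    probs = np.asarray(probs, dtype=float)
    terms = [probs[:, a] * (1.0 - probs[:, b]) for a, b in rules]
    return np.stack(terms, axis=1).mean(axis=1)


# Toy annotations with three labels and one rule: label 0 implies label 1.
annotations = np.array([
    [1, 1, 0],  # consistent with the rule
    [1, 0, 0],  # violates the rule: label 0 without label 1
    [0, 0, 1],  # rule does not apply
])
rules = [(0, 1)]

# Samples whose annotations break at least one rule.
flagged = rule_violations(annotations, rules).any(axis=1)

# Soft penalty on two hypothetical model outputs.
predictions = np.array([
    [0.9, 0.8, 0.1],
    [0.9, 0.1, 0.2],
])
penalty = violation_penalty(predictions, rules)
```

In this sketch, `flagged` marks annotations that could be treated as suspect, and `penalty` is the kind of term that could be added to a training loss to discourage rule-violating predictions. How the method described above actually weights or uses such signals is not specified in this abstract.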