High-quality data is crucial for the success of machine learning, but labeling large datasets is often time-consuming and costly. Semi-supervised learning can reduce the need for labeled data, but label quality remains an open issue because of ambiguity and disagreement among annotators. We therefore use proposal-guided annotations as one option that leads to more consistency between annotators. However, proposing a label increases the probability that annotators decide in favor of that specific label. This introduces a bias, which we can simulate and remove. We propose CleverLabel, a new method for Cost-effective LabEling using Validated proposal-guidEd annotations and Repaired LABELs. CleverLabel can reduce labeling costs by up to 30.0% while achieving a relative improvement in Kullback-Leibler divergence of up to 29.8% compared to the previous state of the art on a multi-domain real-world image classification benchmark. CleverLabel thus offers a novel solution to the challenge of efficiently labeling large datasets while also improving label quality.
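To make the notions of a simulated proposal bias and of a Kullback-Leibler comparison concrete, the following minimal sketch may help. It is a toy illustration under stated assumptions, not the CleverLabel implementation: the linear bias model, the offset `delta`, and all function names are hypothetical and do not come from the text above.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    # Clip to avoid log(0), then renormalize so both inputs sum to one.
    p = [max(x, eps) for x in p]
    q = [max(x, eps) for x in q]
    p_sum, q_sum = sum(p), sum(q)
    return sum(
        (a / p_sum) * math.log((a / p_sum) / (b / q_sum))
        for a, b in zip(p, q)
    )


def simulate_proposal_bias(true_dist, proposed, delta):
    """ASSUMED toy model: a fraction `delta` of annotators simply accept
    the proposed class, the rest label according to `true_dist`."""
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must be in [0, 1)")
    return [
        (1.0 - delta) * p + (delta if k == proposed else 0.0)
        for k, p in enumerate(true_dist)
    ]


def remove_proposal_bias(biased_dist, proposed, delta):
    """Invert the assumed toy model, clip negatives, and renormalize."""
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must be in [0, 1)")
    raw = [
        max((b - (delta if k == proposed else 0.0)) / (1.0 - delta), 0.0)
        for k, b in enumerate(biased_dist)
    ]
    total = sum(raw)
    return [r / total for r in raw]


def relative_improvement(baseline_kl, new_kl):
    """Relative reduction of KL divergence with respect to a baseline."""
    return (baseline_kl - new_kl) / baseline_kl
```

Under this assumed model, a soft label such as `[0.6, 0.3, 0.1]` with class 1 proposed and `delta = 0.2` is shifted toward the proposal, and inverting the model recovers the original distribution. In practice the offset would have to be estimated from data, which this sketch does not attempt.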