Creating large-scale, high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which the data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.
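To make the thresholding step concrete, here is a minimal Python sketch of the idea described in the abstract: use human-labeled validation points to pick a confidence threshold, then machine-label only the unlabeled points at or above it. The function names (`find_threshold`, `auto_label`), the `max_error` parameter, and the toy data are all illustrative assumptions, not the paper's algorithm. In particular, the sketch uses the raw empirical validation error in a single round and applies no correction for finite-sample estimation error.

```python
from typing import List, Optional, Sequence


def find_threshold(val_conf: Sequence[float],
                   val_correct: Sequence[bool],
                   max_error: float) -> Optional[float]:
    """Return the smallest confidence threshold t such that the empirical
    error on validation points with confidence >= t is at most max_error.
    Returns None if no threshold qualifies. (Hypothetical helper.)"""
    # Visit validation points from most to least confident.
    order = sorted(range(len(val_conf)), key=lambda i: val_conf[i], reverse=True)
    best = None
    mistakes = 0
    for rank, i in enumerate(order, start=1):
        mistakes += 0 if val_correct[i] else 1
        # A threshold admits every point tied at that confidence, so only
        # evaluate the error once the whole run of ties has been counted.
        last_of_tie = rank == len(order) or val_conf[order[rank]] != val_conf[i]
        if last_of_tie and mistakes / rank <= max_error:
            best = val_conf[i]
    return best


def auto_label(confs: Sequence[float],
               preds: Sequence[str],
               threshold: Optional[float]) -> List[Optional[str]]:
    """Keep the model's prediction where confidence >= threshold; return
    None for points that should go to human annotators instead."""
    if threshold is None:
        return [None] * len(preds)
    return [p if c >= threshold else None for c, p in zip(confs, preds)]


# Toy validation set: the model is right on only 4 of 6 points overall,
# but all of its mistakes fall in the low-confidence region.
val_conf = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50]
val_correct = [True, True, True, False, True, False]

threshold = find_threshold(val_conf, val_correct, max_error=0.1)

# Toy unlabeled pool with model confidences and predicted labels.
pool_conf = [0.99, 0.85, 0.75, 0.40]
pool_pred = ["cat", "dog", "cat", "dog"]
labels = auto_label(pool_conf, pool_pred, threshold)
```

In this toy run the model's overall validation accuracy is modest, yet its three most confident validation points are all correct, so the top of the confidence range can still be machine-labeled. With only a handful of validation points the empirical error above the threshold is a noisy estimate, which is why the amount of human-labeled validation data matters for any guarantee on the machine-labeled output.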