Annotated datasets are an essential ingredient for training, evaluating, comparing, and productionizing supervised machine learning models. It is therefore imperative that annotations are of high quality. Creating them requires good quality management and thereby reliable quality estimates. If quality turns out to be insufficient during the annotation process, rectifying measures can then be taken to improve it. Quality estimation is often performed by having experts manually label instances as correct or incorrect, but checking all annotated instances tends to be expensive. Therefore, in practice, usually only subsets are inspected; sample sizes are mostly chosen without justification or regard to statistical power, and more often than not are relatively small. Basing estimates on small sample sizes, however, can lead to imprecise values for the error rate, while using unnecessarily large sample sizes costs money that could be better spent, for instance, on more annotations. Therefore, we first describe in detail how to use confidence intervals for finding the minimal sample size needed to estimate the annotation error rate. Then, we propose applying acceptance sampling as an alternative to error rate estimation. We show that acceptance sampling can reduce the required sample sizes by up to 50% while providing the same statistical guarantees.