Today, ground-truth generation relies on datasets annotated by cloud-based annotation services. These services rely on human annotation, which can be prohibitively expensive. In this paper, we consider the problem of hybrid human-machine labeling, in which a classifier is trained to accurately auto-label part of the dataset. However, training the classifier can be expensive too. We propose an iterative approach that minimizes total cost by jointly determining, at each step, which samples should be labeled by humans and which by the trained classifier. We validate our approach on well-known public datasets such as Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet. In some cases, our approach has 6x lower overall cost than having humans label the entire dataset, and it is always cheaper than the cheapest competing strategy.
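The cost trade-off described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it compares a fixed list of candidate plans once, rather than iterating, and it assumes that total cost is simply the human labeling fee plus the classifier training cost. The names `Plan`, `total_cost`, and `cheapest_plan`, and the candidate tuples `(train_size, auto_fraction, training_cost)`, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """One way to split a dataset between humans and a classifier (hypothetical)."""
    human_labeled: int    # samples sent to human annotators, including the training set
    machine_labeled: int  # samples auto-labeled by the trained classifier
    training_cost: float  # cost of training the classifier


def total_cost(plan: Plan, price_per_label: float) -> float:
    # Assumed cost model: human labeling fee plus training cost.
    return plan.human_labeled * price_per_label + plan.training_cost


def cheapest_plan(dataset_size, candidates, price_per_label):
    """Return the cheapest plan among the candidates and the all-human baseline.

    Each candidate is (train_size, auto_fraction, training_cost), where
    auto_fraction is the assumed share of the remaining samples that the
    classifier can label at the required accuracy.
    """
    best = Plan(dataset_size, 0, 0.0)  # baseline: humans label everything
    for train_size, auto_fraction, training_cost in candidates:
        if train_size > dataset_size:
            continue  # cannot train on more samples than exist
        remaining = dataset_size - train_size
        machine = int(remaining * auto_fraction)
        plan = Plan(dataset_size - machine, machine, training_cost)
        if total_cost(plan, price_per_label) < total_cost(best, price_per_label):
            best = plan
    return best


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    candidates = [(100, 0.5, 50.0), (200, 0.75, 300.0)]
    best = cheapest_plan(1000, candidates, price_per_label=1.0)
    print(best, total_cost(best, 1.0))
```

With these made-up numbers, the larger training set lets the classifier label more samples, but its higher training cost makes that plan more expensive overall. This is the kind of tension that motivates choosing the human/machine split and the training effort jointly.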