Although data annotation is essential to the interpretability, research and development of artificial intelligence solutions, most research efforts, such as active learning or few-shot learning, focus on the sample efficiency problem. This paper studies the neglected complementary problem of obtaining annotated data given a predictor. For the simple binary classification setting, we present the spectrum ranging from optimal general solutions to practical efficient methods. The problem is framed as the full annotation of a binary classification dataset with the minimal number of yes/no questions when a predictor is available. For general binary questions, the solution is found in coding theory: the optimal questioning strategy is given by the Huffman encoding of the possible labelings. However, this approach is computationally intractable even for small dataset sizes. We propose an alternative practical solution based on several heuristics and lookahead minimization of proxy cost functions. The proposed solution is analysed, compared with optimal solutions and evaluated on several synthetic and real-world datasets. On these datasets, the method yields a significant improvement ($23$--$86\%$) in annotation efficiency.
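To make the coding-theory view concrete, the following minimal Python sketch computes the expected number of yes/no questions under a Huffman code over all labelings of a tiny dataset. It is an illustration under stated assumptions, not the method proposed in the paper: the predictor is assumed to output a probability `p[i]` that item `i` is positive, labels are assumed independent given those probabilities, and the function names and example values are hypothetical. The sketch enumerates all $2^n$ labelings, which is one way to see why this approach does not scale.

```python
import heapq
import itertools
import math


def labeling_probabilities(p):
    """Probability of each of the 2**n labelings.

    Assumption (not from the source): labels are independent and
    p[i] is the predicted probability that item i is positive.
    """
    probs = {}
    for labels in itertools.product((0, 1), repeat=len(p)):
        prob = 1.0
        for p_i, y in zip(p, labels):
            prob *= p_i if y == 1 else 1.0 - p_i
        probs[labels] = prob
    return probs


def expected_huffman_questions(probs):
    """Expected Huffman codeword length over the labelings.

    Each codeword bit corresponds to one yes/no question. The expected
    length equals the sum of the weights of all merged (internal) nodes
    created while building the Huffman tree.
    """
    heap = list(probs.values())
    heapq.heapify(heap)
    expected = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        expected += a + b
        heapq.heappush(heap, a + b)
    return expected


def entropy_bits(probs):
    """Shannon entropy of the labeling distribution, in bits."""
    return -sum(q * math.log2(q) for q in probs.values() if q > 0.0)


if __name__ == "__main__":
    # Hypothetical predictor outputs for a 4-item dataset.
    p = [0.9, 0.9, 0.1, 0.95]
    probs = labeling_probabilities(p)
    print("naive questions (one per item):", len(p))
    print("expected Huffman questions:", expected_huffman_questions(probs))
    print("entropy lower bound (bits):", entropy_bits(probs))
```

With a confident predictor, the expected number of questions falls below the one-question-per-item baseline and stays within one bit of the entropy of the labeling distribution. The table of labelings doubles with every added item, so this exact construction is only usable for very small `n`.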