Label aggregation, such as majority voting, is commonly used to resolve annotator disagreement during dataset creation. However, aggregation may disregard minority values and opinions. Recent studies indicate that learning from individual annotations outperforms learning from aggregated labels, though it requires a considerable amount of annotation. Active learning, an annotation cost-saving strategy, has not been fully explored in the context of learning from disagreement. We show that, in the active learning setting, a multi-head model performs significantly better than a single-head model in terms of uncertainty estimation. By designing and evaluating acquisition functions with annotator-specific heads on two datasets, we show that group-level entropy works well in general across both. Importantly, it achieves prediction and uncertainty-estimation performance comparable to full-scale training from disagreement while saving up to 70% of the annotation budget.
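To make the idea of a group-level entropy acquisition function concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not define the score, so we assume here that it is the entropy of the class distribution averaged over all annotator-specific heads, and that the highest-scoring unlabeled samples are sent for annotation. The function names (`softmax`, `group_level_entropy`, `select_batch`) and the input layout (one list of per-head logits per sample) are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one head's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def group_level_entropy(head_logits: Sequence[Sequence[float]]) -> float:
    """Entropy of the mean class distribution across annotator heads.

    head_logits holds one logit vector per annotator-specific head for a
    single sample. Averaging before taking the entropy is an assumption.
    """
    head_probs = [softmax(h) for h in head_logits]
    n_heads = len(head_probs)
    n_classes = len(head_probs[0])
    group = [sum(p[c] for p in head_probs) / n_heads for c in range(n_classes)]
    return -sum(p * math.log(p) for p in group if p > 0.0)


def select_batch(pool_logits: Sequence[Sequence[Sequence[float]]],
                 batch_size: int) -> List[int]:
    """Return indices of the batch_size pool samples with the highest score."""
    scores = [group_level_entropy(sample) for sample in pool_logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


# Hypothetical pool: sample 0 has two heads that agree, sample 1 has two
# heads that disagree, so sample 1 should be queried first.
pool = [
    [[5.0, 0.0], [5.0, 0.0]],
    [[5.0, 0.0], [0.0, 5.0]],
]
chosen = select_batch(pool, batch_size=1)
```

Under this assumed definition, a sample scores high either when the heads disagree with one another or when each head is individually uncertain, which is why it can serve as a single signal for deciding what to annotate next.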