Selecting an effective training signal for natural language processing tasks is difficult: expert annotations are expensive, and crowd-sourced annotations may be unreliable. At the same time, recent work in NLP has shown that learning from a distribution over labels acquired from crowd annotations can be effective. However, there are many ways to acquire such a distribution, and the performance afforded by any one method can fluctuate with the task and the amount of available crowd annotations, making it difficult to know a priori which distribution is best. This paper systematically analyzes this question in the out-of-domain setting, adding to an NLP literature that has focused on in-domain evaluation, and proposes new methods for acquiring soft labels from crowd annotations by aggregating the distributions produced by existing methods. In particular, we propose to aggregate multiple views of crowd annotations via temperature scaling and by finding their Jensen-Shannon centroid. We demonstrate that these aggregation methods yield the most consistent performance across four NLP tasks on out-of-domain test sets, mitigating the fluctuations in performance of the individual distributions. Additionally, aggregation results in the most consistently well-calibrated uncertainty estimates. We argue that aggregating different views of crowd annotations is an effective and minimal intervention for acquiring soft labels that induce robust classifiers despite the inconsistency of the individual soft-labeling methods.
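To make the aggregation idea concrete, the following is a minimal sketch of how several soft labels for a single example might be temperature-scaled and combined into a Jensen-Shannon centroid, i.e. the distribution minimizing the summed Jensen-Shannon divergence to the inputs. It is an illustration under stated assumptions, not the paper's implementation: the function names, the fixed temperature, the example distributions, and the fixed-point update (obtained by linearizing the concave part of the objective and solving the remaining convex problem) are all choices made here. The abstract does not say how the temperature is selected.

```python
import numpy as np

EPS = 1e-12  # guards against log(0); an assumption of this sketch


def temperature_scale(p, temperature):
    """Rescale a distribution as p_k^(1/T), renormalized.

    T > 1 flattens the distribution, T < 1 sharpens it, T = 1 leaves it unchanged.
    """
    logits = np.log(np.clip(p, EPS, 1.0)) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    q = np.exp(logits)
    return q / q.sum()


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    p = np.clip(p, EPS, 1.0)
    q = np.clip(q, EPS, 1.0)
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))


def js_centroid(dists, n_iter=200, tol=1e-10):
    """Approximate the Jensen-Shannon centroid of the rows of `dists`.

    Fixed-point update: the new centroid is the normalized geometric mean
    of the mixtures (p_i + c) / 2, starting from the arithmetic mean.
    """
    dists = np.clip(np.asarray(dists, dtype=float), EPS, 1.0)
    dists /= dists.sum(axis=1, keepdims=True)
    c = dists.mean(axis=0)
    for _ in range(n_iter):
        mix = 0.5 * (dists + c)                   # one mixture per input view
        new_c = np.exp(np.log(mix).mean(axis=0))  # geometric mean over views
        new_c /= new_c.sum()
        converged = np.abs(new_c - c).max() < tol
        c = new_c
        if converged:
            break
    return c


# Hypothetical soft labels for one example from three soft-labeling methods.
views = np.array([
    [0.70, 0.20, 0.10],
    [0.50, 0.45, 0.05],
    [0.90, 0.05, 0.05],
])
scaled = np.array([temperature_scale(v, temperature=2.0) for v in views])
soft_label = js_centroid(scaled)
```

The resulting `soft_label` could then serve as the training target for that example, for instance with a cross-entropy or KL-divergence loss against the classifier's predicted distribution.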