Annotating data via crowdsourcing is time-consuming and expensive. Because of these costs, dataset creators often have each annotator label only a small subset of the data. The result is a sparse dataset in which each example is labeled by only a few annotators. The downside is that when an annotator does not label a particular example, their perspective on it is lost. This is especially concerning for subjective NLP datasets, where there is no single correct label and people may hold different valid opinions. We therefore propose using imputation methods to generate the opinions of all annotators for all examples, creating a dataset that does not leave out any annotator's view. We then train and prompt models on data from the imputed dataset to predict both the distribution of responses and individual annotations. In our analysis, we find that the choice of imputation method significantly affects how the soft labels change and how they are distributed. Although imputation introduces noise when predicting the original dataset, it shows potential for improving the few-shot examples used in prompts, particularly for annotators with low response rates. All of our code and data are publicly available.
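To make the imputation idea concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes a small binary-label matrix with examples as rows and annotators as columns, and it swaps in a deliberately simple baseline (filling each missing cell with that annotator's most frequent observed label). The names `MISSING`, `soft_labels`, `impute_annotator_mode`, and the toy `ratings` matrix are all hypothetical and chosen only for illustration.

```python
import numpy as np

MISSING = -1  # hypothetical sentinel: the annotator did not label this example


def soft_labels(matrix, num_classes):
    """Per-example distribution over classes, ignoring missing entries."""
    counts = np.zeros((matrix.shape[0], num_classes))
    for c in range(num_classes):
        counts[:, c] = (matrix == c).sum(axis=1)
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with no observed labels stay all-zero instead of dividing by zero.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


def impute_annotator_mode(matrix, num_classes):
    """Baseline imputation (an assumption, not the paper's method):
    fill each missing cell with that annotator's most frequent observed label."""
    filled = matrix.copy()
    for a in range(matrix.shape[1]):
        column = matrix[:, a]
        observed = column[column != MISSING]
        if observed.size == 0:
            continue  # no observed labels for this annotator; leave cells missing
        mode = np.bincount(observed, minlength=num_classes).argmax()
        filled[column == MISSING, a] = mode
    return filled


# Toy data: 4 examples (rows) x 3 annotators (columns), binary labels.
ratings = np.array([
    [1, MISSING, 0],
    [1, 1, MISSING],
    [MISSING, 0, 0],
    [1, 1, 0],
])

before = soft_labels(ratings, num_classes=2)
imputed = impute_annotator_mode(ratings, num_classes=2)
after = soft_labels(imputed, num_classes=2)

print("soft labels from observed annotations:\n", before)
print("soft labels after imputation:\n", after)
```

Even on this toy matrix, the soft label of each example shifts once the missing cells are filled in, which illustrates why the choice of imputation method matters for the resulting label distributions. A stronger imputer (for example, one based on matrix factorization) would slot in behind the same function signature.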