Annotating data via crowdsourcing is time-consuming and expensive. Owing to these costs, dataset creators often have each annotator label only a small subset of the data. This leads to sparse datasets in which each example is labeled by only a few annotators; if an annotator is not selected to label an example, their opinion of it is lost. This is especially concerning for subjective NLP datasets, where there is no single correct label: people may hold different valid opinions. We therefore propose using imputation methods to restore the opinions of all annotators for all examples, creating a dataset that does not leave out any annotator's view. We then train and prompt models with data from the imputed dataset (rather than the original sparse dataset) to make predictions about majority and individual annotations. Unfortunately, the imputed data produced by our baseline methods does not improve predictions. However, by analyzing the imputed data, we develop a strong understanding of how different imputation methods affect the original data, which can inform future imputation techniques. We make all of our code and data publicly available.
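To make the setup concrete, the following is a minimal sketch of what imputing a sparse annotation matrix can look like. It is not one of the baseline methods evaluated in this work: the function name `impute_with_example_mode`, the toy matrix, and the rule itself (fill each missing cell with the most common observed label for that example) are all assumptions chosen for illustration only.

```python
import numpy as np


def impute_with_example_mode(ratings: np.ndarray) -> np.ndarray:
    """Fill missing (NaN) cells with the most common observed label per example.

    Hypothetical baseline for illustration; rows are annotators,
    columns are examples, and NaN marks "annotator did not label this example".
    """
    filled = ratings.copy()
    for j in range(ratings.shape[1]):
        col = ratings[:, j]
        missing = np.isnan(col)
        observed = col[~missing]
        if observed.size == 0:
            continue  # nothing observed for this example, so leave it missing
        labels, counts = np.unique(observed, return_counts=True)
        # np.argmax returns the first maximum, so ties go to the smallest label.
        filled[missing, j] = labels[np.argmax(counts)]
    return filled


nan = np.nan
# Toy sparse dataset: 4 annotators (rows) x 3 examples (columns), binary labels.
sparse = np.array([
    [1.0, nan, 0.0],
    [1.0, 0.0, nan],
    [nan, 0.0, 1.0],
    [0.0, nan, 1.0],
])

imputed = impute_with_example_mode(sparse)
print(imputed)
```

This naive rule pushes every missing cell toward the majority label for its example, so it cannot recover a minority opinion that an annotator might have held. Methods that model individual annotators would be needed for that.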