Annotation Imputation to Individualize Predictions: Initial Studies on Distribution Dynamics and Model Predictions

Annotating data via crowdsourcing is time-consuming and expensive. Owing to these costs, dataset creators often have each annotator label only a small subset of the data. This leads to sparse datasets with examples that are marked by few annotators; if an annotator is not selected to label an example, their opinion regarding it is lost. This is especially concerning for subjective NLP datasets where there is no correct label: people may have different valid opinions. Thus, we propose using imputation methods to restore the opinions of all annotators for all examples, creating a dataset that does not leave out any annotator's view. We then train and prompt models with data from the imputed dataset (rather than the original sparse dataset) to make predictions about majority and individual annotations. Unfortunately, the imputed data provided by our baseline methods does not improve predictions. However, through our analysis of it, we develop a strong understanding of how different imputation methods impact the original data in order to inform future imputation techniques. We make all of our code and data publicly available.
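To make the idea of restoring missing annotations concrete, the following is a minimal sketch of filling in a sparse example-by-annotator label matrix. The abstract does not name its baseline imputation methods, so the per-example mode fill used here is only an assumed stand-in, and the function name `impute_by_example_mode` and the toy labels are hypothetical.

```python
import numpy as np


def impute_by_example_mode(ratings: np.ndarray) -> np.ndarray:
    """Fill missing (NaN) labels with the most common observed label per example.

    ratings has shape (n_examples, n_annotators); NaN marks an annotator who
    did not label that example. This is an assumed stand-in imputer, not the
    method used in the paper.
    """
    filled = ratings.copy()
    for i, row in enumerate(ratings):
        observed = row[~np.isnan(row)]
        if observed.size == 0:
            continue  # no observed labels to impute from; leave the row missing
        values, counts = np.unique(observed, return_counts=True)
        mode = values[np.argmax(counts)]  # ties resolve to the smallest label
        filled[i, np.isnan(row)] = mode
    return filled


# Toy sparse dataset: 3 examples, 4 annotators, labels in {0, 1, 2}.
sparse = np.array([
    [1.0, np.nan, 1.0, np.nan],
    [np.nan, 0.0, np.nan, 2.0],
    [2.0, 2.0, np.nan, 0.0],
])
dense = impute_by_example_mode(sparse)
```

A fill of this kind assigns every missing annotator the per-example majority label, so it cannot recover individual disagreement. It only illustrates the shape of the data before and after imputation.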

