Annotation Imputation to Individualize Predictions: Initial Studies on Distribution Dynamics and Model Predictions

Annotating data via crowdsourcing is time-consuming and expensive. Because of these costs, dataset creators often have each annotator label only a small subset of the data. This produces sparse datasets in which each example is labeled by only a few annotators. The downside is that if an annotator never labels a particular example, their perspective on it is lost. This is especially concerning for subjective NLP datasets, where there is no single correct label and people may hold different valid opinions. We therefore propose using imputation methods to generate the opinions of all annotators for all examples, creating a dataset that does not leave out any annotator's view. We then train and prompt models, using data from the imputed dataset, to predict both the distribution of responses and individual annotations. In our analysis of the results, we found that the choice of imputation method significantly affects how the soft labels and their distribution change. Although imputation introduces noise when predicting the original dataset, it shows potential for improving the few-shot examples used in prompts, particularly for annotators with low response rates. We have made all of our code and data publicly available.
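To make the setup concrete, here is a minimal sketch of a sparse example-by-annotator matrix, one simple way to fill in the missing cells, and the soft labels before and after imputation. The imputation rule used here (fill each gap with that annotator's most frequent label) is only an illustrative stand-in and is not claimed to be one of the methods studied in the paper. The function names and the toy data are hypothetical.

```python
import numpy as np


def soft_labels(annotations, num_labels):
    """Per-example label distribution over the observed (non-NaN) annotations."""
    dist = np.zeros((annotations.shape[0], num_labels))
    for i, row in enumerate(annotations):
        observed = row[~np.isnan(row)].astype(int)
        counts = np.bincount(observed, minlength=num_labels)
        if counts.sum() > 0:
            dist[i] = counts / counts.sum()
    return dist


def impute_annotator_mode(annotations, num_labels):
    """Illustrative stand-in: fill each gap with that annotator's most frequent label."""
    filled = annotations.copy()
    all_observed = annotations[~np.isnan(annotations)].astype(int)
    # Fallback for annotators with no observed labels at all.
    global_mode = np.bincount(all_observed, minlength=num_labels).argmax()
    for j in range(annotations.shape[1]):
        col = annotations[:, j]
        observed = col[~np.isnan(col)].astype(int)
        if observed.size:
            mode = np.bincount(observed, minlength=num_labels).argmax()
        else:
            mode = global_mode
        filled[np.isnan(col), j] = mode
    return filled


# Toy data (hypothetical): rows are examples, columns are annotators,
# NaN marks an example the annotator never labeled.
sparse = np.array([
    [0, np.nan, 1],
    [np.nan, 1, np.nan],
    [0, np.nan, np.nan],
    [np.nan, 1, 1],
])

before = soft_labels(sparse, num_labels=2)
imputed = impute_annotator_mode(sparse, num_labels=2)
after = soft_labels(imputed, num_labels=2)
```

Even on this toy matrix the soft labels shift once every annotator has a value for every example, which is the kind of change the abstract reports as depending strongly on the imputation method.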

