Researchers have traditionally recruited native speakers to provide annotations for widely used benchmark datasets. However, for some languages recruiting native speakers is difficult, and recruiting learners of those languages to annotate the data would help. In this paper, we investigate whether language learners can contribute annotations to benchmark datasets. In a carefully controlled annotation experiment, we recruit 36 language learners, provide two types of additional resources (dictionaries and machine-translated sentences), and administer mini-tests to measure their language proficiency. We target three languages (English, Korean, and Indonesian) and four NLP tasks: sentiment analysis, natural language inference, named entity recognition, and machine reading comprehension. We find that language learners, especially those with intermediate or advanced proficiency, can provide fairly accurate labels with the help of additional resources. Moreover, we show that data annotation improves learners' language proficiency in terms of vocabulary and grammar. One implication of our findings is that broadening annotation to include language learners makes it possible to build benchmark datasets for languages for which native speakers are difficult to recruit.