Knowing exactly how many data points need to be labeled to achieve a certain model performance is a hugely beneficial step towards reducing the overall annotation budget. It pertains to both active learning and traditional data annotation, and is particularly beneficial in low-resource scenarios. Nevertheless, it remains a largely under-explored area of research in NLP. We therefore explored various techniques for estimating the training sample size necessary to achieve a targeted performance value. We derived a simple yet effective approach to predict the maximum achievable model performance from a small number of training samples, which serves as an early indicator during data annotation for data quality and sample size determination. We performed ablation studies on four language understanding tasks, and showed that the proposed approach allows us to forecast model performance within a small mean absolute error (~0.9%) using only 10% of the data.
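As a rough illustration of what forecasting performance from a small training sample can look like, the sketch below extrapolates a learning curve. It is not the approach proposed here, whose details the abstract does not give. It assumes an inverse power law, acc(n) ~ a - b * n^(-c), which is a common choice for learning curves. The function names `fit_power_law` and `samples_needed` are hypothetical, and the scores are synthetic values generated from a known curve rather than experimental results.

```python
# Hypothetical sketch: extrapolate a learning curve with an inverse power law,
# acc(n) ~ a - b * n**(-c). The functional form and all names are assumptions,
# not the method described in the text.

def fit_power_law(sizes, scores, c_grid=None):
    """Grid-search c; for each c, solve ordinary least squares for a and b."""
    if c_grid is None:
        c_grid = [i / 100 for i in range(5, 201)]  # c in [0.05, 2.00]
    best = None
    for c in c_grid:
        x = [n ** (-c) for n in sizes]             # regressor for this c
        mx = sum(x) / len(x)
        my = sum(scores) / len(scores)
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, scores))
        slope = sxy / sxx                          # slope equals -b
        a = my - slope * mx                        # asymptote: estimated max score
        b = -slope
        sse = sum((a - b * xi - yi) ** 2 for xi, yi in zip(x, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def samples_needed(target, a, b, c):
    """Invert the fitted curve; None if the target is not below the asymptote."""
    if target >= a or b <= 0:
        return None
    return (b / (a - target)) ** (1 / c)


# Synthetic scores from a known curve (a=0.90, b=2.0, c=0.5); not real results.
sizes = [100, 200, 400, 800, 1600]
scores = [0.90 - 2.0 * n ** -0.5 for n in sizes]

a, b, c = fit_power_law(sizes, scores)
needed = samples_needed(0.85, a, b, c)
unreachable = samples_needed(0.95, a, b, c)
```

Under these assumptions, the fitted asymptote `a` plays the role of the maximum achievable performance, and `samples_needed` returns `None` when a target lies at or above that asymptote, which is one way such an estimate could flag an unrealistic target early in annotation.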