Data selection methods have been studied extensively in active learning, data pruning, and data augmentation, but there is little evidence of their efficacy in industry-scale settings, particularly for low-resource languages. Our work presents ways of scoring prospective training examples in those settings by their "usefulness" or "difficulty", and demonstrates how these scores can be used to select important examples for training supervised machine learning models. We primarily experiment with entropy and Error L2-Norm (EL2N) scores. We use these metrics to curate high-quality datasets from a large pool of \textit{Weak Signal Labeled} data, in which no-defect, high-confidence hypotheses produced during inference are treated as ground-truth labels. We then conduct training data augmentation experiments using these de-identified datasets and demonstrate that score-based selection can yield a 2% decrease in semantic error rate and a 4%-7% decrease in domain classification error rate compared with the baseline technique of random selection.
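To make the two scores named in the abstract concrete, the following is a minimal sketch, not the implementation used in this work. It assumes each candidate example comes with a vector of predicted class probabilities from a single model and a (weak) label index. Entropy is the Shannon entropy of that vector, and EL2N is the L2 norm of the difference between the probability vector and the one-hot label. The function names, the toy probabilities, and the rule of keeping the top-k highest-scoring examples are all illustrative assumptions.

```python
import math


def entropy_score(probs):
    """Shannon entropy (in nats) of one predicted probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def el2n_score(probs, label):
    """L2 norm of (predicted probabilities - one-hot label) for one example."""
    return math.sqrt(
        sum((p - (1.0 if i == label else 0.0)) ** 2 for i, p in enumerate(probs))
    )


def select_top_k(scores, k):
    """Indices of the k highest-scoring examples, highest first.

    Keeping the highest scores is an assumption made for illustration;
    other selection rules (e.g. a score band) are equally possible.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical pool: three examples over three classes, with weak labels.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident and consistent with its label
    [0.40, 0.35, 0.25],  # uncertain prediction
    [0.10, 0.80, 0.10],  # fairly confident
]
pool_labels = [0, 0, 1]

entropies = [entropy_score(p) for p in pool_probs]
el2n = [el2n_score(p, y) for p, y in zip(pool_probs, pool_labels)]

selected_by_entropy = select_top_k(entropies, k=2)
selected_by_el2n = select_top_k(el2n, k=2)
```

In this toy pool both scores rank the uncertain second example highest and the confident first example lowest. Random selection, the baseline mentioned above, would instead draw the k indices uniformly from the pool.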