BERT-based models perform strongly on leaderboards, yet they perform demonstrably worse in real-world settings that require generalization. A limited quantity of training data is considered a key impediment to achieving generalizability in machine learning. In this paper, we examine the impact of training data quality, not quantity, on a model's generalizability. We consider two characteristics of training data: the proportion of human-adversarial (h-adversarial) samples, i.e., sample pairs with seemingly minor differences but different ground-truth labels, and the proportion of human-affable (h-affable) samples, i.e., sample pairs with minor differences but the same ground-truth label. We find that, for a fixed training set size, as a rule of thumb, having 10-30% h-adversarial instances improves precision, and therefore F1, by up to 20 points in the tasks of text classification and relation extraction. Increasing the share of h-adversarials beyond this range can cause performance to plateau or even degrade. In contrast, h-affables may not contribute to a model's generalizability and may even degrade generalization performance.
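To make the two definitions concrete, the following is a minimal sketch of how the proportion of h-adversarial and h-affable samples in a labeled training set might be estimated. It rests on assumptions that are not part of the abstract: "minor differences" is approximated by word-level overlap from Python's `difflib.SequenceMatcher`, the 0.75 threshold is arbitrary, and the function names `pair_kind` and `sample_portions` are hypothetical. The abstract describes these categories in terms of human perception, so an automatic proxy like this one is only illustrative.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pair_kind(text_a, label_a, text_b, label_b, threshold=0.75):
    """Classify a sample pair as 'h-adversarial', 'h-affable', or None.

    Assumption: word-level overlap stands in for "minor differences";
    the threshold value is illustrative, not taken from the paper.
    """
    similarity = SequenceMatcher(None, text_a.split(), text_b.split()).ratio()
    if similarity < threshold:
        return None  # the texts differ too much to count as a near-duplicate pair
    return "h-adversarial" if label_a != label_b else "h-affable"


def sample_portions(samples, threshold=0.75):
    """Return the fraction of samples that occur in at least one pair of each kind.

    `samples` is a list of (text, label) tuples. All pairs are compared,
    so the cost grows quadratically with the number of samples.
    """
    members = {"h-adversarial": set(), "h-affable": set()}
    for (i, (text_a, label_a)), (j, (text_b, label_b)) in combinations(
        enumerate(samples), 2
    ):
        kind = pair_kind(text_a, label_a, text_b, label_b, threshold)
        if kind is not None:
            members[kind].update((i, j))
    total = len(samples)
    if total == 0:
        return {kind: 0.0 for kind in members}
    return {kind: len(indices) / total for kind, indices in members.items()}


# Made-up toy data, used only to exercise the functions above.
toy_samples = [
    ("the drug inhibits the protein", "negative"),
    ("the drug activates the protein", "positive"),
    ("the drug strongly activates the protein", "positive"),
    ("patients were enrolled at three sites", "other"),
]
portions = sample_portions(toy_samples)
```

In the toy data, the first two sentences differ by one word and carry different labels, so they form an h-adversarial pair; the second and third differ by one inserted word and share a label, so they form an h-affable pair.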