In recent years, active learning has been successfully applied to an array of NLP tasks. However, prior work often assumes that training and test data are drawn from the same distribution. This is problematic because, in real-life settings, data may stem from several sources of varying relevance and quality. We show that four popular active learning schemes fail to outperform random selection when applied to unlabelled pools composed of multiple data sources on the task of natural language inference. We reveal that uncertainty-based strategies perform poorly due to the acquisition of collective outliers, i.e., hard-to-learn instances that hamper learning and generalization. When outliers are removed, the strategies recover and outperform random baselines. In further analysis, we find that collective outliers vary in form between sources, and show that hard-to-learn data is not always categorically harmful. Lastly, we leverage dataset cartography to introduce difficulty-stratified testing and find that different strategies are affected differently by example learnability and difficulty.