Active learning (AL) techniques reduce labeling costs for training neural machine translation (NMT) models by selecting smaller, representative subsets of unlabeled data for annotation. Diversity sampling techniques select heterogeneous instances, while uncertainty sampling methods select the instances with the highest model uncertainty. Both approaches have limitations: diversity methods may extract varied but trivial examples, while uncertainty sampling can yield repetitive, uninformative instances. To bridge this gap, we propose HUDS, a hybrid AL strategy for domain adaptation in NMT that combines uncertainty and diversity for sentence selection. HUDS computes uncertainty scores for unlabeled sentences and then stratifies the sentences by those scores. It then clusters sentence embeddings within each stratum using k-MEANS and computes each sentence's diversity score as its distance to the cluster centroid. A weighted hybrid score that combines uncertainty and diversity is then used to select the top instances for annotation in each AL iteration. Experiments on multi-domain German-English datasets show that HUDS outperforms other strong AL baselines. We analyze the sentence selection with HUDS and show that it prioritizes diverse instances with high model uncertainty for annotation in early AL iterations.
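The selection pipeline described above (score uncertainty, stratify, cluster within each stratum, score diversity by distance to the centroid, then rank by a weighted hybrid score) can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several details are assumptions the abstract does not specify: uncertainty scores are taken as a precomputed input, strata are equal-sized slices of the uncertainty ranking, both scores are min-max normalized, and the hybrid score is the convex combination `lam * uncertainty + (1 - lam) * diversity`. The names `huds_select`, `kmeans`, `min_max`, `n_strata`, `k`, and `lam` are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def huds_select(embeddings, uncertainty, budget, n_strata=2, k=2, lam=0.5):
    """Return (indices of the top `budget` sentences, hybrid score per sentence)."""
    n = len(embeddings)
    # Assumption: strata are equal-sized slices of the uncertainty ranking.
    order = sorted(range(n), key=lambda i: uncertainty[i])
    size = math.ceil(n / n_strata)
    strata = [order[s:s + size] for s in range(0, n, size)]

    diversity = [0.0] * n
    for stratum in strata:
        pts = [embeddings[i] for i in stratum]
        centroids, assign = kmeans(pts, min(k, len(pts)))
        # Diversity score: distance from a sentence to its cluster centroid.
        for i, p, a in zip(stratum, pts, assign):
            diversity[i] = math.dist(p, centroids[a])

    # Assumption: both scores are normalized before the weighted sum.
    u, d = min_max(uncertainty), min_max(diversity)
    hybrid = [lam * u[i] + (1 - lam) * d[i] for i in range(n)]
    ranked = sorted(range(n), key=lambda i: hybrid[i], reverse=True)
    return ranked[:budget], hybrid
```

Calling `huds_select(embeddings, uncertainty, budget=100)` once per AL iteration would return the indices of the sentences to send for annotation. Under the assumptions of this sketch, setting `lam=1.0` reduces it to pure uncertainty sampling, and `lam=0.0` ranks by distance to the centroid alone.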