Active learning (AL) techniques reduce labeling costs for training neural machine translation (NMT) models by selecting smaller representative subsets from unlabeled data for annotation. Diversity sampling techniques select heterogeneous instances, while uncertainty sampling methods select instances with the highest model uncertainty. Both approaches have limitations: diversity methods may extract varied but trivial examples, while uncertainty sampling can yield repetitive, uninformative instances. To bridge this gap, we propose Hybrid Uncertainty and Diversity Sampling (HUDS), an AL strategy for domain adaptation in NMT that combines uncertainty and diversity for sentence selection. HUDS computes uncertainty scores for unlabeled sentences and then stratifies the sentences by those scores. It then clusters sentence embeddings within each stratum and computes each sentence's diversity score as its distance to the cluster centroid. A weighted hybrid score that combines uncertainty and diversity is then used to select the top instances for annotation in each AL iteration. Experiments on multi-domain German-English and French-English datasets show that HUDS outperforms other strong AL baselines. We analyze the sentence selection with HUDS and show that it prioritizes diverse instances with high model uncertainty for annotation in early AL iterations.
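The selection steps described above (score, stratify, cluster, combine, pick the top instances) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how strata are formed, how many clusters are used, how scores are normalized, or the exact form of the weighting, so the equal-size strata, the tiny k-means routine, the min-max normalization, and the names `huds_select`, `n_strata`, `n_clusters`, and `lam` are all assumptions made for illustration. Uncertainty scores and sentence embeddings are assumed to be computed beforehand by the NMT model and an encoder.

```python
import math
import random


def _kmeans(points, k, iters=20, seed=0):
    """Tiny Lloyd's k-means; returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, min(k, len(points)))]

    def nearest(p):
        return min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iters):
        assign = [nearest(p) for p in points]
        # Recompute each centroid as the mean of its current members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(p) for p in points]


def _minmax(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def huds_select(uncertainty, embeddings, budget, n_strata=2, n_clusters=2, lam=0.5):
    """Return indices of the `budget` sentences with the highest hybrid score.

    Assumptions (not specified in the abstract): strata are equal-size bins
    of the uncertainty ranking, and the hybrid score is
    lam * uncertainty + (1 - lam) * diversity on min-max normalized scores.
    """
    n = len(uncertainty)
    order = sorted(range(n), key=lambda i: uncertainty[i])
    diversity = [0.0] * n
    for s in range(n_strata):
        stratum = order[s * n // n_strata:(s + 1) * n // n_strata]
        if not stratum:
            continue
        pts = [embeddings[i] for i in stratum]
        centroids, assign = _kmeans(pts, n_clusters)
        # Diversity score: distance from the sentence to its cluster centroid.
        for i, p, a in zip(stratum, pts, assign):
            diversity[i] = math.dist(p, centroids[a])
    u, d = _minmax(uncertainty), _minmax(diversity)
    hybrid = [lam * ui + (1 - lam) * di for ui, di in zip(u, d)]
    return sorted(range(n), key=lambda i: hybrid[i], reverse=True)[:budget]
```

In this sketch, `lam=1.0` reduces the selection to pure uncertainty sampling and `lam=0.0` to pure distance-to-centroid sampling, which makes the trade-off between the two criteria easy to inspect on a small pool before each annotation round.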