While quality estimation (QE) can play an important role in the translation process, its effectiveness relies on the availability and quality of training data. For QE in particular, high-quality labeled data is often lacking because labeling is costly and labor-intensive. Beyond data scarcity, QE models should also generalize, i.e., they should be able to handle data from different domains, both generic and specific. To alleviate these two main issues -- data scarcity and domain mismatch -- this paper combines domain adaptation and data augmentation within a robust QE system. Our method first trains a generic QE model and then fine-tunes it on a specific domain while retaining generic knowledge. Our results show a significant improvement for all the language pairs investigated, better cross-lingual inference, and superior performance in zero-shot learning scenarios compared to state-of-the-art baselines.
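The two-stage recipe described above (generic training, then domain fine-tuning that retains generic knowledge) can be illustrated with a toy sketch. It is not the system from this paper: the abstract does not say how generic knowledge is retained, so the sketch assumes one common option, mixing a sample of generic data back into each fine-tuning epoch. A linear regressor trained with SGD stands in for a neural QE model, and every name here (`sgd_epoch`, `train_generic_then_adapt`, `replay_ratio`) is hypothetical.

```python
import random


def sgd_epoch(weights, data, lr):
    """Run one SGD pass of a linear regressor with squared-error loss (in place)."""
    for features, score in data:
        pred = sum(w * x for w, x in zip(weights, features))
        err = pred - score
        for i, x in enumerate(features):
            weights[i] -= lr * err * x
    return weights


def mse(weights, data):
    """Mean squared error of the linear model on (features, score) pairs."""
    total = 0.0
    for features, score in data:
        pred = sum(w * x for w, x in zip(weights, features))
        total += (pred - score) ** 2
    return total / len(data)


def train_generic_then_adapt(generic, in_domain, dim,
                             epochs=50, lr=0.05, replay_ratio=0.5, seed=0):
    """Stage 1: fit on generic data. Stage 2: fine-tune on in-domain data.

    ASSUMPTION: generic knowledge is retained by replaying a random sample
    of generic examples in every fine-tuning epoch. The source abstract
    does not specify the mechanism.
    """
    rng = random.Random(seed)
    weights = [0.0] * dim

    # Stage 1: generic model.
    for _ in range(epochs):
        batch = list(generic)
        rng.shuffle(batch)
        sgd_epoch(weights, batch, lr)
    generic_weights = list(weights)

    # Stage 2: domain fine-tuning with generic replay.
    n_replay = min(int(len(in_domain) * replay_ratio), len(generic))
    for _ in range(epochs):
        batch = list(in_domain) + rng.sample(generic, n_replay)
        rng.shuffle(batch)
        sgd_epoch(weights, batch, lr)

    return generic_weights, weights
```

Each example is a `(features, score)` pair, where `features` is a fixed-length list of floats (including a constant 1.0 if a bias term is wanted) and `score` is the quality label. Setting `replay_ratio=0` reduces stage 2 to plain fine-tuning, which makes it easy to compare how much generic-domain accuracy is lost without replay.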