Machine Translation (MT) and Quality Estimation (QE) perform well in general domains but degrade under domain mismatch. This dissertation studies how to adapt MT and QE systems to specialized domains through a set of data-focused contributions. Chapter 2 presents a similarity-based data selection method for MT. Small, targeted in-domain subsets outperform much larger generic datasets and reach strong translation quality at lower computational cost. Chapter 3 introduces a staged QE training pipeline that combines domain adaptation with lightweight data augmentation. The method improves performance across domains, languages, and resource settings, including zero-shot and cross-lingual cases. Chapter 4 studies the role of subword tokenization and vocabulary in fine-tuning. Aligned tokenization-vocabulary setups lead to stable training and better translation quality, while mismatched configurations reduce performance. Chapter 5 proposes a QE-guided in-context learning method for large language models. QE models select examples that improve translation quality without parameter updates and outperform standard retrieval methods. The approach also supports a reference-free setup, reducing reliance on a single reference set. These results show that domain adaptation depends on data selection, representation, and efficient adaptation strategies. The dissertation provides methods for building MT and QE systems that perform reliably in domain-specific settings.
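As a rough illustration of the similarity-based data selection summarized for Chapter 2, the sketch below ranks sentences from a generic pool by their similarity to a small in-domain seed set and keeps the top k. The abstract does not specify the similarity measure or the selection procedure, so the bag-of-words cosine similarity used here is only a stand-in, and the names `bow`, `cosine`, and `select_in_domain` are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased, whitespace-tokenized bag-of-words vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_in_domain(generic_pool, in_domain_seed, k):
    """Return the k pool sentences most similar to the in-domain seed set.

    The seed set is summarized as one aggregate bag-of-words vector;
    this is an illustrative choice, not the dissertation's method.
    """
    centroid = Counter()
    for sentence in in_domain_seed:
        centroid.update(bow(sentence))
    ranked = sorted(
        generic_pool,
        key=lambda sentence: cosine(bow(sentence), centroid),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    seed = [
        "the patient received a dose of insulin",
        "dose adjustment for the patient",
    ]
    pool = [
        "the patient was given a higher dose",
        "the football match ended in a draw",
        "stock prices fell sharply today",
        "insulin dose for the diabetic patient",
    ]
    for sentence in select_in_domain(pool, seed, k=2):
        print(sentence)
```

In practice the lexical vectors would likely be replaced by sentence embeddings, but the selection step (score each candidate against the in-domain data, keep the closest subset) would keep the same shape.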