Consistency is a key requirement of high-quality translation. In domain-specific projects, it is especially important to adhere to pre-approved terminology and to adapt to corrected translations. Machine translation (MT) has achieved significant progress in the area of domain adaptation. However, in-domain data scarcity is common in translation settings, due to the lack of specialised datasets and terminology, or to the inconsistency and inaccuracy of available in-domain translations. In such scenarios, where there is insufficient in-domain data to fine-tune MT models, producing translations that are consistent with the relevant context is challenging. While real-time adaptation can make use of smaller amounts of in-domain data to improve translation on the fly, it remains difficult due to limits on the supported context and efficiency constraints. Large language models (LLMs) have recently shown interesting in-context learning capabilities: they learn to replicate certain input-output text generation patterns without further fine-tuning. Such capabilities have opened new horizons for domain-specific data augmentation and real-time adaptive MT. This work attempts to address two main questions: 1) in scenarios involving human interaction and continuous feedback, can we employ language models to improve the quality of adaptive MT at inference time? and 2) in the absence of sufficient in-domain data, can we use pre-trained large-scale language models to improve the process of MT domain adaptation?