Translation into severely low-resource languages serves both a cultural goal, saving and reviving those languages, and a humanitarian goal, meeting the everyday needs of local communities, which the recent COVID-19 pandemic has made more urgent. In many humanitarian efforts, translation into severely low-resource languages does not require a universal translation engine, but rather a dedicated, text-specific one. For example, healthcare records, hygiene procedures, government communications, emergency procedures and religious texts are all limited texts. While generic translation engines for all languages do not exist, translating multilingually known limited texts into new, low-resource languages may be feasible and may reduce human translation effort. We attempt to leverage translation resources from rich-resource languages to efficiently produce the best possible translation quality, in a new, low-resource language, for well-known texts that are available in multiple languages. To reach this goal, we argue that in translating a closed text into low-resource languages, generalization to out-of-domain texts is not necessary, but generalization to new languages is. Performance gains come from massive source parallelism through careful choice of close-by language families, from style-consistent corpus-level paraphrases within the same language, and from strategic adaptation of existing large pretrained multilingual models, first to the domain and then to the language. Such gains make it possible for machine translation systems to collaborate with human translators to expedite translation into new, low-resource languages.