GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

Recent advances in large language models (LLMs) have pushed forward multilingual speech and machine translation by reducing representation errors and incorporating external knowledge. However, both translation tasks typically rely on beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them suboptimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely "GenTranslate", which builds upon LLMs to generate better results from the diverse translation versions in the N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the information in the N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release the HypoTranslate dataset, which contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that GenTranslate significantly outperforms the state-of-the-art model.
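
To make the contrast between top-1 selection and N-best integration concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each hypothesis arrives as a (text, score) pair where a higher score is better; the function names `select_top1` and `build_nbest_prompt`, the prompt wording, and the example sentences are all hypothetical and are not taken from the paper or from the HypoTranslate dataset. The sketch only assembles the text that an LLM would receive; calling a model is left out.

```python
from typing import List, Tuple

# Assumption: a hypothesis is (translation text, decoder score), higher is better.
Hypothesis = Tuple[str, float]


def select_top1(nbest: List[Hypothesis]) -> str:
    """Conventional inference: keep only the highest-scoring hypothesis."""
    return max(nbest, key=lambda h: h[1])[0]


def build_nbest_prompt(nbest: List[Hypothesis], target_language: str, n: int = 5) -> str:
    """Assemble an illustrative prompt that exposes the top-n hypotheses to an LLM.

    The wording below is a placeholder, not the prompt used in the paper.
    """
    ranked = sorted(nbest, key=lambda h: h[1], reverse=True)[:n]
    numbered = [f"{i}. {text}" for i, (text, _) in enumerate(ranked, start=1)]
    header = (
        f"The following are the {len(ranked)} best candidate translations "
        f"into {target_language}, ranked by decoder score:"
    )
    footer = "Using the information in all candidates, write one improved translation:"
    return "\n".join([header, *numbered, footer])


# Hypothetical N-best list, for illustration only.
example_nbest: List[Hypothesis] = [
    ("The cat sit on the mat.", -0.9),
    ("The cat sat on the mat.", -1.1),
    ("A cat is sitting on a mat.", -1.6),
]

if __name__ == "__main__":
    print(select_top1(example_nbest))
    print(build_nbest_prompt(example_nbest, "English", n=3))
```

In this toy list the top-scoring hypothesis contains an error that a lower-ranked candidate gets right, which is the kind of information top-1 selection discards and an N-best prompt keeps available to the model.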
