Human-in-the-loop Machine Translation with Large Language Model

Large language models (LLMs) have garnered significant attention due to their in-context learning mechanisms and emergent capabilities. The research community has conducted several pilot studies that apply LLMs to machine translation and evaluate their performance from diverse perspectives. However, previous research has focused primarily on the LLM itself and has not explored human intervention in the LLM inference process. Characteristics of LLMs such as in-context learning and prompt engineering closely mirror human cognitive abilities in language tasks, offering an intuitive route to human-in-the-loop generation. In this study, we propose a human-in-the-loop pipeline that guides LLMs to produce customized outputs with revision instructions. The pipeline begins by prompting the LLM to produce a draft translation, then uses automatic retrieval or human feedback as supervision signals to improve the LLM's translation through in-context learning. The human-machine interactions generated in this pipeline are also stored in an external database to expand the in-context retrieval database, enabling us to leverage human supervision in an offline setting. We evaluate the proposed pipeline using the GPT-3.5-turbo API on five domain-specific benchmarks for German-English translation. The results demonstrate the effectiveness of the pipeline in tailoring in-domain translations and improving translation performance compared to direct translation. Additionally, we discuss the results from the following perspectives: 1) the effectiveness of different in-context retrieval methods; 2) the construction of a retrieval database under low-resource scenarios; 3) the observed domain differences; 4) the quantitative analysis of linguistic statistics; and 5) the qualitative analysis of translation cases. The code and data are available at https://github.com/NLP2CT/HIL-MT/.
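The draft, revise, and store loop described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation in the released repository: the names `Feedback`, `FeedbackStore`, `build_revision_prompt`, and `translate` are hypothetical, the prompt wording is invented, the LLM is passed in as a plain callable rather than a real API client, and string similarity from the standard library's `difflib` stands in for whichever in-context retrieval method is actually used.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Callable, List, Optional


@dataclass
class Feedback:
    """One stored human-machine interaction (hypothetical record layout)."""
    source: str
    draft: str
    instruction: str
    revised: str


@dataclass
class FeedbackStore:
    """In-memory stand-in for the external retrieval database."""
    records: List[Feedback] = field(default_factory=list)

    def add(self, record: Feedback) -> None:
        self.records.append(record)

    def retrieve(self, source: str, k: int = 2) -> List[Feedback]:
        # Assumption: surface string similarity as a simple retrieval proxy.
        ranked = sorted(
            self.records,
            key=lambda r: SequenceMatcher(None, source, r.source).ratio(),
            reverse=True,
        )
        return ranked[:k]


def build_revision_prompt(source: str, draft: str,
                          examples: List[Feedback],
                          instruction: Optional[str] = None) -> str:
    """Assemble an in-context prompt from past revisions (wording is illustrative)."""
    lines = ["Revise the draft English translation of the German source."]
    for ex in examples:
        lines += [f"Source: {ex.source}", f"Draft: {ex.draft}",
                  f"Instruction: {ex.instruction}", f"Revised: {ex.revised}", ""]
    lines += [f"Source: {source}", f"Draft: {draft}"]
    if instruction is not None:
        lines.append(f"Instruction: {instruction}")
    lines.append("Revised:")
    return "\n".join(lines)


def translate(source: str, llm: Callable[[str], str], store: FeedbackStore,
              instruction: Optional[str] = None) -> str:
    # Step 1: ask the LLM for a draft translation.
    draft = llm(f"Translate the following German text into English:\n{source}")
    # Step 2: gather supervision signals from retrieved past interactions.
    examples = store.retrieve(source)
    if instruction is None and not examples:
        return draft  # nothing to revise with, so fall back to direct translation
    # Step 3: revise the draft through in-context learning.
    revised = llm(build_revision_prompt(source, draft, examples, instruction))
    # Step 4: store new human feedback so it can be reused offline.
    if instruction is not None:
        store.add(Feedback(source, draft, instruction, revised))
    return revised
```

In this sketch, a call with an `instruction` corresponds to the online setting with a human in the loop, while a call without one relies only on previously stored interactions, which mirrors the offline setting mentioned in the abstract.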
