Generative large language models (LLMs) have demonstrated exceptional proficiency in a range of natural language processing (NLP) tasks, including machine translation, question answering, text summarization, and natural language understanding. To further improve LLM performance in machine translation, we investigated two popular prompting methods and their combination across language pairs drawn from Persian, English, and Russian. We employed n-shot feeding and tailored prompting frameworks. Our findings indicate that multilingual LLMs such as PaLM produce human-like translations and allow fine-grained control over desired translation nuances in accordance with style guidelines and linguistic considerations. These models also excel at processing and applying prompts. However, the choice of language model, the machine translation task, and the specific source and target languages all require consideration when adopting prompting frameworks and using n-shot in-context learning. Furthermore, we identified errors and limitations inherent in popular LLMs as machine translation tools and categorized them according to various linguistic metrics. This typology of errors provides valuable insights for using LLMs effectively and suggests methods for designing prompts for in-context learning. Our report aims to contribute to the advancement of machine translation with LLMs by improving both the accuracy and reliability of evaluation metrics.