Generative Large Language Models (LLMs) have achieved remarkable advancements in various NLP tasks. In this work, we explore the multilingual capabilities of LLMs, using machine translation between English and 22 Indian languages as the task. We first investigate the translation capabilities of raw LLMs, then explore the in-context learning capabilities of the same raw models. We also fine-tune these models, both with parameter-efficient fine-tuning methods such as LoRA and with full fine-tuning. Through this study, we identify the best-performing LLM for the translation task, which is based on LLaMA. Our results demonstrate significant progress: using 2-stage fine-tuned LLaMA-13b for English to Indian languages, we obtain average BLEU scores of 13.42, 15.93, 12.13, 12.30, and 12.07, as well as chrF scores of 43.98, 46.99, 42.55, 42.42, and 45.39, on the IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest, and newstest2019 test sets, respectively. Similarly, for Indian languages to English, fine-tuned LLaMA-13b achieves average BLEU scores of 14.03, 16.65, 16.17, 15.35, and 12.55, along with chrF scores of 36.71, 40.44, 40.26, 39.51, and 36.20, on the same five test sets, respectively. Overall, our findings highlight the potential and strength of LLMs for machine translation, including for languages that are currently underrepresented in LLMs.