In the quest to overcome language barriers, encoder-decoder models like NLLB have expanded machine translation to rare languages, with some models (e.g., NLLB 1.3B) trainable on a single GPU. While general-purpose LLMs perform well in translation, open LLMs prove highly competitive when fine-tuned for specific tasks involving unknown corpora. We introduce LYRA (Language verY Rare for All), a novel approach that combines open-LLM fine-tuning, retrieval-augmented generation (RAG), and transfer learning from related high-resource languages. We restrict all training to a single GPU to facilitate ease of adoption. Our study focuses on two-way translation between French and Monégasque, a rare language unsupported by existing translation tools due to limited corpus availability. Our results demonstrate LYRA's effectiveness: it consistently matches, and frequently surpasses, state-of-the-art encoder-decoder models in rare-language translation.