Large language models (LLMs) have shown superior capabilities in translating figurative language compared to neural machine translation (NMT) systems. However, the impact of different prompting methods and LLM-NMT combinations on idiom translation has yet to be thoroughly investigated. This paper introduces two parallel datasets of sentences containing idiomatic expressions for Persian$\rightarrow$English and English$\rightarrow$Persian translations, with Persian idioms sampled from our PersianIdioms resource, a collection of 2,200 idioms and their meanings. Using these datasets, we evaluate various open- and closed-source LLMs, NMT models, and their combinations. Translation quality is assessed through idiom translation accuracy and fluency. We find that automatic evaluation methods like LLM-as-a-judge, BLEU, and BERTScore are effective for comparing different aspects of model performance. Our experiments reveal that Claude-3.5-Sonnet delivers outstanding results in both translation directions. For English$\rightarrow$Persian, combining weaker LLMs with Google Translate improves results, while Persian$\rightarrow$English translations benefit from single prompts for simpler models and complex prompts for advanced ones.