Improving LLM Abilities in Idiomatic Translation

Translating idioms remains a challenge for large language models (LLMs) such as NLLB and GPT. Our goal is to improve translation fidelity by improving how LLMs process idiomatic language while preserving the original linguistic style. This work has significant social impact: it preserves cultural nuances and ensures that translated texts retain their intent and emotional resonance, fostering better cross-cultural communication. Previous work has used knowledge bases such as IdiomKB to supply the LLM with the meaning of an idiom for use in translation. Although this approach yields better results than direct translation, it remains limited in its ability to preserve idiomatic writing style across languages. In this research, we expand the knowledge base to find corresponding idioms in the target language. We perform translations using two methods. The first uses a SentenceTransformers model to compute cosine similarity scores between the meanings of the source-language idiom and candidate target-language idioms, and selects the highest-scoring idiom (Cosine Similarity Lookup method). The second uses an LLM to find a corresponding idiom in the target language for use in the translation (LLM-generated idiom method). As a baseline, we perform a direct translation without providing any additional information. Human evaluations of English -> Chinese and Chinese -> English translations show that the Cosine Similarity Lookup method outperformed the others in all GPT4o translations. To build further upon IdiomKB, we developed a dataset for Urdu, a low-resource language, containing Urdu idioms and their translations. Despite dataset limitations, the Cosine Similarity Lookup method shows promise, potentially overcoming language barriers and enabling the exploration of diverse literary works in Chinese and Urdu.

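The lookup step of the Cosine Similarity Lookup method can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a SentenceTransformers encoder, the function names are hypothetical, and the three idiom entries and their glosses are invented for illustration rather than taken from IdiomKB. Only the selection logic (score every candidate meaning against the source meaning, keep the highest) mirrors the description above.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector.

    Assumption: in practice this would be replaced by dense embeddings
    produced by a SentenceTransformers model.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_target_idiom(source_meaning, target_idioms):
    """Return (idiom, score) for the target idiom whose meaning is closest."""
    source_vector = embed(source_meaning)
    scored = [
        (idiom, cosine_similarity(source_vector, embed(meaning)))
        for idiom, meaning in target_idioms.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical knowledge-base entries: target-language idiom -> English gloss.
target_idioms = {
    "画蛇添足": "to ruin something by adding what is unnecessary",
    "雪上加霜": "to make a bad situation even worse",
    "一石二鸟": "to achieve two goals with one action",
}

# Meaning of the source idiom "kill two birds with one stone" (illustrative).
source_meaning = "to achieve two goals with a single action"

idiom, score = best_target_idiom(source_meaning, target_idioms)
print(idiom, round(score, 3))
```

With a real sentence encoder the scores would reflect semantic rather than lexical overlap, so paraphrased meanings with few shared words could still match. The selected idiom would then be passed to the translation model alongside the source sentence.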