This study addresses the challenge of extending Large Language Models (LLMs) to non-English languages, specifically those written in non-Latin scripts. We propose an approach that uses the romanized form of text as an interface for LLMs, hypothesizing that its frequent informal use and its tokens shared with English enhance cross-lingual alignment. Focusing on Hindi, we demonstrate through Hindi-to-English translation and sentiment analysis tasks that romanized text not only significantly improves inference efficiency, owing to its lower fertility (average number of subword tokens per word) than native text, but also achieves competitive performance with limited pre-training. Additionally, our novel multi-script prompting approach, which combines romanized and native text, shows promise in further improving task performance. These findings suggest the potential of romanization for bridging the language gap in LLM applications; future work will extend this approach to more languages and tasks.
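Fertility here denotes the average number of subword tokens a tokenizer produces per word; lower fertility means shorter input and output sequences, and therefore cheaper inference. A minimal sketch of how fertility might be measured, using a hypothetical byte-pair stand-in (`toy_tokenize`) for an LLM's subword tokenizer (not the tokenizer used in the study):

```python
def fertility(tokenize, text):
    """Average number of subword tokens per whitespace-delimited word."""
    words = text.split()
    tokens = tokenize(text)
    return len(tokens) / len(words)

def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one token per pair of UTF-8 bytes.

    Devanagari characters need 3 UTF-8 bytes each, while Latin letters need
    1, so native-script text yields far more tokens per word than its
    romanization, mirroring the fertility gap described in the abstract.
    """
    data = text.encode("utf-8")
    return [data[i:i + 2] for i in range(0, len(data), 2)]

native = "नमस्ते दुनिया"     # Devanagari script ("hello world")
roman = "namaste duniya"     # romanized form, sharing Latin bytes with English

print(fertility(toy_tokenize, native))  # higher: more tokens per word
print(fertility(toy_tokenize, roman))   # lower: cheaper to process
```

Real LLM tokenizers are trained mostly on Latin-script text, so the same effect appears there for a different reason: romanized Hindi reuses frequent English subwords, while Devanagari falls back to rare byte-level pieces.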