Low-resource languages pose a challenge for machine translation with large language models (LLMs), which require large amounts of training data. One potential way to circumvent this data dependence is to rely on LLMs' ability to use in-context descriptions of languages, like textbooks and dictionaries. To do so, LLMs must be able to infer the link between the languages' grammatical descriptions and the sentences in question. Here we isolate this skill using a formal analogue of the task: string transduction based on a formal grammar provided in-context. We construct synchronous context-free grammars which define pairs of formal languages designed to model particular aspects of natural language grammar, morphology, and written representation. Using these grammars, we measure how well LLMs can translate sentences from one formal language into another when given both the grammar and the source-language sentence. We vary the size of the grammar, the lengths of the sentences, the syntactic and morphological properties of the languages, and their written script. We note three key findings. First, LLMs' translation accuracy decreases markedly as a function of grammar size and sentence length. Second, differences in morphology and written representation between the source and target languages can strongly diminish model performance. Third, an analysis of the errors committed by models shows that they are most prone to recalling the wrong words from the target-language vocabulary, hallucinating new words, or leaving source-language words untranslated.
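To make the setup concrete, the following is a minimal sketch of a synchronous context-free grammar that generates paired source and target sentences, together with a rough tally of the three error types named above. Everything in it is an illustrative assumption rather than the paper's actual grammars or evaluation code: the rules, the invented vocabulary, and the helper names `derive`, `exact_match`, and `categorize_errors` are all hypothetical.

```python
import random
from collections import Counter

# Toy synchronous CFG (hypothetical; not one of the paper's grammars).
# Each nonterminal maps to a list of (source_rhs, target_rhs) alternatives.
# Symbols that are keys of RULES are nonterminals; all others are terminals.
# Within a rule, a nonterminal appears at most once per side, and the two
# occurrences are linked: they expand to the same subtree.
RULES = {
    "S": [(["NP", "VP"], ["NP", "VP"])],
    "NP": [(["the", "N"], ["N", "ka"])],      # target marks nouns with a suffix word
    "VP": [(["V", "NP"], ["NP", "V"]),        # verb-object source, object-verb target
           (["V"], ["V"])],
    "N": [(["dog"], ["blik"]), (["cat"], ["dax"])],
    "V": [(["sees"], ["wug"]), (["sleeps"], ["zup"])],
}

SRC_VOCAB = {t for alts in RULES.values() for s, _ in alts for t in s if t not in RULES}
TGT_VOCAB = {t for alts in RULES.values() for _, g in alts for t in g if t not in RULES}


def derive(symbol, rng):
    """Sample one (source_tokens, target_tokens) pair derived from `symbol`."""
    src_rhs, tgt_rhs = rng.choice(RULES[symbol])
    # Expand each linked nonterminal once so both sides share the same subtree.
    children = {s: derive(s, rng) for s in src_rhs if s in RULES}
    src, tgt = [], []
    for s in src_rhs:
        src.extend(children[s][0] if s in RULES else [s])
    for s in tgt_rhs:
        tgt.extend(children[s][1] if s in RULES else [s])
    return src, tgt


def exact_match(hypothesis, reference):
    """Score a model output as correct only if it equals the reference exactly."""
    return hypothesis == reference


def categorize_errors(hypothesis, reference):
    """Roughly bucket output tokens that do not occur in the reference.

    This token-level heuristic is an assumption made for illustration only.
    """
    counts = Counter()
    ref_tokens = set(reference)
    for tok in hypothesis:
        if tok in ref_tokens:
            continue
        if tok in SRC_VOCAB:
            counts["untranslated"] += 1   # source-language word left in the output
        elif tok in TGT_VOCAB:
            counts["wrong_word"] += 1     # real target word, but not the right one
        else:
            counts["hallucinated"] += 1   # word absent from both vocabularies
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    source, target = derive("S", rng)
    print(" ".join(source), "->", " ".join(target))
```

In an experiment of the kind described, the grammar and the sampled source sentence would be placed in the prompt, and the model's output would be compared against the sampled target sentence. Growing `RULES` or adding recursive rules would be one way to vary grammar size and sentence length in this sketch.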