Large language models (LLMs) can perform impressive feats with in-context learning or lightweight finetuning. It is natural to wonder how well these models adapt to genuinely new tasks, but how does one find tasks that are unseen in internet-scale training sets? We turn to a field that is explicitly motivated and bottlenecked by a scarcity of web data: low-resource languages. In this paper, we introduce MTOB (Machine Translation from One Book), a benchmark for learning to translate between English and Kalamang -- a language with fewer than 200 speakers and therefore virtually no presence on the web -- using several hundred pages of field linguistics reference materials. This task framing is novel in that it asks a model to learn a language from a single human-readable book of grammar explanations rather than from a large mined corpus of in-domain data, which makes it more akin to L2 learning than to L1 acquisition. We demonstrate that baselines using current LLMs are promising but fall short of human performance: they achieve 44.7 chrF on Kalamang-to-English translation and 45.8 chrF on English-to-Kalamang translation, compared to 51.6 and 57.0 chrF, respectively, for a human who learned Kalamang from the same reference materials. We hope that MTOB will help measure LLM capabilities along a new dimension, and that the methods developed to solve it could help expand access to language technology for underserved communities by leveraging qualitatively different kinds of data than traditional machine translation uses.