Recent large language models (LLMs) demonstrate impressive capabilities in handling long contexts, with some exhibiting near-perfect recall on synthetic retrieval tasks. However, these evaluations have mainly focused on English text and involved a single target sentence within lengthy contexts. Our work investigates how LLM performance generalizes to multilingual settings with multiple hidden target sentences. We comprehensively evaluate several long-context LLMs on retrieval and reasoning tasks across five languages: English, Vietnamese, Indonesian, Swahili, and Somali. These languages share the Latin script but belong to distinct language families and resource levels. Our analysis reveals a significant performance gap between languages. The best-performing models, such as Gemini-1.5 and GPT-4o, range from around 96% accuracy in English to around 36% in Somali with a single target sentence. However, this accuracy drops to 40% in English and 0% in Somali with three target sentences. Our findings highlight the challenges long-context LLMs face when processing longer contexts, more target sentences, or lower-resource languages.
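To make the evaluation setup concrete, the following is a minimal sketch of the multi-needle retrieval task described above, assuming a simple exact-containment scoring scheme. The function names (`build_haystack`, `score_retrieval`, `ask_model`) and the toy Vietnamese filler data are hypothetical illustrations, not the paper's released code.

```python
# A minimal sketch of a multi-needle retrieval evaluation,
# not the authors' implementation. All names here are hypothetical.
import random

def build_haystack(filler_sentences, needles, seed=0):
    """Insert target sentences ("needles") at random positions in a long
    distractor context and return the assembled context string."""
    rng = random.Random(seed)
    context = list(filler_sentences)
    for needle in needles:
        context.insert(rng.randrange(len(context) + 1), needle)
    return " ".join(context)

def score_retrieval(model_answer, needles):
    """Exact-containment accuracy: the fraction of hidden target
    sentences that the model reproduced in its answer."""
    found = sum(needle in model_answer for needle in needles)
    return found / len(needles)

# Example: a Vietnamese-style run with three needles (toy data).
filler = ["Câu gây nhiễu."] * 10_000  # long distractor context
needles = [
    "Mật khẩu thứ nhất là 417.",
    "Mật khẩu thứ hai là 582.",
    "Mật khẩu thứ ba là 903.",
]
haystack = build_haystack(filler, needles)
prompt = haystack + "\n\nLiệt kê tất cả các câu chứa mật khẩu."
# answer = ask_model(prompt)             # call your LLM API here
# print(score_retrieval(answer, needles))
```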