While recent large language models (LLMs) demonstrate remarkable abilities in responding to queries in diverse languages, their capacity to handle long multilingual contexts remains unexplored. A systematic evaluation of the long-context capabilities of LLMs in multilingual settings is therefore crucial, particularly for information retrieval. To address this gap, we introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test, designed to assess a model's ability to retrieve relevant information (the needle) from a collection of multilingual distractor texts (the haystack). The test extends the multilingual question-answering task, encompassing both monolingual and cross-lingual retrieval. We evaluate four state-of-the-art LLMs on MLNeedle. Our findings reveal that model performance varies significantly with language and needle position: performance is lowest when the needle is (i) in a language outside the English language family and (ii) located in the middle of the input context. Furthermore, although some models claim a context size of $8k$ tokens or greater, none demonstrates satisfactory cross-lingual retrieval performance as the context length grows. Our analysis provides key insights into the long-context behavior of LLMs in multilingual settings to guide future evaluation protocols. To our knowledge, this is the first study to investigate the multilingual long-context behavior of LLMs.
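The needle-in-a-haystack setup described above can be sketched as a simple prompt builder: multilingual distractor passages form the haystack, and the answer-bearing passage (the needle) is inserted at a chosen relative depth. This is a minimal illustration under assumed names (`build_haystack`, `distractors`, `needle`, `depth` are all hypothetical), not the authors' implementation.

```python
def build_haystack(distractors, needle, depth):
    """Insert `needle` among `distractors` at relative `depth` in [0, 1].

    depth=0.0 places the needle at the start of the context,
    depth=1.0 at the end, and depth=0.5 in the middle -- the
    position the abstract reports as hardest for the models.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    pos = round(depth * len(distractors))
    docs = distractors[:pos] + [needle] + distractors[pos:]
    return "\n\n".join(docs)


# Example: needle in English, distractors in other languages
# (one cross-lingual retrieval configuration).
distractors = [
    "Le chat dort sur le canapé.",      # French
    "Der Zug kommt um acht Uhr an.",    # German
    "雨が降っています。",                  # Japanese
]
needle = "The capital of Australia is Canberra."
prompt = build_haystack(distractors, needle, depth=0.5)
```

Sweeping `depth` over a grid and growing `distractors` until the prompt reaches a target token budget yields the position-by-length evaluation matrix the test is built around.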