While recent large language models (LLMs) demonstrate remarkable abilities in responding to queries in diverse languages, their capacity to handle long multilingual contexts remains unexplored. A systematic evaluation of the long-context capabilities of LLMs in multilingual settings is therefore crucial, particularly for information retrieval. To address this gap, we introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test, designed to assess a model's ability to retrieve relevant information (the needle) from a collection of multilingual distractor texts (the haystack). The test extends the multilingual question-answering task, encompassing both monolingual and cross-lingual retrieval. We evaluate four state-of-the-art LLMs on MLNeedle. Our findings reveal that model performance varies significantly with language and needle position: performance is lowest when the needle is (i) in a language outside the English language family and (ii) located in the middle of the input context. Furthermore, although some models claim a context size of $8k$ tokens or greater, none demonstrates satisfactory cross-lingual retrieval performance as the context length grows. Our analysis provides key insights into the long-context behavior of LLMs in multilingual settings to guide future evaluation protocols. To our knowledge, this is the first study to investigate the multilingual long-context behavior of LLMs.
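The needle-in-a-haystack setup described above can be sketched as a simple prompt builder: multilingual distractor passages form the haystack, and the answer-bearing passage (the needle) is inserted at a chosen relative depth. This is a minimal illustration under assumed names (`build_haystack`, `distractors`, `needle`, `depth` are all hypothetical), not the authors' implementation.

```python
def build_haystack(distractors, needle, depth):
    """Insert `needle` among `distractors` at relative `depth` in [0, 1].

    depth=0.0 places the needle at the start of the context,
    depth=1.0 at the end, and depth=0.5 in the middle -- the
    position the abstract reports as hardest for the models.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    pos = round(depth * len(distractors))
    docs = distractors[:pos] + [needle] + distractors[pos:]
    return "\n\n".join(docs)


# Example: needle in English, distractors in other languages
# (one cross-lingual retrieval configuration).
distractors = [
    "Le chat dort sur le canapé.",      # French
    "Der Zug kommt um acht Uhr an.",    # German
    "雨が降っています。",                  # Japanese
]
needle = "The capital of Australia is Canberra."
prompt = build_haystack(distractors, needle, depth=0.5)
```

Sweeping `depth` over a grid and growing `distractors` until the prompt reaches a target token budget yields the position-by-length evaluation matrix the test is built around.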