M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models

Handling long sequences has become an important and necessary capability for large language models (LLMs). However, how to evaluate the long-sequence capability of LLMs comprehensively and systematically remains an open question. One reason is that conventional, widely used benchmarks consist mainly of short sequences. In this paper, we propose M4LE, a Multi-ability, Multi-range, Multi-task, Multi-domain benchmark for Long-context Evaluation. M4LE is based on a diverse NLP task pool comprising 36 NLP datasets, 11 task types, and 12 domains. To alleviate the scarcity of tasks with naturally long sequences and to incorporate multiple-ability assessment, we propose an automatic approach (requiring only negligible human annotation) that converts short-sequence tasks into a unified long-sequence scenario in which LLMs have to identify single or multiple relevant spans in long contexts based on explicit or semantic hints. Specifically, the scenario covers five types of abilities: (1) explicit single-span; (2) semantic single-span; (3) explicit multiple-span; (4) semantic multiple-span; and (5) global context understanding. The resulting samples in M4LE are evenly distributed across input lengths from 1k to 8k. We conducted a systematic evaluation of 11 well-established LLMs, especially those optimized for long-sequence inputs. Our results reveal that: (1) current LLMs struggle to understand long contexts, particularly when tasks require multiple-span attention; (2) semantic retrieval tasks are more difficult for competent LLMs; and (3) models fine-tuned on longer text with position interpolation perform comparably to those using Neural Tangent Kernel (NTK)-aware scaling methods without fine-tuning. We make our benchmark publicly available to encourage future research in this challenging area.

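The conversion of short-sequence tasks into a long-sequence scenario can be illustrated with a minimal sketch of the explicit single-span and explicit multiple-span cases. This is not the authors' released pipeline: the function name `build_explicit_sample`, the instance fields (`passage`, `question`, `answer`), and the prompt wording are all assumptions made for illustration. The idea shown is only that several short instances are concatenated into one long context, and an explicit hint (here, a passage number) tells the model which span or spans are relevant.

```python
import random


def build_explicit_sample(pool, n_passages, n_targets=1, seed=0):
    """Concatenate short instances into one long-context sample.

    Hypothetical helper, not part of the M4LE codebase.
    pool: list of dicts with assumed keys 'passage', 'question', 'answer'.
    n_targets=1 gives an explicit single-span sample;
    n_targets>1 gives an explicit multiple-span sample.
    """
    rng = random.Random(seed)
    # Draw the short instances that will make up the long context.
    chosen = rng.sample(pool, n_passages)
    # Pick which passages (1-based) the model must attend to.
    target_ids = sorted(rng.sample(range(1, n_passages + 1), n_targets))

    context = "\n\n".join(
        f"Passage {i}: {inst['passage']}"
        for i, inst in enumerate(chosen, start=1)
    )
    # The explicit hint is the passage number attached to each question.
    questions = "\n".join(
        f"Question about Passage {i}: {chosen[i - 1]['question']}"
        for i in target_ids
    )
    return {
        "prompt": f"{context}\n\n{questions}\nAnswer:",
        "target_ids": target_ids,
        "answers": [chosen[i - 1]["answer"] for i in target_ids],
    }
```

Under the same assumptions, a semantic variant would replace the passage number in the hint with a description of the target passage, and the number of concatenated passages would be chosen so that the prompt falls into the desired length range between 1k and 8k.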