Recently, there has been growing interest in extending the context length of large language models (LLMs) so that they can effectively process long single-turn inputs or conversations with more extensive histories. While proprietary models such as GPT-4 and Claude can largely preserve their reasoning ability in an extended context, open-source models are still in the early stages of development. To bridge this gap, we propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs), addressing two key aspects: dataset construction and evaluation metrics. On the one hand, we build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs encompassing diverse question styles, domains, and input lengths (3k$\sim$200k tokens). On the other hand, we investigate the effectiveness of evaluation metrics for LCLMs. Results show that popular n-gram matching metrics generally do not correlate well with human judgment, and thus we strongly advocate for length-instruction-enhanced (LIE) evaluation and employing LLM judges. We conduct a comprehensive study of 4 popular commercial LLMs and 12 open-source counterparts using the L-Eval benchmark. Our empirical findings offer useful insights into the study of LCLMs and lay the groundwork for the development of more principled evaluation of these models.
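To illustrate why n-gram matching can diverge from human judgment, the following minimal sketch computes a unigram-overlap F1 score and shows that a correct but verbose answer is penalized relative to a concise one. It is an illustration under stated assumptions, not the L-Eval implementation: the function names `token_f1` and `add_length_instruction`, the whitespace tokenization, the example strings, and the wording of the length instruction are all hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 on lowercased, whitespace-split tokens (assumed tokenization)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def add_length_instruction(question: str, reference: str) -> str:
    """Append a target length taken from the reference answer.

    The instruction wording is hypothetical; it only illustrates the idea of
    telling the model how long the expected answer is.
    """
    n_words = len(reference.split())
    return f"{question} Answer in about {n_words} words."


reference = "the treaty was signed in 1848"
concise = "the treaty was signed in 1848"
verbose = (
    "based on the document provided, it appears that the treaty "
    "was signed in 1848 after long negotiations"
)

# Both answers are factually correct, but the verbose one loses precision.
print(token_f1(concise, reference))
print(token_f1(verbose, reference))
print(add_length_instruction("When was the treaty signed?", reference))
```

Both answers contain the correct fact, yet the verbose one scores far lower because the extra tokens reduce precision. Asking the model for an answer of roughly the reference length is one way to reduce this length-driven gap before an overlap metric is applied.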