As Large Language Models (LLMs) continue to advance in natural language processing (NLP), their ability to follow instructions reliably in long-context inputs has become crucial for real-world applications. While existing benchmarks assess various LLM capabilities, they rarely focus on instruction-following in long-context scenarios or on stability across varied inputs. In response, we introduce the Long-context Instruction-Following Benchmark (LIFBench), a scalable dataset designed to evaluate LLMs' instruction-following capability and stability over long contexts. LIFBench comprises three long-context scenarios and eleven diverse tasks, supported by 2,766 instructions generated through an automated expansion method along three dimensions: length, expression, and variables. For evaluation, we propose LIFEval, a rubric-based assessment framework that provides precise, automated scoring of complex LLM responses without relying on LLM-assisted evaluation or human judgment. This approach enables a comprehensive analysis of model performance and stability from multiple perspectives. We conduct extensive experiments on 20 notable LLMs across six length intervals, analyzing their instruction-following capability and stability. Our work contributes LIFBench and LIFEval as robust tools for assessing LLM performance in complex, long-context settings, offering insights that can inform future LLM development.