The rapid development of Large Language Models (LLMs) has led to great strides in model capabilities like reasoning and long-context understanding. However, as LLMs become able to process longer contexts, it becomes more challenging to evaluate whether they have acquired certain capabilities, since the length of text they can process (e.g., 100K tokens) far exceeds what humans can reliably assess in a reasonable amount of time. In this paper, we propose using complex synthetic tasks as a proxy evaluation method, and present S3Eval, a Synthetic, Scalable, Systematic evaluation suite for LLMs. As a synthetic benchmark, S3Eval enables the creation of any number of evaluation examples that are theoretically invisible to LLMs, mitigating the test set contamination issue. The synthetic nature of S3Eval gives users full control over the dataset, allowing them to systematically probe LLM capabilities by scaling text length and varying task difficulty across diverse scenarios. The strong correlation between S3Eval performance and scores on real-world benchmarks like Big-Bench Hard (BBH) demonstrates the soundness of using S3Eval to evaluate LLMs. Our in-depth analysis also uncovers additional insights, including a performance drop when the answer is sparsely distributed or located in the middle of the context, as well as some counter-intuitive trends in model performance.