We introduce ZeroSCROLLS, a zero-shot benchmark for natural language understanding over long texts, which contains only test and small validation sets, without training data. We adapt six tasks from the SCROLLS benchmark, and add four new datasets, including two novel information fusing tasks, such as aggregating the percentage of positive reviews. Using ZeroSCROLLS, we conduct a comprehensive evaluation of both open-source and closed large language models, finding that Claude outperforms ChatGPT, and that GPT-4 achieves the highest average score. However, there is still room for improvement on multiple open challenges in ZeroSCROLLS, such as aggregation tasks, where models struggle to pass the naive baseline. As the state of the art is a moving target, we invite researchers to evaluate their ideas on the live ZeroSCROLLS leaderboard.