We introduce ZeroSCROLLS, a zero-shot benchmark for natural language understanding over long texts, which contains only test and small validation sets, without training data. We adapt six tasks from the SCROLLS benchmark and add four new datasets, including two novel information-fusing tasks, such as aggregating the percentage of positive reviews. Using ZeroSCROLLS, we conduct a comprehensive evaluation of both open-source and closed large language models, finding that Claude outperforms ChatGPT and that GPT-4 achieves the highest average score. However, there is still room for improvement on multiple open challenges in ZeroSCROLLS, such as aggregation tasks, where models struggle to pass the naive baseline. As the state of the art is a moving target, we invite researchers to evaluate their ideas on the live ZeroSCROLLS leaderboard.