The rapid evolution of large language models (LLMs) has created a need for comprehensive assessment of their performance across multiple dimensions. In this paper, we propose LFED, a Literary Fiction Evaluation Dataset, which aims to evaluate the capability of LLMs in long-fiction comprehension and reasoning. We collect 95 works of literary fiction that were either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries. We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions. Additionally, we conduct an in-depth analysis of how specific attributes of these works (e.g., novel type, number of characters, year of publication) affect LLM performance. Through a series of experiments with various state-of-the-art LLMs, we demonstrate that these models face considerable challenges in answering questions about literary fiction, with ChatGPT reaching only 57.08% accuracy under the zero-shot setting. The dataset will be publicly available at https://github.com/tjunlp-lab/LFED.git.