Deep text understanding, which requires connecting a given document with prior knowledge beyond its text, has been highlighted by many benchmarks in recent years. However, these benchmarks have two major limitations. On the one hand, most of them require human annotation of knowledge, which limits knowledge coverage. On the other hand, they usually use choices or spans in the text as answers, which results in a narrow answer space. To overcome these limitations, we build a new challenging benchmark named KoRC in this paper. Compared with previous benchmarks, KoRC has two advantages: broad knowledge coverage and a flexible answer format. Specifically, we utilize massive knowledge bases to guide annotators or large language models (LLMs) to construct knowledgeable questions. Moreover, we use labels in knowledge bases rather than spans or choices as the final answers. We test state-of-the-art models on KoRC, and the experimental results show that the strongest baseline achieves only 68.3% and 30.0% F1 on the in-distribution and out-of-distribution test sets, respectively. These results indicate that deep text understanding is still an unsolved challenge. The benchmark dataset, leaderboard, and baseline methods are released at https://github.com/THU-KEG/KoRC.