To fully evaluate the overall performance of different NLP models in a given domain, many evaluation benchmarks have been proposed, such as GLUE, SuperGLUE, and CLUE. Natural language understanding research has traditionally focused on benchmarks for tasks in Chinese, English, and multilingual settings; however, little attention has been given to classical Chinese, also known as "wen yan wen", which has a rich history spanning thousands of years and holds significant cultural and academic value. For the prosperity of the NLP community, in this paper we introduce the WYWEB evaluation benchmark, which consists of nine NLP tasks in classical Chinese, covering sentence classification, sequence labeling, reading comprehension, and machine translation. We evaluate existing pre-trained language models, all of which struggle with this benchmark. We also introduce a number of supplementary datasets and additional tools to help facilitate further progress on classical Chinese NLU. The GitHub repository is https://github.com/baudzhou/WYWEB.