Populating Commonsense Knowledge Bases (CSKBs) is an important yet hard task in NLP, as it must handle knowledge from external sources that contains unseen events and entities. Fang et al. (2021a) proposed a CSKB Population benchmark with an evaluation set, CKBP v1. However, CKBP v1 relies on crowdsourced annotations, a substantial fraction of which are incorrect, and, because it was built by random sampling, the evaluation set is not well aligned with the external knowledge source. In this paper, we introduce CKBP v2, a new high-quality CSKB Population benchmark that addresses both problems by using expert annotation instead of crowdsourced annotation and by adding diversified adversarial samples to make the evaluation set more representative. We conduct extensive experiments comparing state-of-the-art CSKB Population methods on the new evaluation set, providing baselines for future research. Empirical results show that the population task remains challenging, even for large language models (LLMs) such as ChatGPT. Code and data are available at https://github.com/HKUST-KnowComp/CSKB-Population.