Event reasoning is a fundamental ability that underlies many applications. It requires event schema knowledge to perform global reasoning and must handle diverse inter-event relations and reasoning paradigms. How well LLMs perform event reasoning across these relations and paradigms remains unknown. To fill this gap, we comprehensively evaluate the event reasoning abilities of LLMs. We introduce EV2, a novel benchmark for EValuation of EVent reasoning. EV2 evaluates at two levels, schema and instance, and is comprehensive in its coverage of relations and reasoning paradigms. We conduct extensive experiments on EV2 and find that LLMs are able to perform event reasoning, but their performance is far from satisfactory. We also observe that event reasoning abilities are imbalanced in LLMs. In addition, LLMs possess event schema knowledge, but they are not aligned with humans in how they use it. Based on these findings, we guide LLMs to use event schema knowledge as memory, which improves their event reasoning.