Large Language Models (LLMs) have demonstrated proficiency in a wide array of natural language processing tasks. However, their effectiveness on discourse-level event relation extraction (ERE) tasks remains unexplored. In this paper, we assess the effectiveness of LLMs in addressing discourse-level ERE tasks characterized by lengthy documents and intricate relations encompassing coreference, temporal, causal, and subevent types. Evaluation is conducted using a commercial model, GPT-3.5, and an open-source model, LLaMA-2. Our study reveals a notable underperformance of LLMs compared to the baseline established through supervised learning. Although Supervised Fine-Tuning (SFT) can improve LLMs' performance, it does not scale well compared to the smaller supervised baseline model. Our quantitative and qualitative analyses show that LLMs have several weaknesses when applied to extracting event relations, including a tendency to fabricate event mentions, and failures to capture transitivity rules among relations, detect long-distance relations, or comprehend contexts with dense event mentions.