A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting

Recently, Large Language Models (LLMs) have demonstrated great potential in various data mining tasks, such as knowledge question answering, mathematical reasoning, and commonsense reasoning. However, the reasoning capability of LLMs on temporal event forecasting remains under-explored. To systematically investigate this capability, we conduct a comprehensive evaluation of LLM-based methods for temporal event forecasting. Due to the lack of a high-quality dataset that involves both graph and textual data, we first construct a benchmark dataset named MidEast-TE-mini. Based on this dataset, we design a series of baseline methods characterized by various input formats and retrieval-augmented generation (RAG) modules. From extensive experiments, we find that directly integrating raw texts into the input of LLMs does not enhance zero-shot extrapolation performance. In contrast, incorporating raw texts in specific complex events and fine-tuning LLMs significantly improves performance. Moreover, when enhanced with retrieval modules, LLMs can effectively capture temporal relational patterns hidden in historical events. However, issues such as popularity bias and the long-tail problem persist in LLMs, particularly in RAG-based methods. These findings not only deepen our understanding of LLM-based event forecasting methods but also highlight several promising research directions. We believe that this comprehensive evaluation, along with the identified research opportunities, will contribute significantly to future research on temporal event forecasting with LLMs.
