Predicting future events is one of the ultimate aspirations of artificial intelligence. Recent advances in large language model (LLM)-based systems have shown remarkable potential in forecasting future events, garnering significant interest in the research community. Several benchmarks have been established to evaluate forecasting capabilities by formalizing event prediction as a retrieval-augmented generation (RAG) and reasoning task. In these benchmarks, each prediction question is answered using relevant retrieved news articles. However, because these benchmarks do not consider whether a question can be supported by valid or sufficient rationales, some of their questions may be inherently noninferable. To address this issue, we introduce a new benchmark, PROPHET, which comprises inferable forecasting questions paired with relevant news for retrieval. To ensure the inferability of the benchmark, we propose Causal Intervened Likelihood (CIL), a statistical measure that assesses inferability through causal inference. In constructing the benchmark, we first collected recent trend forecasting questions and then filtered the data using CIL, resulting in an inferable benchmark for event prediction. Through extensive experiments, we first demonstrate the validity of CIL and use it to conduct in-depth investigations into event prediction. We then evaluate several representative prediction systems on PROPHET, drawing valuable insights for future directions.