Predicting future events based on news on the Web stands as one of the ultimate aspirations of artificial intelligence. Recent advances in large language model (LLM)-based systems have shown remarkable potential in forecasting future events, garnering significant interest in the research community. Several benchmarks have been established to evaluate forecasting capabilities by formalizing event prediction as a retrieval-augmented generation (RAG)-and-reasoning task. In these benchmarks, each prediction question is answered using relevant news articles retrieved from the Web. However, because these benchmarks do not consider whether a question can be supported by valid or sufficient rationales, some of their questions may be inherently non-inferable. To address this issue, we introduce a new benchmark, PROPHET, which comprises inferable forecasting questions paired with relevant news for retrieval. To ensure the inferability of the benchmark, we propose Causal Intervened Likelihood (CIL), a statistical measure that assesses inferability through causal inference. In constructing this benchmark, we first collected recent trend forecasting questions and then filtered the data using CIL, resulting in an inferable benchmark for future forecasting. Through extensive experiments, we first demonstrate the validity of CIL and use it to conduct in-depth investigations into future forecasting. Subsequently, we evaluate several representative prediction methods on PROPHET. Overall, the results yield valuable insights for future directions of this task.