Language models (LMs) trained on web-scale datasets owe much of their success to their ability to memorize vast amounts of training data, even information that appears in only a few examples. This capability is often desirable in evaluations on tasks such as question answering, but it raises the question of whether these models exhibit genuine reasoning or merely mimic patterns from the training data. The distinction is particularly salient in forecasting tasks, where the answer is absent from the training data and the model must reason to make logical deductions. We present Reasoning and Tools for Forecasting (RTF), a framework of reasoning-and-acting (ReAct) agents that can dynamically retrieve up-to-date information and run numerical simulations with equipped tools. We evaluate our model on questions from competitive forecasting platforms and demonstrate that our method is competitive with, and can outperform, human predictions. This suggests that LMs, given the right tools, can indeed think and adapt like humans, offering valuable insights for real-world decision-making.
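To make the described agent loop concrete, here is a minimal, hypothetical sketch of a ReAct-style forecasting cycle. This is not the authors' implementation: the tool names (`search_news`, `run_simulation`), the scripted thought/action trace, and the toy election scenario are all illustrative assumptions; a real RTF agent would let an LM choose actions and would call live retrieval APIs.

```python
import random

# Illustrative stand-ins for the agent's tools (assumptions, not the paper's API).

def search_news(query):
    """Stub retrieval tool: returns canned evidence for the question."""
    return "Polls show the incumbent leading by 4 points with 6% undecided."

def run_simulation(lead, undecided, n=10_000, seed=0):
    """Stub numerical tool: Monte Carlo over how undecided voters break.

    Assumes the net swing from undecided voters is uniform in
    [-undecided, +undecided] percentage points.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        swing = rng.uniform(-undecided, undecided)
        if lead + swing > 0:
            wins += 1
    return wins / n

def react_forecast(question):
    """Run a fixed Thought -> Action -> Observation trace, then answer.

    A real ReAct agent would generate each Thought/Action with an LM;
    here the trace is hard-coded to show the loop's shape.
    """
    trace = []
    trace.append(("Thought", f"I need current evidence for: {question}"))
    obs = search_news(question)
    trace.append(("Action", "search_news"))
    trace.append(("Observation", obs))
    trace.append(("Thought", "Quantify the uncertainty with a simulation."))
    prob = run_simulation(lead=4.0, undecided=6.0)
    trace.append(("Action", "run_simulation"))
    trace.append(("Observation", f"P(win) ~= {prob:.2f}"))
    return prob, trace

prob, trace = react_forecast("Will the incumbent win re-election?")
print(f"Forecast probability: {prob:.2f}")
```

Under the uniform-swing assumption the agent wins whenever the swing exceeds -4 points, so the simulated probability settles near 10/12 ≈ 0.83; the point is the interleaving of retrieval and computation, not the toy numbers.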