High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, carefully designed curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus for both data generation and retrieval in our forecasting system. Guided by a small validation set, we demonstrate the benefits of retrieval and of an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing from May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find that the calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.