We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation that combines numerical probability estimates with natural-language evidence summaries and is updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to an ever-growing context. (2) Hierarchical multi-trial aggregation: running $K$ independent trials and combining them using logit-space shrinkage with a data-dependent prior. (3) Hierarchical calibration: Platt scaling with a hierarchical prior, which avoids over-shrinking extreme predictions for sources with skewed base rates. On 400 backtesting questions from the ForecastBench leaderboard, BLF outperforms all top public methods, including Cassi, GPT-5, Grok~4.20, and Foresight-32B. Ablation studies show that the structured belief state is as impactful as web search access, and that shrinkage aggregation and hierarchical calibration each provide significant additional gains. In addition, we develop a robust backtesting framework with a leakage rate below 1.5\%, and use rigorous statistical methodology to compare methods while controlling for various sources of noise.
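To make ideas (2) and (3) concrete, the following is a minimal sketch of logit-space shrinkage and Platt scaling, not the implementation used in BLF. The function names, the shrinkage weight $K/(K+\tau)$, the pseudo-count `tau`, and the Platt parameters `a` and `b` are all illustrative assumptions. In particular, the abstract does not specify how the data-dependent prior or the hierarchical prior is estimated, so the prior probability is simply passed in here.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of p, with clipping so that 0 and 1 stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))


def shrink_aggregate(trial_probs, prior_prob, tau=1.0):
    """Combine K trial probabilities by shrinking their mean logit
    toward the logit of a prior.

    ASSUMPTION: the weight K / (K + tau) is a generic shrinkage rule
    chosen for illustration; it is not taken from the paper.
    """
    k = len(trial_probs)
    if k == 0:
        return prior_prob  # no trials: fall back to the prior
    mean_logit = sum(logit(p) for p in trial_probs) / k
    weight = k / (k + tau)
    combined = weight * mean_logit + (1.0 - weight) * logit(prior_prob)
    return sigmoid(combined)


def platt_scale(p, a=1.0, b=0.0):
    """Platt scaling applied in logit space: sigmoid(a * logit(p) + b).

    ASSUMPTION: a and b would be fitted on held-out data; a
    hierarchical variant would additionally tie per-source (a, b)
    to shared values, which this sketch does not attempt.
    """
    return sigmoid(a * logit(p) + b)


if __name__ == "__main__":
    trials = [0.9, 0.9, 0.9]
    pooled = shrink_aggregate(trials, prior_prob=0.5, tau=3.0)
    print(round(pooled, 4))
    print(round(platt_scale(pooled, a=1.2, b=-0.1), 4))
```

With three trials at 0.9, a prior of 0.5, and `tau=3.0`, the weight on the trials is 0.5, so the pooled logit is half of $\ln 9$ and the pooled probability is 0.75. Averaging in logit space rather than probability space keeps confident trials from being flattened by the arithmetic mean, while the prior term still pulls the result toward the base rate when few trials are available.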