We present the Bayesian Linguistic Forecaster (BLF), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) Linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to an ever-growing, unstructured context. (2) Hierarchical multi-trial aggregation: running $K$ independent trials and combining them by averaging in logit space, with shrinkage toward a data-dependent prior. (3) Hierarchical calibration: Platt scaling with a hierarchical prior, which avoids over-shrinking extreme predictions for sources with skewed base rates. On 400 questions from the ForecastBench leaderboard, BLF outperforms all of the top public methods, including Cassi, GPT-5, Grok~4.20, and Foresight-32B. Careful ablation studies, using mixed-effects analysis to control for question variability (which accounts for 62\% of the variance in performance), reveal that all three components contribute to the overall gains, although their relative importance depends on the base LLM and the setting (e.g.\ with or without a crowd prior). All our experiments are based on a robust back-testing framework that we developed, which has a leakage rate below 1.5\% and may be of independent interest.
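To make ideas (2) and (3) concrete, the following is a minimal sketch of logit-space averaging with shrinkage, followed by Platt scaling. It is an illustration under stated assumptions, not the BLF implementation: the function names, the fixed `prior_prob` and `prior_weight` inputs (standing in for the data-dependent prior), and the fixed Platt coefficients `a` and `b` (standing in for hierarchically fitted ones) are all hypothetical.

```python
import math


def logit(p, eps=1e-6):
    """Map a probability to log-odds, clipping away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def aggregate_trials(trial_probs, prior_prob, prior_weight=1.0):
    """Average K trial probabilities in logit space, shrunk toward a prior.

    Assumption: the prior acts like `prior_weight` pseudo-trials. The
    abstract does not specify the exact form of the shrinkage.
    """
    k = len(trial_probs)
    if k == 0:
        raise ValueError("need at least one trial")
    mean_logit = sum(logit(p) for p in trial_probs) / k
    shrunk = (k * mean_logit + prior_weight * logit(prior_prob)) / (k + prior_weight)
    return sigmoid(shrunk)


def platt_scale(p, a=1.0, b=0.0):
    """Platt scaling: an affine map in logit space.

    Assumption: `a` and `b` are given; in a hierarchical variant they
    would be fitted per source and shrunk toward shared values.
    """
    return sigmoid(a * logit(p) + b)


if __name__ == "__main__":
    # Three trials at 0.9, shrunk toward an uninformative prior of 0.5.
    aggregated = aggregate_trials([0.9, 0.9, 0.9], prior_prob=0.5, prior_weight=3.0)
    print(aggregated, platt_scale(aggregated))
```

In this sketch, a larger `prior_weight` pulls the aggregate toward the prior, while `prior_weight=0` reduces to a plain logit-space mean of the trials.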