We use Group Relative Policy Optimization (GRPO), a recently devised sample- and memory-efficient reinforcement learning method, to fine-tune pretrained LLMs ranging from 1.5B to 14B parameters. The models can retrieve current information through a Wikipedia revisions tool or news summaries, and are trained to forecast real events that occur after their knowledge cutoff, as well as to solve problems constructed to simulate different aspects of the dynamics of that training. We use the results of these experiments to comment on how the forecasting capability of LLMs scales, and to classify where judgmental forecasting fits in the verifiable/unverifiable domain taxonomy, considering the impact of the aleatoric uncertainty inherent in forecasting future events (e.g., the roll of a die). With GRPO training, a 1.5B-parameter transformer (Qwen 2.5 1.5B) reaches forecasting performance superior to Claude Sonnet 3.5 on the same dataset, as measured by cross entropy against the market-agreed probabilities. We also discuss various dead ends on the path to this result.
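The evaluation metric named above can be illustrated with a minimal sketch. It assumes binary (yes/no) questions and the standard cross entropy H(p, q) between a market probability p and a model forecast q; the abstract does not spell out the exact formula, and the function names `cross_entropy` and `mean_cross_entropy` are hypothetical, chosen only for this illustration.

```python
import math


def cross_entropy(market_p: float, model_q: float, eps: float = 1e-9) -> float:
    """Cross entropy H(p, q) in nats for one binary question.

    market_p: probability of "yes" implied by the market (assumed reference).
    model_q:  probability of "yes" forecast by the model.
    eps:      clipping bound so log(0) is never evaluated (an assumption,
              not something specified in the source).
    """
    q = min(max(model_q, eps), 1.0 - eps)
    return -(market_p * math.log(q) + (1.0 - market_p) * math.log(1.0 - q))


def mean_cross_entropy(pairs) -> float:
    """Average cross entropy over (market_p, model_q) pairs; lower is better."""
    pairs = list(pairs)
    return sum(cross_entropy(p, q) for p, q in pairs) / len(pairs)
```

Under this definition, the score for a question is minimized when the model's forecast equals the market probability, so comparing two models on the same dataset amounts to comparing their `mean_cross_entropy` values.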