Deep Research (DR) agents powered by advanced Large Language Models (LLMs) have fundamentally shifted the paradigm for completing complex research tasks. Yet their forecasting performance on real-world, research-oriented tasks in high-stakes domains (e.g., finance) has not been evaluated in a comprehensive, live setting. We introduce FinDeepForecast, the first live, end-to-end multi-agent system for automatically evaluating DR agents by continuously generating research-oriented financial forecasting tasks. The system is built on a dual-track taxonomy that enables the dynamic generation of recurrent and non-recurrent forecasting tasks at both the corporate and macro levels. With this system, we generate FinDeepForecastBench, a weekly evaluation benchmark over a ten-week horizon that covers 8 global economies and 1,314 listed companies, and we evaluate 13 representative methods on it. Extensive experiments show that, while DR agents consistently outperform strong baselines, their performance still falls short of genuine forward-looking financial reasoning. We expect the FinDeepForecast system to continue supporting progress on DR agents in research-oriented financial forecasting. The benchmark and leaderboard are publicly available on the OpenFinArena Platform.