AI agents have recently advanced rapidly in capability and are now widely used in professional research applications such as STEM, software development, and finance. Among these agents, deep research agents are a key category because they can perform long-horizon tasks and solve more complex problems. However, few evaluation frameworks or benchmarks systematically and automatically assess the capabilities of these research agents. In addition, financial research problems have their own distinct complexity and subtlety. To fill this gap, we propose FinResearchBench, a logic tree-based Agent-as-a-Judge evaluation framework designed specifically for financial research agents. It provides a comprehensive, automatic assessment of research agents across 7 key task types in the financial research domain. The contributions of this work are twofold: (1) it is the first Agent-as-a-Judge system that extracts the logic tree of the research output and uses it as intermediate information to produce a comprehensive, reliable, and robust evaluation; (2) it is finance-oriented, covering 70 typical financial research questions spread across 7 frequently encountered task types in the domain.