We introduce FinWorkBench (a.k.a. Finch), a benchmark for evaluating AI agents on real-world, enterprise-grade finance and accounting workflows that interleave data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces from Enron (15,000 files and 500,000 emails) and other financial institutions, covering the period 2000--2025 and preserving the in-the-wild messiness of multimodal artifacts such as tables and charts across diverse domains including budgeting, trading, asset management, and operational management. We propose a workflow construction process that combines LLM-assisted mining of workflows from authentic enterprise environments with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and spreadsheet version histories, and (2) meticulous annotation requiring over 700 hours of expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems, including GPT-5.1, Claude Sonnet 4.5, Claude Opus 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max. Under human evaluation, GPT-5.1 Pro spends an average of 16.8 minutes per workflow yet passes only 38.4% of workflows. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.