Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

We introduce Finch, a finance & accounting benchmark for evaluating AI agents on real-world, enterprise-grade professional workflows that interleave data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions. It preserves in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spans diverse domains such as budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation of workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems, including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max. GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.
