OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning

Krista Opsahl-Ong, Arnav Singhvi, Jasmine Collins, Ivan Zhou, Cindy Wang, Ashutosh Baheti, Owen Oertell, Jacob Portes, Sam Havens, Erich Elsen, Michael Bendersky, Matei Zaharia, Xing Chen

From arXiv; 24 pages, 16 figures. Introduces the OfficeQA Pro benchmark for grounded reasoning over enterprise documents.

We introduce OfficeQA Pro, a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. OfficeQA Pro consists of 133 questions that require precise document parsing, retrieval, and analytical reasoning across both unstructured text and tabular data. Frontier LLMs including Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro Preview achieve less than 5% accuracy on OfficeQA Pro when relying on parametric knowledge, and less than 12% with additional access to the web. When provided directly with the document corpus, frontier agents still struggle on over half of questions, scoring 34.1% on average. We find that providing agents with a structured document representation produced by Databricks' ai_parse_document yields a 16.1% average relative performance gain across agents. We conduct additional ablations to study the effects of model selection, table representation, retrieval strategy, and test-time scaling on performance. Despite these improvements, significant headroom remains before agents can be considered reliable at enterprise-grade grounded reasoning.
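
The abstract reports the parsing improvement as a relative gain, which is easy to misread as an absolute change in accuracy points. The sketch below shows the difference. It is only illustrative arithmetic: the function names are hypothetical, and applying the 16.1% figure to the 34.1% average is an assumption made for the example, since the abstract reports an average of per-agent gains and does not state the resulting accuracy.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


def apply_relative_gain(baseline: float, gain: float) -> float:
    """Return the score obtained by raising `baseline` by a relative `gain`."""
    return baseline * (1.0 + gain)


def absolute_gain(baseline: float, improved: float) -> float:
    """Return the gain in the same units as the scores (here, accuracy points)."""
    return improved - baseline


if __name__ == "__main__":
    # Figures quoted in the abstract.
    avg_accuracy = 34.1   # average accuracy (%) with direct corpus access
    rel_gain = 0.161      # 16.1% average relative gain from structured parsing

    # ASSUMPTION: the gain is applied uniformly to the average score.
    # The abstract does not report this number; it is hypothetical.
    hypothetical = apply_relative_gain(avg_accuracy, rel_gain)

    print(f"hypothetical accuracy: {hypothetical:.1f}%")
    print(f"absolute gain: {absolute_gain(avg_accuracy, hypothetical):.1f} points")
    print(f"relative gain: {relative_gain(avg_accuracy, hypothetical):.1%}")
```

Under that assumption, a 16.1% relative gain corresponds to only a few accuracy points, which is consistent with the abstract's conclusion that significant headroom remains.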
