Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design the benchmark to maximize discriminative power across varying levels of agentic ability. To evaluate agentic behaviour, we introduce a novel evaluation protocol that measures the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They also persist in unproductive loops and fail to close the nearly 20% gap to oracle performance. We release the dataset and evaluation harness to facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.