Multimodal agents offer a promising path toward automating complex, document-intensive workflows. Yet a critical question remains: do these agents demonstrate genuine strategic reasoning, or do they merely perform stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design the benchmark to maximize discriminative power across varying levels of agentic ability. To evaluate agentic behaviour, we introduce a novel evaluation protocol that measures the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.