The ability of Large Language Models (LLMs) to use external tools unlocks powerful real-world interactions, making rigorous evaluation essential. However, current benchmarks primarily report final accuracy, revealing what models can do but obscuring the cognitive bottlenecks that define their true capability boundaries. To move from simple performance scoring to a diagnostic tool, we introduce a framework grounded in Cognitive Load Theory. Our framework deconstructs task complexity into two quantifiable components: Intrinsic Load, the inherent structural complexity of the solution path, formalized with a novel Tool Interaction Graph; and Extraneous Load, the difficulty arising from ambiguous task presentation. To enable controlled experiments, we construct ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load. Our evaluation reveals distinct performance cliffs as cognitive load increases, allowing us to map each model's capability boundary precisely. We validate that our framework's predictions are well calibrated against empirical results, establishing a principled methodology for understanding an agent's limits and a practical foundation for building more efficient systems.
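To make the Intrinsic Load component concrete, the sketch below illustrates one way a Tool Interaction Graph could be represented and scored: a DAG of tool calls whose edges are data dependencies, with structural complexity proxied by node count, edge count, and the longest dependency chain. The names (`ToolCall`, `intrinsic_load`) and the equal weighting are illustrative assumptions for exposition, not the paper's actual formalization.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a Tool Interaction Graph (TIG): a DAG of tool calls,
# where an edge means one call's output feeds another call's input.
# The scoring below is an assumed placeholder, not the paper's formula.

@dataclass
class ToolCall:
    name: str                                        # tool identifier, e.g. "search_flights"
    depends_on: list = field(default_factory=list)   # names of upstream tool calls

def longest_chain(calls: dict) -> int:
    """Length of the longest dependency chain (sequential depth of the plan)."""
    memo = {}
    def depth(name: str) -> int:
        if name not in memo:
            deps = calls[name].depends_on
            memo[name] = 1 + max((depth(d) for d in deps), default=0)
        return memo[name]
    return max((depth(n) for n in calls), default=0)

def intrinsic_load(calls: dict) -> float:
    """Toy structural-complexity score: tools used + dependency edges + depth."""
    n_nodes = len(calls)
    n_edges = sum(len(c.depends_on) for c in calls.values())
    return n_nodes + n_edges + longest_chain(calls)   # assumed equal weights

if __name__ == "__main__":
    # Example: a travel task where booking depends on two independent lookups.
    graph = {
        "search_flights": ToolCall("search_flights"),
        "check_weather":  ToolCall("check_weather"),
        "book_trip":      ToolCall("book_trip", ["search_flights", "check_weather"]),
    }
    print(intrinsic_load(graph))   # 3 nodes + 2 edges + depth 2 -> 7
```

Under this kind of scheme, Extraneous Load would be varied independently of the graph itself, for example by rewording or under-specifying the task prompt while holding the solution-path structure fixed.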