This report studies whether small, tool-augmented agents can match or outperform larger monolithic models on the GAIA benchmark. Using Qwen3 models (4B–32B) within an adapted Agentic-Reasoning framework, we isolate the effects of model scale, explicit thinking (no thinking, planner-only, or full), and tool use (search, code, mind-map). Tool augmentation provides the largest and most consistent gains: in our experimental setup, tool-equipped 4B models outperform 32B models without tool access on GAIA. In contrast, explicit thinking is highly configuration- and difficulty-dependent. Planner-only thinking can improve decomposition and constraint tracking, while unrestricted full thinking often degrades performance by destabilizing tool orchestration, leading to skipped verification steps, excessive tool calls, non-termination, and output-format drift.