Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are sufficiently capable for agentic workflows. However, while smaller agents can closely match larger ones on simple tasks, it remains unclear how their performance scales with task complexity, when large models become necessary, and how to better leverage small agents for long-horizon workloads. In this work, we empirically show that small agents' performance fails to scale with task complexity on deep search and coding tasks, and we introduce Strategy Auctions for Workload Efficiency (SALE), an agent framework inspired by freelancer marketplaces. In SALE, agents bid with short strategic plans, which are scored by a systematic cost-value mechanism and refined via a shared auction memory, enabling per-task routing and continual self-improvement without training a separate router or running all models to completion. Across deep search and coding tasks of varying complexity, SALE reduces reliance on the largest agent by 52%, lowers overall cost by 35%, and consistently improves upon the largest agent's pass@1 with only a negligible overhead beyond executing the final trace. In contrast, established routers that rely on task descriptions either underperform the largest agent or fail to reduce cost, often both, underscoring their poor fit for agentic workflows. These results suggest that while small agents may be insufficient for complex workloads, they can be effectively "scaled up" through coordinated task allocation and test-time self-improvement. More broadly, they motivate a systems-level view of agentic AI in which performance gains come less from ever-larger individual models and more from market-inspired coordination mechanisms that organize heterogeneous agents into efficient, adaptive ecosystems.
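To make the auction idea concrete, the following is a minimal illustrative sketch of per-task routing by bid scoring. It is not the SALE implementation. The names `Bid`, `AuctionMemory`, `score`, and `run_auction`, the linear value-minus-cost score, the `cost_weight` parameter, and the smoothed success-rate update are all assumptions introduced here for illustration. The abstract states only that agents bid with short plans, that bids are scored by a cost-value mechanism, and that a shared auction memory refines them.

```python
from dataclasses import dataclass, field


@dataclass
class Bid:
    agent: str        # name of the bidding agent (hypothetical)
    plan: str         # short strategic plan submitted as the bid
    est_value: float  # assumed: estimated chance the plan solves the task, in [0, 1]
    est_cost: float   # assumed: normalized estimated cost of executing the plan


@dataclass
class AuctionMemory:
    """Hypothetical shared memory: per-agent (successes, attempts) counts."""
    history: dict = field(default_factory=dict)

    def record(self, agent: str, success: bool) -> None:
        wins, total = self.history.get(agent, (0, 0))
        self.history[agent] = (wins + int(success), total + 1)

    def success_rate(self, agent: str, prior: float = 0.5) -> float:
        # Smoothed rate: an agent with no history falls back to the prior.
        wins, total = self.history.get(agent, (0, 0))
        return (wins + prior) / (total + 1)


def score(bid: Bid, memory: AuctionMemory, cost_weight: float = 0.5) -> float:
    # Assumed linear trade-off: memory-discounted value minus weighted cost.
    return bid.est_value * memory.success_rate(bid.agent) - cost_weight * bid.est_cost


def run_auction(bids: list, memory: AuctionMemory, cost_weight: float = 0.5) -> Bid:
    # Only the winning bid's plan would be executed to completion.
    return max(bids, key=lambda b: score(b, memory, cost_weight))


memory = AuctionMemory()
bids = [
    Bid("small", "search two sources, then answer", est_value=0.8, est_cost=0.1),
    Bid("large", "decompose, search broadly, verify", est_value=0.9, est_cost=0.6),
]
first_winner = run_auction(bids, memory).agent

# After repeated failures by the small agent, the memory shifts routing.
for _ in range(3):
    memory.record("small", success=False)
later_winner = run_auction(bids, memory).agent
```

In this toy setting the cheaper agent wins while no history exists, and the larger agent takes over once the memory records repeated failures by the smaller one. This mirrors, in a very simplified form, the abstract's claim that routing happens per task without running every model to completion.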