Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are sufficiently capable for agentic workflows. However, while smaller agents can closely match larger ones on simple tasks, it remains unclear how their performance scales with task complexity, when large models become necessary, and how to better leverage small agents for long-horizon workloads. In this work, we empirically show that small agents' performance fails to scale with task complexity on deep search and coding tasks, and we introduce Strategy Auctions for Workload Efficiency (SALE), an agent framework inspired by freelancer marketplaces. In SALE, agents bid with short strategic plans, which are scored by a systematic cost-value mechanism and refined via a shared auction memory, enabling per-task routing and continual self-improvement without training a separate router or running all models to completion. Across deep search and coding tasks of varying complexity, SALE reduces reliance on the largest agent by 53%, lowers overall cost by 35%, and consistently improves upon the largest agent's pass@1 with only a negligible overhead beyond executing the final trace. In contrast, established routers that rely on task descriptions either underperform the largest agent or fail to reduce cost -- often both -- underscoring their poor fit for agentic workflows. These results suggest that while small agents may be insufficient for complex workloads, they can be effectively "scaled up" through coordinated task allocation and test-time self-improvement. More broadly, they motivate a systems-level view of agentic AI in which performance gains come less from ever-larger individual models and more from market-inspired coordination mechanisms that organize heterogeneous agents into efficient, adaptive ecosystems.
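As a rough illustration of the bidding step described in the abstract, the following minimal sketch shows how short plans from several agents might be scored and routed. It is not the SALE implementation: the `Bid` fields, the linear `score` function, the `cost_weight` parameter, and the example numbers are all assumptions made here for illustration, and the shared auction memory is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    agent: str        # hypothetical agent name, e.g. "small" or "large"
    plan: str         # short strategic plan submitted as the bid
    est_value: float  # assumed estimate of plan quality in [0, 1]
    est_cost: float   # assumed estimate of execution cost (arbitrary units)


def score(bid: Bid, cost_weight: float = 1.0) -> float:
    """Assumed cost-value trade-off: estimated value minus weighted cost."""
    return bid.est_value - cost_weight * bid.est_cost


def run_auction(bids: list[Bid], cost_weight: float = 1.0) -> Bid:
    """Return the highest-scoring bid; only the winner would then be executed."""
    if not bids:
        raise ValueError("an auction needs at least one bid")
    return max(bids, key=lambda b: score(b, cost_weight))


# Illustrative numbers only, not taken from the paper.
bids = [
    Bid("small", "search two sources, then answer", est_value=0.6, est_cost=0.05),
    Bid("large", "decompose, search widely, verify", est_value=0.9, est_cost=0.50),
]

cheap_winner = run_auction(bids)                     # cost weighs heavily
costly_winner = run_auction(bids, cost_weight=0.1)   # cost weighs lightly
print(cheap_winner.agent, costly_winner.agent)
```

In this toy setup the routing decision depends on how strongly cost is penalized, which is the kind of per-task trade-off the abstract attributes to the cost-value mechanism. Only the winning plan is executed, so no model other than the winner runs to completion.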