Large language models hold considerable promise for various applications, but their computational requirements create a barrier that many institutions cannot overcome. A single session using a 70-billion-parameter model can cost around $127 in cloud computing fees, which puts these tools out of reach for organizations operating on limited budgets. We present AgentCompress, a framework that tackles this problem through task-aware dynamic compression. The idea comes from a simple observation: not all tasks require the same computational effort. Complex reasoning, for example, is far more demanding than text reformatting, yet conventional compression applies the same reduction to both. Our approach uses a lightweight neural controller that inspects the first few tokens of each request, estimates how complex the task will be, and routes it to an appropriately quantized version of the model. This routing step adds only about 12 milliseconds of overhead. We tested the framework on 290 multi-stage workflows from domains including computer science, physics, chemistry, and biology. The results show a 68.3% reduction in computational costs while preserving 96.2% of the original success rate. These findings suggest that routing queries intelligently can make powerful language models substantially more affordable without sacrificing output quality.
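The abstract only sketches the routing step, so a minimal illustration may help. The sketch below assumes a small classifier over the embeddings of a request's first tokens and a three-tier mapping to quantized variants; the names `ComplexityRouter`, `TIER_TO_MODEL`, and the tier granularity are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of task-aware routing: a tiny controller reads the
# first k tokens of a request, predicts a complexity tier, and selects a
# quantized model variant. All names here are assumptions for illustration.
import torch
import torch.nn as nn


class ComplexityRouter(nn.Module):
    """Toy controller: embeds the first k token ids of a request and
    predicts one of three complexity tiers (low / medium / high)."""

    def __init__(self, vocab_size=32000, embed_dim=64, k=16, num_tiers=3):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.head = nn.Linear(embed_dim, num_tiers)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); look only at the first k tokens,
        # mirroring the "first few tokens" heuristic from the abstract.
        prefix = token_ids[:, : self.k]
        pooled = self.embed(prefix).mean(dim=1)  # mean-pool prefix embeddings
        return self.head(pooled)                 # tier logits


# Assumed mapping from predicted tier to a quantized model variant.
TIER_TO_MODEL = {0: "model-int4", 1: "model-int8", 2: "model-fp16"}


def route(router: ComplexityRouter, token_ids: torch.Tensor) -> str:
    """Return the model variant the controller deems sufficient."""
    with torch.no_grad():
        tier = router(token_ids).argmax(dim=-1).item()
    return TIER_TO_MODEL[tier]


if __name__ == "__main__":
    router = ComplexityRouter()
    request = torch.randint(0, 32000, (1, 32))  # stand-in tokenized request
    print(route(router, request))               # e.g. "model-int8"
```

Because the controller touches only a short prefix, its forward pass is tiny relative to the downstream model, which is consistent with the roughly 12-millisecond routing overhead the abstract reports.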