When researchers deploy large language models for autonomous tasks like reviewing literature or generating hypotheses, the computational bills add up quickly. A single research session using a 70-billion-parameter model can cost around $127 in cloud fees, putting these tools out of reach for many academic labs. We developed AgentCompress to tackle this problem head-on. The core idea came from a simple observation during our own work: writing a novel hypothesis clearly demands more from the model than reformatting a bibliography, so why should both tasks run at full precision? Our system uses a small neural network to gauge how hard each incoming task will be, based only on its opening words, then routes it to a suitably compressed model variant. The routing decision takes under a millisecond. Testing across 500 research workflows in four scientific fields, we cut compute costs by 68.3% while retaining 96.2% of the original success rate. For labs watching their budgets, this could mean the difference between running experiments and sitting on the sidelines.
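The routing idea can be sketched in a few lines. The snippet below is a toy illustration, not the paper's actual design: a lightweight scorer stands in for the small neural network, estimating difficulty from a task's opening words and mapping the score to a compressed model variant. All cue words, thresholds, and variant names are illustrative assumptions.

```python
# Toy sketch of difficulty-aware routing. The keyword scorer is a
# stand-in for the learned difficulty estimator; cue sets, thresholds,
# and variant names are hypothetical.

# Hypothetical cue words suggesting a harder, generative task.
HARD_CUES = {"hypothesize", "propose", "derive", "design", "explain"}
# Hypothetical cue words suggesting an easy, mechanical task.
EASY_CUES = {"reformat", "sort", "list", "rename", "convert"}

def estimate_difficulty(prompt: str, n_opening_words: int = 16) -> float:
    """Score task difficulty in [0, 1] from the prompt's opening words."""
    words = prompt.lower().split()[:n_opening_words]
    hard = sum(w.strip(".,") in HARD_CUES for w in words)
    easy = sum(w.strip(".,") in EASY_CUES for w in words)
    # Neutral prior of 0.5, nudged up or down by the cue counts.
    return max(0.0, min(1.0, 0.5 + 0.2 * hard - 0.2 * easy))

def route(prompt: str) -> str:
    """Map a difficulty score to a (hypothetical) compressed variant."""
    d = estimate_difficulty(prompt)
    if d >= 0.7:
        return "fp16-full"       # full precision for hard tasks
    elif d >= 0.4:
        return "int8-quantized"  # moderate compression
    return "int4-quantized"      # aggressive compression for easy tasks

print(route("Propose and derive a novel hypothesis about enzyme kinetics"))
print(route("Reformat this bibliography and sort the entries alphabetically"))
```

A production router would replace the cue-word scorer with a trained classifier over the opening tokens, but the control flow, score then dispatch to a variant, is the same.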