Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise systems can cause context overflow, stale-state errors, and high inference cost. We study this problem in automated expense itemization in Microsoft Dynamics 365 Finance and Operations using Model Context Protocol tools. We evaluate four GPT-5 configurations on a 50-task hotel expense benchmark: no user model, full conversation history, context pruned to the last 5 tool call/response pairs, and pruning with automated summarization. Results are averaged across 5 independent runs, with the user model held constant for the context-engineering comparison. The no-user-model baseline achieves only 8.0% complete itemization. Full-context retention improves completion to 71.0%, but consumes 1,480,996 tokens and 14.56 hours per benchmark. Pruning to the last 5 tool calls improves completion to 79.0% while reducing token use to 535,274 and runtime to 5.39 hours. Adding summarization achieves the best result: 91.6% complete itemization and 99.64% average amount itemized, with 553,374 tokens and 5.79 hours. We further report confidence intervals, effect-size analysis, sensitivity over pruning and summary windows, failure analysis, results across five expense types grouped into three categories, and cross-model evidence with Claude Sonnet 4.5. These results show that, for this class of enterprise tool-use workflow, selective retention of recent tool interactions plus compact summarization can improve both reliability and efficiency compared with full-history retention.
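The pruning-plus-summarization configuration described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Turn` type, the `prune_context` function, and the `naive_summary` stub are hypothetical names introduced here, and the sketch assumes that non-tool messages are always retained and that the summary is inserted just before the oldest retained tool call/response pair. In the study, summarization is automated; here a simple truncating stub stands in for it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    """One history entry (hypothetical type for illustration)."""
    kind: str  # "message" for plain text, "tool_pair" for a tool call plus its response
    text: str


def prune_context(
    turns: List[Turn],
    keep_pairs: int = 5,
    summarize: Optional[Callable[[List[Turn]], str]] = None,
) -> List[Turn]:
    """Keep only the last `keep_pairs` tool call/response pairs.

    Non-tool messages are always kept (an assumption of this sketch).
    If `summarize` is given, dropped pairs are replaced by one summary message.
    """
    if keep_pairs < 1:
        raise ValueError("keep_pairs must be at least 1")

    tool_indices = [i for i, t in enumerate(turns) if t.kind == "tool_pair"]
    if len(tool_indices) <= keep_pairs:
        return list(turns)  # nothing to prune

    cutoff = tool_indices[-keep_pairs]  # index of the oldest retained pair
    dropped = [t for t in turns[:cutoff] if t.kind == "tool_pair"]
    kept = [t for i, t in enumerate(turns) if t.kind != "tool_pair" or i >= cutoff]

    if summarize is not None:
        # Place the summary immediately before the oldest retained tool pair.
        pos = next(i for i, t in enumerate(kept) if t.kind == "tool_pair")
        kept.insert(pos, Turn("message", summarize(dropped)))
    return kept


def naive_summary(dropped: List[Turn]) -> str:
    """Stand-in summarizer: truncates each dropped response (not an LLM call)."""
    return "Summary of earlier tool calls: " + "; ".join(t.text[:40] for t in dropped)


# Usage: one task message followed by eight tool call/response pairs.
history = [Turn("message", "Itemize the hotel expense")]
history += [Turn("tool_pair", f"resp{i}") for i in range(8)]

pruned = prune_context(history, keep_pairs=5)
pruned_with_summary = prune_context(history, keep_pairs=5, summarize=naive_summary)
```

With `keep_pairs=5`, the three oldest tool pairs are dropped in both variants; the second variant additionally carries a single compact message describing them, which is the kind of state the summarization configuration is meant to preserve.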
