Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise systems can cause context overflow, stale-state errors, and high inference cost. We study this problem in automated expense itemization in Microsoft Dynamics 365 Finance and Operations using Model Context Protocol tools. We evaluate four GPT-5 configurations on a 50-task hotel expense benchmark: no user model, full conversation history, context pruned to the last 5 tool call/response pairs, and pruning with automated summarization. Results are averaged across 5 independent runs, with the user model held constant for the context-engineering comparison. The no-user-model baseline achieves only 8.0% complete itemization. Full-context retention improves completion to 71.0%, but consumes 1,480,996 tokens and 14.56 hours per benchmark. Pruning to the last 5 tool calls improves completion to 79.0% while reducing token use to 535,274 and runtime to 5.39 hours. Adding summarization achieves the best result: 91.6% complete itemization and 99.64% average amount itemized, with 553,374 tokens and 5.79 hours. We further report confidence intervals, effect-size analysis, sensitivity over pruning and summary windows, failure analysis, results across five expense types grouped into three categories, and cross-model evidence with Claude Sonnet 4.5. These results show that, for this class of enterprise tool-use workflow, selective retention of recent tool interactions plus compact summarization can improve both reliability and efficiency compared with full-history retention.
翻译:在将大型语言模型部署为面向企业工作流的自主智能体时,面临一项关键挑战:来自企业系统的冗长工具响应可能导致上下文溢出、状态错误以及高昂的推理成本。本文基于微软 Dynamics 365 财务与运营系统中的模型上下文协议工具,针对自动化费用明细化任务研究此问题。我们在一个包含50项任务的酒店费用基准测试上评估了四种GPT-5配置:无用户模型、完整对话历史、上下文裁剪至最后5个工具调用/响应对,以及结合自动摘要的裁剪方案。结果基于5次独立运行的平均值,其中用户模型保持不变以进行上下文工程对比。无用户模型基线仅达成8.0%的完全费用明细化。保留完整上下文将完成率提升至71.0%,但每次基准测试消耗1,480,996个token和14.56小时。将上下文裁剪至最后5个工具调用后,完成率提升至79.0%,同时token使用量降至535,274个,运行时间缩短至5.39小时。加入摘要功能后取得最佳结果:完全费用明细化率达到91.6%,平均费用明细化金额达99.64%,消耗553,374个token和5.79小时。我们进一步报告了置信区间、效应量分析、对裁剪窗口和摘要窗口的敏感性分析、失败案例分析、三个类别下五种费用类型的结果,以及基于Claude Sonnet 4.5的跨模型验证。这些结果表明,对于此类企业工具调用工作流,选择性保留近期工具交互并配合紧凑摘要,可在可靠性和效率上优于保留完整历史的方法。