This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game. We identify four inference-time levers that shape performance: model selection, policies and guardrails, centralized data sharing, and prompt engineering. Model capability is the dominant factor: an out-of-the-box reasoning model exceeds human-level performance, and optimized reasoning models reduce costs by up to 67% relative to human teams. However, strong average performance masks substantial reliability risks. We introduce agent bullwhip: the amplification of run-to-run decision instability in autonomous multi-echelon systems. A central component is decision bullwhip, the portion of order variability generated by stochastic agent decisions rather than by changes in customer demand. We show that decision instability can amplify both across facilities at a fixed point in time and within the same facility over time, even when the demand path is held fixed. Repeated sampling, a natural test-time remedy, fails to meaningfully reduce this instability, suggesting that reliability requires changing the underlying decision policy rather than merely averaging over model outputs. To address this limitation, we propose a Group Relative Policy Optimization (GRPO)-based reinforcement-learning post-training framework that trains a shared base LLM using system-level supply-chain rewards. Post-training substantially reduces tail events, curtails agent bullwhip, and improves the reliability of autonomous supply-chain agents.
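To make the notion of decision bullwhip concrete, the following is a minimal toy sketch, not the paper's agents or its exact metric. It assumes a hypothetical four-echelon chain in which each facility simply passes through the order it receives plus Gaussian "decision noise" (a stand-in for stochastic LLM outputs). Because the demand path is held fixed across runs, any run-to-run variance in orders is attributable to the decisions rather than to demand. The function names, the noise model, and the demand path are all illustrative assumptions.

```python
import random
import statistics


def simulate_orders(demand, noise_sd, seed, n_echelons=4):
    """Toy chain (assumption): each echelon orders what it saw from
    downstream plus Gaussian decision noise, floored at zero."""
    rng = random.Random(seed)
    orders = []              # orders[k][t]: order placed by echelon k in period t
    incoming = list(demand)  # echelon 0 observes customer demand
    for _ in range(n_echelons):
        placed = [max(0.0, x + rng.gauss(0.0, noise_sd)) for x in incoming]
        orders.append(placed)
        incoming = placed    # the next echelon upstream sees these orders
    return orders


def run_to_run_variance(runs):
    """Per echelon: across-run variance of orders, averaged over periods."""
    n_echelons = len(runs[0])
    n_periods = len(runs[0][0])
    result = []
    for k in range(n_echelons):
        per_period = [
            statistics.pvariance([run[k][t] for run in runs])
            for t in range(n_periods)
        ]
        result.append(statistics.fmean(per_period))
    return result


# Hypothetical fixed demand path, identical in every run.
demand = [4] * 4 + [8] * 16

# Only the random seed (the "decision" randomness) differs between runs.
runs = [simulate_orders(demand, noise_sd=1.0, seed=s) for s in range(200)]
variances = run_to_run_variance(runs)

for k, v in enumerate(variances):
    print(f"echelon {k}: run-to-run order variance = {v:.2f}")
```

In this sketch the variance grows from the retailer end of the chain toward the upstream end even though demand never changes, which is the cross-facility amplification the abstract describes. Setting `noise_sd=0.0` removes the decision randomness and drives every variance to zero.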