Inventory control is a fundamental operations problem in which ordering decisions are traditionally guided by theoretically grounded operations research (OR) algorithms. However, such algorithms often rely on rigid modeling assumptions and can perform poorly when demand distributions shift or relevant contextual information is unavailable. Recent advances in large language models (LLMs) have generated interest in AI agents that can reason flexibly and incorporate rich contextual signals, but it remains unclear how best to integrate LLM-based methods into traditional decision-making pipelines. We study how OR algorithms, LLMs, and humans can interact and complement each other in a multi-period inventory control setting. We construct InventoryBench, a benchmark of over 1,000 inventory instances spanning both synthetic and real-world demand data, designed to stress-test decision rules under demand shifts, seasonality, and uncertain lead times. Through this benchmark, we find that OR-augmented LLM methods outperform either method in isolation, suggesting that the two are complements rather than substitutes. We further investigate the role of humans through a controlled classroom experiment that embeds LLM recommendations into a human-in-the-loop decision pipeline. Contrary to prior findings that human-AI collaboration can degrade performance, we show that, on average, human-AI teams achieve higher profits than either humans or AI agents operating alone. Beyond this population-level finding, we formalize an individual-level complementarity effect and derive a distribution-free lower bound on the fraction of individuals who benefit from AI collaboration; empirically, we find this fraction to be substantial.
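To make the multi-period setting concrete, the following is a minimal sketch of a classical OR-style baseline: a base-stock (order-up-to) policy simulated over a demand sequence, with the level chosen by grid search. All parameter names (`unit_cost`, `price`, `holding_cost`) and the demand distribution are illustrative assumptions, not the paper's actual benchmark specification.

```python
import random

def simulate_base_stock(base_stock_level, demands,
                        unit_cost=1.0, price=2.0, holding_cost=0.1):
    """Simulate a multi-period inventory system under a base-stock policy.

    Each period: order up to base_stock_level, observe demand, sell what
    inventory allows, and pay a holding cost on leftover stock.
    Returns total profit over the horizon. (Illustrative sketch; ignores
    lead times and lost-sales penalties.)
    """
    inventory = 0
    profit = 0.0
    for demand in demands:
        order_qty = max(base_stock_level - inventory, 0)
        inventory += order_qty
        profit -= unit_cost * order_qty      # procurement cost
        sales = min(inventory, demand)
        profit += price * sales              # revenue
        inventory -= sales
        profit -= holding_cost * inventory   # carrying cost on leftovers
    return profit

# Illustrative usage: pick the best base-stock level for a sampled horizon.
random.seed(0)
demands = [random.randint(5, 15) for _ in range(20)]
best = max(range(5, 21), key=lambda s: simulate_base_stock(s, demands))
print("best base-stock level:", best)
```

A rule like this performs well when demand is stationary but can degrade under the demand shifts and seasonality the benchmark is designed to stress, which is the gap contextual LLM-based methods aim to fill.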