As LLM agents become capable of increasingly long-horizon tasks, evaluating their performance in economic systems grows more important. Existing benchmarks primarily evaluate a single agent interacting with a passive environment. Economic systems, by contrast, are inherently multi-agent: autonomous agents must communicate, negotiate, and transact while pursuing their own objectives over extended periods. We introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy composed of heterogeneous firms. In CoffeeBench, two farmers, two roasters, and two retailers autonomously operate their businesses over a 90-day simulation, each seeking to maximize cumulative net income through communication and transactions while managing cash, inventory, and pricing. The evaluated model controls one coffee roaster; the remaining five firms are controlled by fixed reference agents. Across several recent open-weight and proprietary LLMs, all models outperform a passive baseline that takes no actions, and most achieve positive net income. Analysis of agent behavior reveals substantial differences in long-horizon economic interaction: higher-performing models communicate more actively with other firms, whereas Claude Haiku 4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans. We release our code and agent trajectories to support future research.