Large language model (LLM)-based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language-mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi-agent buyer-seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers hold private constraints and product-dependent valuations and must reach agreements through multi-round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks, ranging from bilateral bargaining to many-to-many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state-of-the-art proprietary and open-weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long-horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language-based market interaction. Code and dataset are available at https://github.com/SafeRL-Lab/AgenticPay.