Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes the trust problem concrete: an assistant must coordinate with other assistants while deciding what to reveal about the person it represents. We introduce CalBench, a controlled benchmark for multi-agent calendar scheduling under private information. In each task, $N$ agents manage separate private calendars and schedule a stream of $M$ incoming meetings while minimizing disruption costs. Because no agent can inspect another agent's calendar, success requires language-mediated coordination rather than centralized planning. CalBench generates solvable scenarios with CP-SAT oracle solutions and decentralized non-LLM reference protocols, enabling evaluation of task success, excess cost, communication efficiency, burden fairness, and privacy leakage under matched information constraints. Across seven model families, we find that completion alone misses important failures: agents leave avoidable cost on the table, communication volume does not predict lower regret, and privacy-preserving silence can deprive teammates of cost information needed for fair burden allocation. CalBench provides a reproducible testbed for studying whether autonomous assistants can coordinate on behalf of users before deployment at scale.
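To make the setup concrete, the following is a minimal sketch of a toy instance and of how excess cost over an oracle could be computed. Everything in it is an illustrative assumption rather than CalBench's actual definition: the additive per-slot cost table `COST`, the names `MEETINGS`, `oracle`, `excess_cost`, and `per_agent_cost`, and the use of exhaustive search in place of the CP-SAT oracle that the benchmark uses.

```python
from itertools import product

# Hypothetical toy instance (not taken from CalBench): COST[agent][slot] is the
# disruption cost that agent pays if a meeting lands in that slot.
COST = {
    "alice": [0, 3, 1],
    "bob":   [2, 0, 1],
    "carol": [1, 1, 0],
}
# Each meeting lists its required participants and needs exactly one slot.
MEETINGS = [("alice", "bob"), ("bob", "carol")]
NUM_SLOTS = 3


def is_feasible(assignment):
    """An agent cannot attend two meetings in the same slot."""
    for i in range(len(MEETINGS)):
        for j in range(i + 1, len(MEETINGS)):
            shared = set(MEETINGS[i]) & set(MEETINGS[j])
            if assignment[i] == assignment[j] and shared:
                return False
    return True


def per_agent_cost(assignment):
    """Disruption cost carried by each agent under a slot assignment."""
    burden = {agent: 0 for agent in COST}
    for participants, slot in zip(MEETINGS, assignment):
        for agent in participants:
            burden[agent] += COST[agent][slot]
    return burden


def total_cost(assignment):
    return sum(per_agent_cost(assignment).values())


def oracle():
    """Centralized optimum by brute force; a stand-in for a CP-SAT solve."""
    candidates = [
        a for a in product(range(NUM_SLOTS), repeat=len(MEETINGS))
        if is_feasible(a)
    ]
    return min(candidates, key=total_cost)


def excess_cost(assignment):
    """Cost above the oracle optimum for a schedule the agents agreed on."""
    return total_cost(assignment) - total_cost(oracle())
```

In this sketch the oracle sees every row of `COST`, whereas each agent in the benchmark setting would see only its own row and would have to learn the rest through messages. A schedule can therefore be feasible and complete while still having positive `excess_cost`, and `per_agent_cost` shows how unevenly that cost is spread across agents.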