Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes the trust problem concrete: an assistant must coordinate with other assistants while deciding what to reveal about the person it represents. We introduce CalBench, a controlled benchmark for multi-agent calendar scheduling under private information. In each task, $N$ agents manage separate private calendars and schedule a stream of $M$ incoming meetings while minimizing disruption costs. Because no agent can inspect another agent's calendar, success requires language-mediated coordination rather than centralized planning. CalBench generates solvable scenarios with CP-SAT oracle solutions and decentralized non-LLM reference protocols, enabling evaluation of task success, excess cost, communication efficiency, burden fairness, and privacy leakage under matched information constraints. Across seven model families, we find that completion alone misses important failures: agents leave avoidable cost on the table, communication volume does not predict lower regret, and privacy-preserving silence can deprive teammates of cost information needed for fair burden allocation. CalBench provides a reproducible testbed for studying whether autonomous assistants can coordinate on behalf of users before deployment at scale.
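To make the "excess cost" metric concrete, the following minimal sketch computes the gap between a proposed schedule and an oracle optimum on a toy instance. It is an illustration under stated assumptions, not CalBench's implementation: the cost model (each agent pays a per-slot disruption cost), the one-slot-per-meeting simplification, and all function names are hypothetical. An exhaustive search stands in for the CP-SAT oracle.

```python
from itertools import product

# Hypothetical toy model (not CalBench's actual format):
# calendars[a][slot] = disruption cost agent a pays if a meeting lands in that slot;
# a slot missing from the dict is free (cost 0).
# meetings[i] = tuple of attendee indices; each meeting occupies exactly one slot.

def is_feasible(meetings, assignment):
    """An attendee cannot be in two meetings in the same slot."""
    seen = set()
    for attendees, slot in zip(meetings, assignment):
        for agent in attendees:
            if (agent, slot) in seen:
                return False
            seen.add((agent, slot))
    return True

def schedule_cost(calendars, meetings, assignment):
    """Total disruption cost summed over all attendees of all meetings."""
    return sum(
        calendars[agent].get(slot, 0)
        for attendees, slot in zip(meetings, assignment)
        for agent in attendees
    )

def oracle_cost(calendars, meetings, num_slots):
    """Exhaustive-search optimum; a stand-in for a CP-SAT oracle on tiny instances."""
    feasible_costs = [
        schedule_cost(calendars, meetings, assignment)
        for assignment in product(range(num_slots), repeat=len(meetings))
        if is_feasible(meetings, assignment)
    ]
    return min(feasible_costs)

def excess_cost(calendars, meetings, assignment, num_slots):
    """Cost of the agents' schedule minus the oracle optimum (0 means optimal)."""
    return schedule_cost(calendars, meetings, assignment) - oracle_cost(
        calendars, meetings, num_slots
    )

# Three agents, three slots, two meetings that share agent 1.
calendars = [{0: 5, 1: 1}, {0: 2}, {1: 4}]
meetings = [(0, 1), (1, 2)]
agent_assignment = (2, 0)  # meeting 0 in slot 2, meeting 1 in slot 0

print(oracle_cost(calendars, meetings, 3))                     # 1
print(excess_cost(calendars, meetings, agent_assignment, 3))   # 1
```

In this toy instance the agents complete both meetings, yet their schedule costs 2 against an optimum of 1, which is the kind of avoidable cost that a completion-only metric would not register.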