We present TickingCollabBench, a Minecraft-based multi-agent benchmark for a novel class of time-sensitive complementary collaboration tasks. Our benchmark reflects four core characteristics of real-world collaboration: agent heterogeneity, mandatory collaboration, dynamic environments, and strict real-time constraints with failure risks. To enable this, we develop the TickingCollab framework, which supports the generation of diverse dynamic environments and abstracts Minecraft's primitive APIs so that environment events can be composed through declarative YAML task specifications. Building on this, we design a feasibility-aware automated benchmark generation pipeline, in which an LLM drafts structurally diverse task configurations and a feasibility verifier filters out invalid ones using approximate constraints. Evaluations demonstrate that long latency and the inherent difficulty of coordinating under partial observability and agent heterogeneity cause LLMs to fail frequently in dynamic environments and to fall significantly short of a global-knowledge oracle.
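As a rough illustration of what filtering drafted task configurations with approximate constraints could look like, the following is a minimal sketch and not the TickingCollab implementation. The `TaskDraft` fields, the per-block walking cost, and the Manhattan-distance travel estimate are all assumptions introduced here for illustration; the actual verifier and task schema are not described in this abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDraft:
    """Hypothetical task draft; field names are illustrative, not the benchmark's schema."""
    name: str
    agent_pos: tuple[int, int, int]
    target_pos: tuple[int, int, int]
    deadline_ticks: int
    ticks_per_block: float = 5.0  # assumed walking cost, not a measured value


def estimated_travel_ticks(draft: TaskDraft) -> float:
    """Approximate travel time: Manhattan distance times an assumed per-block cost."""
    distance = sum(abs(a - b) for a, b in zip(draft.agent_pos, draft.target_pos))
    return distance * draft.ticks_per_block


def is_feasible(draft: TaskDraft) -> bool:
    """Keep a draft only if the estimated travel time fits within its deadline."""
    return estimated_travel_ticks(draft) <= draft.deadline_ticks


def filter_feasible(drafts: list[TaskDraft]) -> list[TaskDraft]:
    """Drop drafts that the approximate check marks as infeasible."""
    return [d for d in drafts if is_feasible(d)]


drafts = [
    TaskDraft("near", (0, 64, 0), (10, 64, 0), deadline_ticks=100),
    TaskDraft("far", (0, 64, 0), (100, 64, 0), deadline_ticks=100),
]
kept = [d.name for d in filter_feasible(drafts)]
```

In this sketch the check is only an estimate: it can reject tasks that a faster route would make solvable and accept tasks that obstacles make unsolvable, which is the trade-off of using approximate rather than exact constraints.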