The agent-environment co-design paradigm jointly optimises agent policies and environment configurations in search of improved system performance. With application domains ranging from warehouse logistics to wind farm management, co-design promises to fundamentally change how we deploy multi-agent systems. However, current co-design methods struggle to scale. They collapse under high-dimensional environment design spaces and suffer from sample inefficiency when addressing the moving targets inherent in joint optimisation. We address these challenges by developing Diffusion Co-Design (DiCoDe), a scalable and sample-efficient co-design framework that pushes co-design towards practically relevant settings. DiCoDe incorporates two core innovations. First, we introduce Projected Universal Guidance (PUG), a sampling technique that enables DiCoDe to explore a distribution of reward-maximising environments while satisfying hard constraints such as spatial separation between obstacles. Second, we devise a critic distillation mechanism to share knowledge from the reinforcement learning critic, ensuring that the guided diffusion model adapts to evolving agent policies using a dense and up-to-date learning signal. Together, these improvements lead to superior environment-policy pairs when validated on challenging multi-agent environment co-design benchmarks including warehouse automation, multi-agent pathfinding and wind farm optimisation. Our method consistently exceeds the state-of-the-art, achieving, for example, 39% higher rewards in the warehouse setting with 66% fewer simulation samples. This sets a new standard in agent-environment co-design, and is a stepping stone towards reaping the rewards of co-design in real-world domains.
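To make the idea of guided sampling under hard constraints more concrete, the following is a minimal toy sketch and not the authors' PUG implementation. It assumes a hypothetical quadratic value function in place of a learned critic, annealed Gaussian noise in place of a trained diffusion model, and a simple pairwise push-apart routine as an approximate projection onto a minimum-separation constraint. All function names and parameters are illustrative assumptions.

```python
import numpy as np


def toy_value_gradient(x, target):
    """Gradient of a hypothetical value V(x) = -||x - target||^2 (stand-in for a critic)."""
    return -2.0 * (x - target)


def project_min_separation(x, min_dist, lo, hi, iters=200):
    """Approximate projection onto {points in [lo, hi]^2 with pairwise distance >= min_dist}.

    Repeatedly clips to the box and pushes apart any pair that is too close.
    This is a heuristic, assumed here for illustration only.
    """
    x = x.copy()
    n = len(x)
    for _ in range(iters):
        x = np.clip(x, lo, hi)
        moved = False
        for i in range(n):
            for j in range(i + 1, n):
                d = x[j] - x[i]
                dist = np.linalg.norm(d)
                if dist < min_dist - 1e-9:
                    # Use an arbitrary direction if the two points coincide.
                    direction = d / dist if dist > 1e-9 else np.array([1.0, 0.0])
                    shift = 0.5 * (min_dist - dist) * direction
                    x[i] -= shift
                    x[j] += shift
                    moved = True
        if not moved:
            break
    return x


def guided_projected_sample(n_obstacles=4, steps=50, guidance_scale=0.05,
                            min_dist=1.0, lo=0.0, hi=10.0, seed=0):
    """Noisy, gradient-guided sampling of obstacle positions with a projection after each step."""
    rng = np.random.default_rng(seed)
    target = np.full((n_obstacles, 2), 5.0)  # hypothetical high-value layout centre
    x = rng.uniform(lo, hi, size=(n_obstacles, 2))
    for t in range(steps, 0, -1):
        noise_scale = 0.5 * t / steps  # annealed noise, stand-in for a reverse-diffusion step
        x = x + guidance_scale * toy_value_gradient(x, target)  # guidance towards higher value
        x = x + noise_scale * rng.normal(size=x.shape)          # keeps samples diverse
        x = project_min_separation(x, min_dist, lo, hi)         # enforce the hard constraint
    return x


if __name__ == "__main__":
    print(guided_projected_sample())
```

Projecting after every step, rather than only at the end, keeps intermediate samples feasible, so the guidance gradient is always evaluated on layouts that respect the separation constraint. Whether DiCoDe applies its projection in this way is not stated in the abstract.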