Scaling Multi-Agent Environment Co-Design with Diffusion Models

The agent-environment co-design paradigm jointly optimises agent policies and environment configurations in search of improved system performance. With application domains ranging from warehouse logistics to wind farm management, co-design promises to fundamentally change how we deploy multi-agent systems. However, current co-design methods struggle to scale: they collapse under high-dimensional environment design spaces and are sample-inefficient when chasing the moving targets inherent to joint optimisation. We address these challenges by developing Diffusion Co-Design (DiCoDe), a scalable and sample-efficient co-design framework that pushes co-design towards practically relevant settings. DiCoDe incorporates two core innovations. First, we introduce Projected Universal Guidance (PUG), a sampling technique that enables DiCoDe to explore a distribution of reward-maximising environments while satisfying hard constraints such as spatial separation between obstacles. Second, we devise a critic distillation mechanism to share knowledge from the reinforcement learning critic, ensuring that the guided diffusion model adapts to evolving agent policies using a dense and up-to-date learning signal. Together, these improvements lead to superior environment-policy pairs when validated on challenging multi-agent environment co-design benchmarks, including warehouse automation, multi-agent pathfinding and wind farm optimisation. Our method consistently exceeds the state of the art, achieving, for example, 39% higher rewards in the warehouse setting with 66% fewer simulation samples. This sets a new standard in agent-environment co-design and is a stepping stone towards reaping the rewards of co-design in real-world domains.
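To make the idea of reward-guided sampling under a hard constraint concrete, the following is a minimal toy sketch, not the PUG algorithm itself. It assumes 2D obstacle positions, a hypothetical reward gradient (`toy_reward_grad`) that pulls obstacles towards the origin, and a best-effort projection (`project_min_separation`) that pushes apart any pair closer than a minimum distance. The learned diffusion denoiser is replaced here by plain noise-annealed gradient ascent, and every name and parameter is an illustrative assumption.

```python
import math
import random


def project_min_separation(points, min_dist, iters=200):
    """Best-effort projection: push apart any pair closer than min_dist.

    Hypothetical stand-in for a hard spatial-separation constraint.
    """
    pts = [list(p) for p in points]
    for _ in range(iters):
        moved = False
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                dx = pts[j][0] - pts[i][0]
                dy = pts[j][1] - pts[i][1]
                d = math.hypot(dx, dy)
                gap = min_dist - d
                if gap <= 1e-9:
                    continue  # this pair already satisfies the constraint
                if d < 1e-12:
                    ux, uy = 1.0, 0.0  # coincident points: arbitrary direction
                else:
                    ux, uy = dx / d, dy / d
                # Move both points half the missing distance along the pair axis.
                pts[i][0] -= 0.5 * gap * ux
                pts[i][1] -= 0.5 * gap * uy
                pts[j][0] += 0.5 * gap * ux
                pts[j][1] += 0.5 * gap * uy
                moved = True
        if not moved:
            break
    return pts


def toy_reward_grad(points):
    """Gradient of the assumed toy reward -0.5 * sum(|p|^2): pulls to the origin."""
    return [[-x, -y] for x, y in points]


def guided_sample(reward_grad, n_points, min_dist, steps=50, guidance=0.1, seed=0):
    """Noise-annealed gradient ascent with a projection after every step.

    A real guided diffusion sampler would also apply a learned denoiser at
    each step; that part is deliberately omitted in this sketch.
    """
    rng = random.Random(seed)
    pts = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n_points)]
    for t in range(steps, 0, -1):
        sigma = 0.1 * t / steps  # noise scale shrinks as sampling proceeds
        grads = reward_grad(pts)
        pts = [
            [x + guidance * gx + sigma * rng.gauss(0.0, 1.0),
             y + guidance * gy + sigma * rng.gauss(0.0, 1.0)]
            for (x, y), (gx, gy) in zip(pts, grads)
        ]
        pts = project_min_separation(pts, min_dist)  # re-impose the constraint
    return pts


if __name__ == "__main__":
    layout = guided_sample(toy_reward_grad, n_points=4, min_dist=1.0)
    for x, y in layout:
        print(f"{x:.3f} {y:.3f}")
```

Projecting after every step, rather than only at the end, keeps intermediate samples close to the feasible set, so the reward gradient is always evaluated on near-valid layouts. With more than two obstacles the pairwise projection is iterative and only approximate within the iteration budget.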
