The optimization of urban energy systems is crucial for the advancement of sustainable and resilient smart cities, which are becoming increasingly complex with multiple decision-making units. Multi-Agent Reinforcement Learning (MARL) is a promising approach to addressing the resulting scalability and coordination concerns. This paper addresses the need for comprehensive and reliable benchmarking of MARL algorithms on energy management tasks. CityLearn is used as a case study environment because it realistically simulates urban energy systems, incorporates multiple storage systems, and utilizes renewable energy sources. Our work sets a new standard for evaluation by conducting a comparative study across multiple key performance indicators (KPIs), illuminating the key strengths and weaknesses of various algorithms and moving beyond traditional KPI averaging, which often masks critical insights. Our experiments use widely accepted baselines such as Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC), and encompass diverse training schemes, including Decentralized Training with Decentralized Execution (DTDE) and Centralized Training with Decentralized Execution (CTDE), as well as different neural network architectures. We also propose novel KPIs that tackle real-world implementation challenges such as individual building contribution and battery storage lifetime. Our findings show that DTDE consistently outperforms CTDE in both average and worst-case performance. Additionally, temporal dependency learning improved control on memory-dependent KPIs such as ramping and battery usage, contributing to more sustainable battery operation. Results also reveal robustness to agent or resource removal, highlighting both the resilience and decentralizability of the learned policies.