In multi-timescale multi-agent reinforcement learning (MARL), agents interact across different timescales. In general, policies for time-dependent behaviors, such as those induced by multiple timescales, are non-stationary. Learning non-stationary policies is challenging and typically requires sophisticated or inefficient algorithms. Motivated by the prevalence of this control problem in real-world complex systems, we introduce a simple framework for learning non-stationary policies for multi-timescale MARL. Our approach uses available information about agent timescales to define a periodic time encoding. Specifically, we show theoretically that the effects of non-stationarity introduced by multiple timescales can be learned by a periodic multi-agent policy. To learn such policies, we propose a policy gradient algorithm that parameterizes the actor and critic with phase-functioned neural networks, which provide an inductive bias for periodicity. We validate the framework's ability to learn multi-timescale policies effectively on a gridworld environment and a building energy management environment.
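To make the periodic time encoding and the phase-functioned parameterization concrete, the following is a minimal sketch and not the authors' implementation. It assumes each agent's timescale is a known positive integer number of environment steps, so that the joint period can be taken as their least common multiple. The names `joint_period`, `phase`, `time_encoding`, `blend_weights`, and `PhaseFunctionedLinear` are hypothetical, and the cyclic Catmull-Rom spline over four control points is one common way to make layer weights a smooth periodic function of phase, chosen here as an assumption.

```python
import math
from functools import reduce

import numpy as np


def joint_period(timescales):
    """Least common multiple of the agents' integer timescales (assumed known)."""
    return reduce(lambda a, b: a * b // math.gcd(a, b), timescales)


def phase(t, period):
    """Map an integer timestep to a phase in [0, 1)."""
    return (t % period) / period


def time_encoding(t, period):
    """Periodic (sin, cos) encoding of the timestep; repeats every `period` steps."""
    angle = 2.0 * math.pi * phase(t, period)
    return np.array([math.sin(angle), math.cos(angle)])


def blend_weights(control, p):
    """Cyclic Catmull-Rom interpolation of control points stacked along axis 0."""
    k = control.shape[0]
    x = p * k
    i1 = int(math.floor(x)) % k          # control point at or just before p
    w = x - math.floor(x)                # fractional position between i1 and i1 + 1
    a0, a1, a2, a3 = (control[(i1 + d) % k] for d in (-1, 0, 1, 2))
    return (
        a1
        + w * (0.5 * a2 - 0.5 * a0)
        + w**2 * (a0 - 2.5 * a1 + 2.0 * a2 - 0.5 * a3)
        + w**3 * (1.5 * a1 - 1.5 * a2 + 0.5 * a3 - 0.5 * a0)
    )


class PhaseFunctionedLinear:
    """Linear layer whose weights and biases are a periodic function of phase."""

    def __init__(self, in_dim, out_dim, num_control=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, size=(num_control, out_dim, in_dim))
        self.b = rng.normal(0.0, 0.1, size=(num_control, out_dim))

    def __call__(self, x, p):
        W = blend_weights(self.W, p)
        b = blend_weights(self.b, p)
        return W @ x + b


# Hypothetical example: three agents acting every 2, 3, and 4 steps.
period = joint_period([2, 3, 4])
layer = PhaseFunctionedLinear(in_dim=3, out_dim=2)
obs = np.array([1.0, -0.5, 0.25])
out_now = layer(obs, phase(5, period))
out_next_cycle = layer(obs, phase(5 + period, period))
```

Because the layer depends on time only through the phase, its output for a fixed observation repeats exactly after one joint period, which is the inductive bias for periodicity that the abstract refers to. In an actor-critic setup, layers of this kind would stand in for the ordinary linear layers of both networks.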