We present a modular approach to \emph{reinforcement learning} (RL) in environments consisting of simpler components evolving in parallel. Treated monolithically, such a modular environment may be too large to learn, or may require a centralized controller whose communication between components is unrealizable. Our approach is based on the assume-guarantee paradigm, in which the optimal controller for each component is synthesized in isolation by making \emph{assumptions} about the behavior of neighboring components and providing \emph{guarantees} about its own behavior. We express these \emph{assume-guarantee contracts} as regular languages and provide automatic translations to scalar rewards for use in RL. By combining the local probabilities of satisfaction of the individual components, we obtain a lower bound on the probability of satisfaction of the complete system. By solving a Markov game for each component, RL can produce a controller for each component that maximizes this lower bound. Each controller uses the information available to it through communication, observations, and any knowledge of a coarse model of the other agents. We experimentally demonstrate the efficiency of the proposed approach on a variety of case studies.