Modeling real-world biological multi-agents is a fundamental problem in various scientific and engineering fields. Reinforcement learning (RL) is a powerful framework for generating flexible and diverse behaviors in cyberspace; however, when modeling real-world biological multi-agents, there is a domain gap between behaviors in the source (i.e., real-world data) and the target (i.e., cyberspace for RL), and the source environment parameters are usually unknown. In this paper, we propose a method for adaptive action supervision in RL from real-world demonstrations in multi-agent scenarios. We combine RL and supervised learning by selecting, during RL, the demonstration actions with the minimum dynamic time warping (DTW) distance, which lets the model use information about the unknown source dynamics. This approach can be easily applied to many existing neural network architectures and provides an RL model that balances reproducibility as imitation with the generalization ability to obtain rewards in cyberspace. In experiments on chase-and-escape and football tasks, in which the dynamics differ between the unknown source and the target environments, we show that our approach achieves a balance between reproducibility and generalization ability compared with the baselines. In particular, we use tracking data of professional football players as expert demonstrations in the football task and show successful performance even though the gap between behaviors in the source and target environments is larger than in the chase-and-escape task.
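As a rough illustration of the action-selection idea described above, the following minimal sketch computes a textbook DTW distance between the agent's trajectory so far and each demonstration, then returns the action from the closest demonstration. The function names, the Euclidean local cost, the prefix truncation of each demonstration, and the choice of which time step's action to return are all assumptions made for this example; the abstract does not specify these details, so this should not be read as the authors' implementation.

```python
import numpy as np


def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) DTW distance between two state sequences.

    Assumption: each sequence is a list of state vectors and the local cost
    is the Euclidean distance between states.
    """
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def select_demo_action(agent_states, demo_states, demo_actions):
    """Pick the demonstration closest to the agent's trajectory under DTW.

    Hypothetical helper: compares the agent's states so far with the same-length
    prefix of every demonstration and returns (demo index, action, distance).
    """
    t = len(agent_states)
    dists = [dtw_distance(agent_states, states[:t]) for states in demo_states]
    k = int(np.argmin(dists))
    # Use the demonstration's action at the current step (clipped to its length).
    step = min(t, len(demo_actions[k])) - 1
    return k, demo_actions[k][step], dists[k]


if __name__ == "__main__":
    agent = [[0, 0], [1, 0], [2, 0]]
    demos = [
        [[0, 0], [0, 1], [0, 2]],  # moves along y
        [[0, 0], [1, 0], [2, 0]],  # moves along x, like the agent
    ]
    actions = [[2, 2, 2], [1, 1, 1]]  # illustrative discrete action ids
    print(select_demo_action(agent, demos, actions))
```

In a combined objective, the selected action could serve as the target of a supervised (e.g., cross-entropy) term added to the usual RL loss; the weighting between the two terms is a further design choice not fixed by the abstract.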