Modeling real-world biological multi-agents is a fundamental problem in various scientific and engineering fields. Reinforcement learning (RL) is a powerful framework for generating flexible and diverse behaviors in cyberspace; however, when modeling real-world biological multi-agents, there is a domain gap between behaviors in the source (i.e., real-world data) and the target (i.e., cyberspace for RL), and the source environment parameters are usually unknown. In this paper, we propose a method for adaptive action supervision in RL from real-world demonstrations in multi-agent scenarios. Our approach combines RL and supervised learning: during RL, it selects demonstration actions based on the minimum dynamic time warping (DTW) distance, so as to exploit information about the unknown source dynamics. This approach can be easily applied to many existing neural network architectures and provides an RL model that balances reproducibility, in the sense of imitation, with the generalization ability needed to obtain rewards in cyberspace. In experiments on chase-and-escape and football tasks, where the dynamics differ between the unknown source and the target environments, we show that our approach achieves a balance between reproducibility and generalization ability compared with the baselines. In particular, we use the tracking data of professional football players as expert demonstrations in the football task and show successful performance despite a larger gap between source and target behaviors than in the chase-and-escape task.
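The demonstration-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`dtw_distance`, `select_demo_action`), the Euclidean per-step cost, and the choice to supervise with the nearest demonstration's action at the agent's current time step are all assumptions made for illustration.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two state sequences.

    a, b: array-likes of shape (T, D). Per-step cost is Euclidean (an assumption).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def select_demo_action(agent_states, demo_states, demo_actions):
    """Pick the demonstration with minimum DTW distance to the agent trajectory.

    Returns (demo index, supervising action, DTW distance). The action is taken
    at the agent's current time step, clipped to the demo length (hypothetical rule).
    """
    distances = [dtw_distance(agent_states, states) for states in demo_states]
    k = int(np.argmin(distances))
    t = min(len(agent_states), len(demo_actions[k])) - 1
    return k, demo_actions[k][t], distances[k]


# Toy usage: two demonstrations with discrete action labels (hypothetical data).
agent_states = [[0, 0], [1, 0], [2, 0]]
demo_states = [
    [[0, 0], [1, 0], [2, 0], [3, 0]],  # moves along x, like the agent
    [[0, 0], [0, 1], [0, 2]],          # moves along y
]
demo_actions = [[1, 1, 1, 1], [2, 2, 2]]
k, action, dist = select_demo_action(agent_states, demo_states, demo_actions)
print(k, action, dist)
```

In a combined objective, the selected action could serve as the target of a supervised term added to the RL loss, for example `rl_loss + lam * supervised_loss` with a hypothetical weight `lam`; the weighting actually used in the paper is not specified in this abstract.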