Modeling real-world biological multi-agents is a fundamental problem in various scientific and engineering fields. Reinforcement learning (RL) is a powerful framework for generating flexible and diverse behaviors in cyberspace; however, when modeling real-world biological multi-agents, there is a domain gap between behaviors in the source (i.e., real-world data) and the target (i.e., cyberspace for RL), and the source environment parameters are usually unknown. In this paper, we propose a method for adaptive action supervision in RL from real-world demonstrations in multi-agent scenarios. We combine RL and supervised learning by selecting the demonstration actions used in RL according to the minimum dynamic time warping (DTW) distance, which lets us exploit information about the unknown source dynamics. This approach can be easily applied to many existing neural network architectures and provides an RL model that balances reproducibility as imitation with the generalization ability needed to obtain rewards in cyberspace. In experiments on chase-and-escape and football tasks, in which the dynamics differ between the unknown source and the target environments, we show that our approach achieved a balance between reproducibility and generalization ability relative to the baselines. In particular, we used the tracking data of professional football players as expert demonstrations in the football task and show that the approach performs successfully even though the gap between behaviors in the source and target environments is larger than in the chase-and-escape task.
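The selection step described above, choosing demonstration actions by minimum DTW distance, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `dtw_distance` and `select_demo_action`, the use of Euclidean distance between states, and the choice to compare the agent's trajectory so far against the prefix of each demonstration of the same length are all assumptions made for illustration.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic dynamic time warping cost between two trajectories.

    a, b: array-likes of shape (T_a, D) and (T_b, D).
    Uses Euclidean distance between states (an assumption of this sketch).
    """
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def select_demo_action(agent_states, demo_states, demo_actions):
    """Return (index, action, distance) for the DTW-closest demonstration.

    Hypothetical helper: compares the agent's trajectory so far with the
    equally long prefix of each demonstration, then returns that
    demonstration's action at the current time step as a supervision target.
    """
    t = len(agent_states)
    dists = [dtw_distance(agent_states, states[:t]) for states in demo_states]
    best = int(np.argmin(dists))
    step = min(t, len(demo_actions[best])) - 1
    return best, demo_actions[best][step], dists[best]


if __name__ == "__main__":
    # Toy data: two demonstrations with 2-D states and discrete actions.
    agent = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
    demos = [
        np.array([[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]),
        np.array([[0.0, 0.0], [1.0, 0.1], [2.0, 0.0], [3.0, 0.0]]),
    ]
    actions = [[0, 0, 0, 0], [1, 2, 3, 1]]
    print(select_demo_action(agent, demos, actions))
```

In a combined objective, the selected action would typically serve as the target of a supervised loss term added to the RL loss; how the two terms are weighted is not specified in the text above and is left out of this sketch.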