We consider the problem of fitting a reinforcement learning (RL) model to given behavioral data in a multi-armed bandit environment. Such models have received much attention in recent years for characterizing human and animal decision-making behavior. We provide a generic mathematical optimization formulation of the fitting problem for a wide range of RL models that appear frequently in scientific research applications, followed by a detailed theoretical analysis of its convexity properties. Based on these theoretical results, we introduce a novel solution method for fitting RL models that relies on convex relaxation and optimization. We evaluate our method in several simulated and real-world bandit environments and compare it against benchmark methods from the literature. Numerical results indicate that our method achieves performance comparable to state-of-the-art methods while significantly reducing computation time. We also provide an open-source Python package implementing the proposed method, so that researchers can apply it directly to their own datasets without prior knowledge of convex optimization.
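To make the fitting problem concrete, the sketch below writes out one common instance of it: maximum-likelihood fitting of a Q-learning model with a softmax choice rule to a sequence of observed actions and rewards. It is only an illustration of the kind of objective being fitted, not the convex-relaxation method proposed here and not the API of the accompanying package. The function names (`negative_log_likelihood`, `fit_grid`), the zero initialization of the action values, and the grid search are all assumptions made for this example.

```python
import numpy as np

def negative_log_likelihood(alpha, beta, actions, rewards, n_arms):
    """Negative log-likelihood of observed choices under Q-learning + softmax.

    alpha: learning rate in [0, 1]; beta: inverse temperature (>= 0).
    """
    q = np.zeros(n_arms)  # action values, initialized at zero (assumption)
    nll = 0.0
    for a, r in zip(actions, rewards):
        logits = beta * q
        # Stable log-softmax via the log-sum-exp trick.
        m = logits.max()
        log_probs = logits - (m + np.log(np.sum(np.exp(logits - m))))
        nll -= log_probs[a]
        # Delta-rule update of the chosen arm only.
        q[a] += alpha * (r - q[a])
    return nll

def fit_grid(actions, rewards, n_arms, alphas, betas):
    """Brute-force baseline: pick the (alpha, beta) pair with the lowest NLL."""
    best = min(
        (negative_log_likelihood(a, b, actions, rewards, n_arms), a, b)
        for a in alphas
        for b in betas
    )
    return best[1], best[2], best[0]

if __name__ == "__main__":
    # Simulate a two-armed bandit agent, then try to recover its parameters.
    rng = np.random.default_rng(0)
    true_alpha, true_beta = 0.3, 4.0
    reward_probs = np.array([0.8, 0.2])
    q = np.zeros(2)
    actions, rewards = [], []
    for _ in range(500):
        p = np.exp(true_beta * q - np.max(true_beta * q))
        p /= p.sum()
        a = rng.choice(2, p=p)
        r = float(rng.random() < reward_probs[a])
        q[a] += true_alpha * (r - q[a])
        actions.append(a)
        rewards.append(r)
    print(fit_grid(actions, rewards, 2,
                   np.linspace(0.05, 0.95, 19), np.linspace(0.5, 8.0, 16)))
```

The objective above is generally non-convex in the pair (alpha, beta), which is why generic baselines fall back on grid search or multi-start local optimization, both of which get expensive as the number of parameters grows.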