Asymmetric information stochastic games (AISGs) arise in many complex socio-technical systems, such as cyber-physical systems and IT infrastructures. Existing computational methods for AISGs are primarily offline and cannot adapt to equilibrium deviations. Moreover, current methods are limited to particular information structures in order to avoid belief hierarchies. To address these limitations, we propose conjectural online learning (COL), an online learning method for generic information structures in AISGs. COL uses a forecaster-actor-critic (FAC) architecture, in which subjective forecasts are used to conjecture the opponents' strategies within a lookahead horizon, and Bayesian learning is used to calibrate the conjectures. To adapt strategies to nonstationary environments based on information feedback, COL uses online rollout with cost function approximation (actor-critic). We prove that the conjectures produced by COL are asymptotically consistent with the information feedback in the sense of a relaxed Bayesian consistency. We also prove that the empirical strategy profile induced by COL converges to the Berk-Nash equilibrium, a solution concept that characterizes rationality under subjectivity. Experimental results from an intrusion response use case demonstrate COL's faster convergence over state-of-the-art reinforcement learning methods against nonstationary attacks.
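To make the forecaster-actor-critic loop concrete, the following is a minimal illustrative sketch, not the paper's algorithm: it assumes a finite set of conjectured opponent models, a Bayesian update of the belief over those conjectures from observed feedback (forecaster), one-step rollout against an approximate cost-to-go (actor), and a simple temporal-difference update of that approximation (critic). All names, dimensions, and the simulated environment are hypothetical.

import numpy as np

# Hypothetical sketch of a conjectural online learning loop (illustrative only).
rng = np.random.default_rng(0)

n_actions, n_obs, n_models = 3, 4, 2
# Conjectured opponent models: each gives P(observation | defender action).
models = rng.dirichlet(np.ones(n_obs), size=(n_models, n_actions))
belief = np.full(n_models, 1.0 / n_models)   # prior over the conjectures
cost = rng.random((n_actions, n_obs))        # stage cost c(a, o)
value = np.zeros(n_obs)                      # approximate cost-to-go J(o)
gamma, lr = 0.95, 0.1

for t in range(200):
    # Actor: one-step rollout, minimizing expected stage cost plus
    # discounted approximate cost-to-go under the current conjecture.
    pred = np.tensordot(belief, models, axes=1)       # P(o | a) under belief
    q = (pred * (cost + gamma * value)).sum(axis=1)   # Q(a)
    action = int(np.argmin(q))

    # Information feedback (here simulated from the first model as "truth").
    obs = rng.choice(n_obs, p=models[0, action])

    # Forecaster: Bayesian calibration of the conjectures from feedback.
    likelihood = models[:, action, obs]
    belief = belief * likelihood
    belief /= belief.sum()

    # Critic: temporal-difference update of the cost-to-go approximation.
    td_target = cost[action, obs] + gamma * value.min()
    value[obs] += lr * (td_target - value[obs])

In this toy version the belief concentrates on whichever conjectured model best explains the feedback, which mirrors the relaxed Bayesian-consistency property stated above; the paper's actual method operates under generic information structures rather than this simplified observation model.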