In an adversarial environment, a hostile player performing a task may behave like a non-hostile one in order not to reveal its identity to an opponent. To model such a scenario, we define identity concealment games: zero-sum stochastic reachability games with a zero-sum objective of identity concealment. To measure the identity concealment of the player, we introduce the notion of an average player. The average player's policy represents the expected behavior of a non-hostile player. We show that there exists an equilibrium policy pair for every identity concealment game and give the optimality equations to synthesize an equilibrium policy pair. If the player's opponent follows a non-equilibrium policy, the player can hide its identity better. For this reason, we study how the hostile player may learn the opponent's policy. Since learning via exploration policies would quickly reveal the hostile player's identity to the opponent, we consider the problem of learning a near-optimal policy for the hostile player using the game runs collected under the average player's policy. Consequently, we propose an algorithm that provably learns a near-optimal policy and give an upper bound on the number of sample runs to be collected.