This paper investigates deceptive reinforcement learning for privacy preservation in model-free and continuous action space domains. In reinforcement learning, the reward function defines the agent's objective. In adversarial scenarios, an agent may need to both maximise rewards and keep its reward function private from observers. Recent research presented the ambiguity model (AM), which selects actions that are ambiguous over a set of possible reward functions, via pre-trained $Q$-functions. Despite promising results in model-based domains, our investigation shows that AM is ineffective in model-free domains due to misdirected state space exploration. It is also inefficient to train and inapplicable in continuous action space domains. We propose the deceptive exploration ambiguity model (DEAM), which learns using the deceptive policy during training, leading to targeted exploration of the state space. DEAM is also applicable in continuous action spaces. We evaluate DEAM in discrete and continuous action space path planning environments. DEAM achieves similar performance to an optimal model-based version of AM and outperforms a model-free version of AM in terms of path cost, deceptiveness and training efficiency. These results extend to the continuous domain.
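To make the idea of selecting actions that are "ambiguous over a set of possible reward functions" concrete, the following minimal sketch picks, in a single state, the action that leaves a hypothetical observer most uncertain about which candidate reward function is the real one. It is not the formulation used by AM or DEAM. The observer model (a softmax over each candidate's advantage), the entropy-based ambiguity measure, the `tolerance` threshold on the true reward's $Q$-value, and all function names are assumptions introduced for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(ps):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in ps if p > 0.0)


def ambiguous_action(q_values, true_idx, tolerance):
    """Pick the most ambiguous action in one state (illustrative only).

    q_values[i][a] is a pre-trained Q-value of action a under candidate
    reward function i. true_idx marks the agent's real reward function.
    tolerance (assumed parameter) bounds how much true Q-value may be lost.
    """
    n_actions = len(q_values[0])
    best_true = max(q_values[true_idx])

    # Keep only actions that are near-optimal for the real reward function.
    allowed = [a for a in range(n_actions)
               if q_values[true_idx][a] >= best_true - tolerance]

    # Assumed observer model: belief in candidate i after seeing action a
    # grows with how close a is to optimal under candidate i.
    def belief(a):
        return softmax([q[a] - max(q) for q in q_values])

    # Higher entropy of the observer's belief means a more ambiguous action.
    return max(allowed, key=lambda a: entropy(belief(a)))


# Two candidate reward functions, three actions; candidate 0 is the real one.
q = [[1.0, 0.8, 0.0],
     [0.0, 0.8, 1.0]]
choice = ambiguous_action(q, true_idx=0, tolerance=0.5)
```

In this toy case, action 1 is equally good under both candidates, so the assumed observer cannot tell them apart, and it costs only 0.2 in true $Q$-value relative to the greedy action. Setting `tolerance` to zero recovers the greedy, fully revealing choice.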