Deceptive Reinforcement Learning in Model-Free Domains

This paper investigates deceptive reinforcement learning for privacy preservation in model-free and continuous action space domains. In reinforcement learning, the reward function defines the agent's objective. In adversarial scenarios, an agent may need to both maximise rewards and keep its reward function private from observers. Recent research presented the ambiguity model (AM), which selects actions that are ambiguous over a set of possible reward functions, via pre-trained $Q$-functions. Despite promising results in model-based domains, our investigation shows that AM is ineffective in model-free domains due to misdirected state space exploration. It is also inefficient to train and inapplicable in continuous action space domains. We propose the deceptive exploration ambiguity model (DEAM), which learns using the deceptive policy during training, leading to targeted exploration of the state space. DEAM is also applicable in continuous action spaces. We evaluate DEAM in discrete and continuous action space path planning environments. DEAM achieves similar performance to an optimal model-based version of AM and outperforms a model-free version of AM in terms of path cost, deceptiveness and training efficiency. These results extend to the continuous domain.
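To make the idea of "actions that are ambiguous over a set of possible reward functions" more concrete, the following is a minimal illustrative sketch in Python. It is not the paper's AM or DEAM algorithm, which the abstract does not specify. It assumes the following, none of which come from the source: a hypothetical observer who models the agent as Boltzmann-rational under each candidate reward function, entropy of the observer's posterior as the ambiguity measure, and a `slack` parameter that limits how much true-reward $Q$-value the agent is willing to give up. The function name `ambiguous_action` and all parameters are invented for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def ambiguous_action(q_values, true_idx, slack=1.0, temperature=1.0):
    """Pick a discrete action that keeps the true reward function ambiguous.

    q_values: array of shape (n_rewards, n_actions) holding, for the current
              state, one row of Q-values per candidate reward function
              (assumed to come from pre-trained Q-functions).
    true_idx: row index of the agent's real reward function.
    slack:    hypothetical tolerance; only actions whose true Q-value is
              within `slack` of the best true Q-value are considered.
    """
    q = np.asarray(q_values, dtype=float)

    # Assumed observer model: under each candidate reward function the agent
    # follows a Boltzmann policy over that function's Q-values.
    likelihood = np.stack([softmax(row / temperature) for row in q])

    # Observer's posterior over reward functions after seeing each action,
    # starting from a uniform prior (one column per action).
    posterior = likelihood / likelihood.sum(axis=0, keepdims=True)

    # Entropy of that posterior: higher means the action reveals less.
    entropy = -(posterior * np.log(posterior + 1e-12)).sum(axis=0)

    # Rule out actions that cost too much under the true reward function.
    allowed = q[true_idx] >= q[true_idx].max() - slack
    entropy[~allowed] = -np.inf

    return int(np.argmax(entropy))


if __name__ == "__main__":
    # Two candidate reward functions, three actions.
    q = [[1.0, 0.9, 0.0],   # Q-values under the true reward function
         [0.0, 0.9, 1.0]]   # Q-values under a decoy reward function
    print(ambiguous_action(q, true_idx=0, slack=0.5))
    print(ambiguous_action(q, true_idx=0, slack=0.0))
```

In this toy example the middle action is equally good under both candidate reward functions, so it reveals nothing to the assumed observer and is chosen when some slack is allowed; with zero slack the agent falls back to its greedy action. The sketch only covers discrete actions and fixed $Q$-values; it says nothing about how exploration is directed during training, which is the problem the abstract attributes to AM in model-free domains.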

