This paper studies predictability problems, in which an agent must choose its strategy so as to optimize the predictions that an external observer could make. We address these problems while accounting for uncertainty about both the environment dynamics and the observed agent's policy. To that end, we assume that the observer (1) seeks to predict the agent's future action or state at each time step, and (2) models the agent using a stochastic policy computed from a known underlying problem, and we leverage the framework of observer-aware Markov decision processes (OAMDPs). We propose action- and state-predictability performance criteria through reward functions built on the observer's belief about the agent's policy; show that the induced predictable OAMDPs can be represented as goal-oriented or discounted MDPs; and analyze the properties of the proposed reward functions both theoretically and empirically on two types of grid-world problems.