We study the problem of controlling a partially observed Markov decision process (POMDP) to either aid or hinder the estimation of its state trajectory. We encode the estimation objectives via the smoother entropy, which is the conditional entropy of the state trajectory given measurements and controls. This contrasts with previous approaches, which resort to marginal (or instantaneous) state entropies due to tractability concerns. By establishing novel expressions for the smoother entropy in terms of the POMDP belief state, we show that, perhaps surprisingly, both minimising and maximising the smoother entropy in POMDPs can be reformulated as belief-state Markov decision processes with concave cost and value functions. These reformulations render the smoother entropy a tractable optimisation objective, with structural properties amenable to standard POMDP solution techniques for both active estimation and obfuscation. Simulations illustrate that optimising the smoother entropy leads to superior trajectory estimation and obfuscation compared to alternative approaches.
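To make the smoother entropy concrete, the following is a minimal brute-force sketch of the conditional entropy of a state trajectory given the measurements. It is not the belief-state formulation developed in the paper. It assumes a simplified setting chosen only for illustration: no controls (equivalently, a fixed control), a finite state and measurement alphabet, and a measurement at every time step including the first. The function name `smoother_entropy` and its arguments are hypothetical.

```python
import itertools
import math


def smoother_entropy(prior, trans, obs, horizon):
    """Brute-force H(X_{0:T} | Y_{0:T}) in nats for an uncontrolled hidden Markov chain.

    Assumed (illustrative) inputs:
      prior[i]    = P(X_0 = i)
      trans[i][j] = P(X_{k+1} = j | X_k = i)
      obs[i][y]   = P(Y_k = y | X_k = i)
      horizon     = T, so trajectories have T + 1 states
    """
    n_states = len(prior)
    n_obs = len(obs[0])
    joint_entropy = 0.0  # accumulates H(X_{0:T}, Y_{0:T})
    obs_prob = {}        # accumulates p(y_{0:T}) for each measurement sequence

    for xs in itertools.product(range(n_states), repeat=horizon + 1):
        # Probability of the state trajectory xs.
        p_x = prior[xs[0]]
        for a, b in zip(xs, xs[1:]):
            p_x *= trans[a][b]
        if p_x == 0.0:
            continue
        for ys in itertools.product(range(n_obs), repeat=horizon + 1):
            # Joint probability of the trajectory and the measurement sequence.
            p = p_x
            for x, y in zip(xs, ys):
                p *= obs[x][y]
            if p > 0.0:
                joint_entropy -= p * math.log(p)
                obs_prob[ys] = obs_prob.get(ys, 0.0) + p

    obs_entropy = -sum(p * math.log(p) for p in obs_prob.values())
    # Chain rule: H(X | Y) = H(X, Y) - H(Y).
    return joint_entropy - obs_entropy


# Two-state chain with uniform prior and uniform transitions.
prior = [0.5, 0.5]
trans = [[0.5, 0.5], [0.5, 0.5]]
perfect = [[1.0, 0.0], [0.0, 1.0]]      # measurements reveal the state
uninformative = [[0.5, 0.5], [0.5, 0.5]]  # measurements carry no information

h_perfect = smoother_entropy(prior, trans, perfect, horizon=2)
h_blind = smoother_entropy(prior, trans, uninformative, horizon=2)
```

In this toy example the two extremes behave as expected: with perfectly informative measurements the smoother entropy is approximately zero, and with uninformative measurements it equals the entropy of the state trajectory itself, here 3 ln 2 nats. Enumeration scales exponentially in the horizon, which is the tractability concern that the belief-state reformulation in the abstract is meant to address.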