Long-run average optimization problems for Markov decision processes (MDPs) require constructing policies with optimal steady-state behavior, i.e., optimal limit frequencies of visits to the states. However, such policies may suffer from local instability, i.e., the frequency of visits to the states within a bounded time horizon along a run may differ significantly from the limit frequency. In this work, we propose an efficient algorithmic solution to this problem.
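To make the notion of local instability concrete, the following minimal sketch simulates a two-state Markov chain, which can be read as an MDP under a fixed memoryless policy. It is an illustration of the phenomenon only, not the algorithm proposed in this work. The transition probabilities, the window length, and all function names are assumptions chosen for the example.

```python
import random


def stationary_two_state(a, b):
    """Limit frequencies of a two-state chain with P(0->1)=a and P(1->0)=b."""
    return (b / (a + b), a / (a + b))


def simulate(a, b, steps, rng):
    """Sample a run of the chain, starting in state 0."""
    state, run = 0, []
    for _ in range(steps):
        run.append(state)
        switch_prob = a if state == 0 else b
        if rng.random() < switch_prob:
            state = 1 - state
    return run


def max_window_deviation(run, window, limit_freq0):
    """Largest gap between the frequency of state 0 in any sliding
    window of the given length and its limit frequency."""
    zeros = sum(1 for s in run[:window] if s == 0)
    worst = abs(zeros / window - limit_freq0)
    for i in range(window, len(run)):
        # Slide the window one step: add the new state, drop the oldest.
        zeros += (run[i] == 0) - (run[i - window] == 0)
        worst = max(worst, abs(zeros / window - limit_freq0))
    return worst


# Hypothetical "sticky" chain: each state is left with probability 0.01.
rng = random.Random(0)
pi = stationary_two_state(0.01, 0.01)
run = simulate(0.01, 0.01, 10_000, rng)
worst = max_window_deviation(run, window=20, limit_freq0=pi[0])
print("limit frequency of state 0:", pi[0])
print("frequency over the whole run:", run.count(0) / len(run))
print("worst deviation over windows of 20 steps:", worst)
```

Both states have limit frequency 0.5 here, yet the chain tends to stay in one state for long stretches, so a short window can consist almost entirely of a single state. This gap between windowed and limit frequencies is what the abstract calls local instability.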