In average reward Markov decision processes, state-of-the-art algorithms for regret minimization follow a well-established framework: they are model-based, optimistic, and episodic. First, they maintain a confidence region from which optimistic policies are computed using a well-known subroutine called Extended Value Iteration (EVI). Second, these policies are used over time windows called episodes, each of which is ended by the Doubling Trick (DT) rule or a variant thereof. In this work, without modifying EVI, we show that there is a significant advantage in replacing (DT) with another simple rule, which we call the Vanishing Multiplicative (VM) rule. When episodes are managed with (VM), the algorithm's regret is, both in theory and in practice, as good as, if not better than, with (DT), while the one-shot behavior is greatly improved. More specifically, bad episodes (during which sub-optimal policies are used) are handled much better under (VM) than under (DT): the regret of exploration becomes logarithmic rather than linear. These results are made possible by a new, in-depth understanding of the contrasting behaviors of confidence regions during good and bad episodes.
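To make the episodic framework concrete, the sketch below outlines the generic optimistic loop described above: rebuild the confidence region, call an EVI-style planner for an optimistic policy, and play that policy until a stopping rule ends the episode. The `env` and `evi` interfaces, the state-action count arrays, and the particular vanishing schedule in `vm_stop` are illustrative assumptions; only the (DT) test (stop once some state-action pair has been visited within the episode as often as in all past episodes) follows the standard UCRL2-style criterion, and the actual (VM) rule of the paper is not reproduced here.

```python
import numpy as np

def dt_stop(nu, N):
    # Doubling Trick (DT): end the episode once some state-action pair has been
    # visited within the current episode as often as in all previous episodes.
    return np.any(nu >= np.maximum(1, N))

def vm_stop(nu, N, k):
    # Illustrative "vanishing multiplicative" style test: the within-episode
    # budget is a fraction of past counts that shrinks with the episode index k.
    # The schedule N / (k + 1) is an assumption, not the paper's (VM) rule.
    return np.any(nu >= np.maximum(1, N / (k + 1)))

def episodic_loop(env, evi, n_states, n_actions, horizon, stop_rule=dt_stop):
    """Generic model-based optimistic episodic loop.

    `env` (with reset()/step(a)) and `evi` (mapping counts to an optimistic
    policy) are assumed interfaces standing in for the environment and the
    Extended Value Iteration subroutine."""
    N = np.zeros((n_states, n_actions))      # total visit counts
    s = env.reset()
    t, k = 0, 0
    while t < horizon:
        k += 1
        nu = np.zeros((n_states, n_actions)) # within-episode visit counts
        policy = evi(N)                      # optimistic policy from the confidence region
        # Play the fixed policy until the episode-stopping rule fires.
        while t < horizon and not (stop_rule(nu, N) if stop_rule is dt_stop
                                   else stop_rule(nu, N, k)):
            a = policy[s]
            s, _reward = env.step(a)
            nu[s, a] += 1
            t += 1
        N += nu                              # fold episode counts into the totals
```

Under this skeleton, swapping (DT) for a (VM)-style rule only changes the stopping test, not EVI itself, which is the point the abstract emphasizes.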