In average-reward Markov decision processes, state-of-the-art algorithms for regret minimization follow a well-established framework: they are model-based, optimistic, and episodic. First, they maintain a confidence region from which optimistic policies are computed via a well-known subroutine called Extended Value Iteration (EVI). Second, these policies are used over time windows called episodes, each ended by the Doubling Trick (DT) rule or a variant thereof. In this work, without modifying EVI, we show that there is a significant advantage in replacing (DT) with another simple rule, which we call the Vanishing Multiplicative (VM) rule. When episodes are managed with (VM), the algorithm's regret is, both in theory and in practice, at least as good as with (DT), while its one-shot behavior is greatly improved. More specifically, (VM) handles bad episodes (those during which sub-optimal policies are used) much better than (DT), making the regret of exploration logarithmic rather than linear. These results rest on a new, in-depth understanding of the contrasting behaviors of confidence regions during good and bad episodes.
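To make the contrast concrete, here is a minimal sketch of the two episode-stopping rules. The (DT) criterion is the standard one from the literature (end the episode once some state-action pair's within-episode visit count matches its pre-episode count, i.e., the count doubles). The abstract does not spell out the (VM) criterion, so the vanishing multiplicative threshold `f_k` below (with the `alpha / sqrt(k)` schedule and the function names themselves) is a hypothetical illustration of the idea, not the paper's exact rule.

```python
import numpy as np

def doubling_trick_stop(nu, N):
    """(DT): end the episode as soon as the within-episode visit count
    nu(s, a) of some state-action pair reaches the count N(s, a)
    accumulated before the episode began (total count doubles)."""
    return bool(np.any(nu >= np.maximum(N, 1)))

def vanishing_multiplicative_stop(nu, N, k, alpha=0.5):
    """(VM), hypothetical sketch: rather than waiting for a full
    doubling, episode k ends once some count grows by a multiplicative
    factor f_k that vanishes as the episode index k grows.
    The schedule f_k = alpha / sqrt(k) is an assumption for
    illustration; the paper's actual schedule may differ."""
    f_k = alpha / np.sqrt(k)
    return bool(np.any(nu >= f_k * np.maximum(N, 1)))
```

With this shape of rule, later episodes under (VM) are ended by ever-smaller relative increases in visit counts, so a sub-optimal policy is abandoned sooner than under (DT), which always waits for a full doubling.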