Structured reinforcement learning leverages policies with advantageous properties to reach better performance, particularly in scenarios where exploration is challenging. We explore this field through the concept of orchestration, in which a (small) set of expert policies guides decision-making; modeling this setting is our first contribution. We then establish value-function regret bounds for orchestration in the tabular setting by transferring regret-bound results from adversarial settings. We generalize and extend the analysis of natural policy gradient in Agarwal et al. [2021, Section 5.3] to arbitrary adversarial aggregation strategies. We also extend it to the case of estimated advantage functions, providing insights into sample complexity both in expectation and with high probability. A key strength of our approach is that its proofs are arguably more transparent than those of existing methods. Finally, we present simulations for a stochastic matching toy model.
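To make the orchestration idea concrete, the following is a minimal sketch, not the paper's algorithm: in each state, a weight vector over the expert policies is updated with an exponential-weights (Hedge) rule, one example of an adversarial aggregation strategy, and the orchestrated policy is the resulting mixture of the experts' action distributions. The function names, the array shapes, the learning rate `eta`, and the fixed `gains` array are all assumptions made for illustration; in the setting described above, the gains would instead come from (estimated) advantage functions of the current policy.

```python
import numpy as np

def hedge_update(weights, gains, eta):
    """One exponential-weights step per state.

    weights: (S, K) array, each row a distribution over K experts.
    gains:   (S, K) array of per-state, per-expert gains (hypothetical input).
    eta:     learning rate (assumed constant here).
    """
    logits = np.log(weights) + eta * gains
    logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
    new_weights = np.exp(logits)
    return new_weights / new_weights.sum(axis=1, keepdims=True)

def orchestrated_policy(weights, expert_policies):
    """Mix expert action distributions state by state.

    expert_policies: (K, S, A) array; expert_policies[k, s] is the action
    distribution of expert k in state s. Returns an (S, A) array.
    """
    return np.einsum("sk,ksa->sa", weights, expert_policies)

# Toy instance (illustrative values only): 2 states, 2 experts, 2 actions.
# Expert 0 always plays action 0; expert 1 always plays action 1.
expert_policies = np.array([
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
])
# Placeholder gains: expert 0 is better in state 0, expert 1 in state 1.
gains = np.array([[1.0, 0.0], [0.0, 1.0]])

weights = np.full((2, 2), 0.5)  # start from uniform weights in every state
for _ in range(10):
    weights = hedge_update(weights, gains, eta=0.5)

policy = orchestrated_policy(weights, expert_policies)
```

In this toy instance the weights concentrate on the better expert in each state, so the mixed policy drifts toward that expert's action. Keeping a separate weight vector per state is one possible design choice for the tabular case; other aggregation rules could be swapped in for `hedge_update` without changing the mixing step.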