This note clarifies some confusions (and perhaps raises more) around model-based reinforcement learning and its theoretical understanding in the context of deep RL. The main topics of discussion are (1) how to reconcile model-based RL's poor empirical reputation for error compounding with its superior theoretical properties, and (2) the limitations of empirically popular losses. For the latter, concrete counterexamples for the "MuZero loss" are constructed to show that it not only fails in stochastic environments, but also suffers exponential sample complexity in deterministic environments even when the data provides sufficient coverage.