Soft actor-critic is a successful successor to soft Q-learning. Although both methods are built on the maximum entropy framework, their relationship remains unclear. In this paper, we prove that in the limit they converge to the same solution. This result is appealing because it recasts an arduous optimization as an easier one. The same justification also applies to other regularizers, such as the KL divergence.