Stochastic optimal control provides a unified mathematical framework for solving complex decision-making problems, encompassing paradigms such as maximum-entropy reinforcement learning (RL) and imitation learning (IL). However, conventional parametric policies often struggle to represent the multi-modality of the solutions. Diffusion-based policies aim to recover this multi-modality, but they lack an explicit probability density, which complicates policy-gradient optimization. To bridge this gap, we propose MePoly, a novel policy parameterization based on polynomial energy-based models. MePoly provides an explicit, tractable probability density, enabling exact entropy maximization. Theoretically, we ground our method in the classical moment problem, leveraging the universal approximation capability of polynomial energies for arbitrary distributions. Empirically, we demonstrate that MePoly effectively captures complex non-convex manifolds and outperforms baselines across diverse benchmarks.
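To make the core idea concrete, here is a minimal hypothetical sketch of a one-dimensional polynomial energy-based density. The coefficients, grid-based normalization, and entropy computation below are our own illustrative choices, not the paper's actual MePoly parameterization: a polynomial energy $E(a)$ induces a density $p(a) \propto \exp(-E(a))$ whose normalizing constant and entropy are tractable, and a quartic energy with two minima yields a bimodal (multi-modal) density.

```python
import numpy as np

# Hypothetical polynomial energy E(a) = a^4 - 4a^2 (coefficients in
# ascending order for polyval). Its two minima at a = ±sqrt(2) make
# the induced density p(a) ∝ exp(-E(a)) bimodal.
coeffs = np.array([0.0, 0.0, -4.0, 0.0, 1.0])

grid = np.linspace(-3.0, 3.0, 2001)
da = grid[1] - grid[0]

energy = np.polynomial.polynomial.polyval(grid, coeffs)
unnorm = np.exp(-energy)

# Explicit, tractable density: normalize numerically on the grid.
Z = unnorm.sum() * da
density = unnorm / Z

# Differential entropy H = -∫ p(a) log p(a) da, evaluated numerically;
# an explicit density makes this directly computable.
entropy = -(density * np.log(density + 1e-300)).sum() * da
```

Because the density is available in closed form up to a one-dimensional quadrature, quantities needed for entropy-regularized objectives (normalizer, entropy, log-likelihood) are exact up to discretization error, unlike a diffusion-based sampler.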