Diffusion-based models have achieved notable empirical successes in reinforcement learning (RL) due to their expressiveness in modeling complex distributions. While these methods are promising, the key challenge in extending them to broader real-world applications lies in their computational cost at inference time: sampling from a diffusion model is considerably slow, often requiring tens to hundreds of iterations to generate a single sample. To circumvent this issue, we propose to leverage the flexibility of diffusion models for RL from a representation learning perspective. In particular, by exploiting the connection between diffusion models and energy-based models, we develop Diffusion Spectral Representation (Diff-SR), a coherent algorithmic framework for extracting sufficient representations of value functions in Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). We further demonstrate how Diff-SR enables efficient policy optimization and yields practical algorithms while explicitly bypassing the difficulty and inference cost of sampling from the diffusion model. Finally, we provide comprehensive empirical studies verifying that Diff-SR delivers robust and advantageous performance across various benchmarks in both fully and partially observable settings.
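For concreteness, the connection between diffusion models and energy-based models invoked above can be sketched as follows (standard background; the notation here is illustrative and not taken from the paper's derivations). A diffusion model trained by denoising score matching estimates the score of the noised data distribution, while an energy-based model defines a density through an energy function:
\[
s_\theta(x_t, t) \;\approx\; \nabla_{x_t} \log p_t(x_t),
\qquad
p(x) \propto \exp(-E(x)) \;\Rightarrow\; \nabla_x \log p(x) = -\nabla_x E(x).
\]
Identifying the two gives an implicit energy $E_\theta$ with $\nabla_{x_t} E_\theta(x_t, t) = -s_\theta(x_t, t)$, so a trained diffusion model can be read as an unnormalized energy-based model, and quantities derived from the energy, such as representations, become accessible without running the slow reverse sampling chain.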