We propose an offline multi-dimensional distributional reinforcement learning framework (KE-DRL) that leverages Hilbert space mappings to estimate the kernel mean embedding of the multi-dimensional value distribution under a proposed target policy. In our setting, the state-action variables are multi-dimensional and continuous. By mapping probability measures into a reproducing kernel Hilbert space via kernel mean embeddings, our method replaces Wasserstein metrics with an integral probability metric. This enables efficient estimation with multi-dimensional state-action spaces and multi-dimensional rewards, where direct computation of Wasserstein distances is computationally challenging. Theoretically, we establish contraction of the distributional Bellman operator under the proposed metric for the Matérn family of kernels and provide uniform convergence guarantees. Simulations and empirical results demonstrate robust off-policy evaluation and recovery of the kernel mean embedding under mild assumptions, namely Lipschitz continuity and boundedness of the kernels, highlighting the potential of embedding-based approaches for complex real-world decision-making and risk evaluation.
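For context, the two standard objects underpinning this construction are the kernel mean embedding and the integral probability metric it induces, the maximum mean discrepancy (MMD). A minimal sketch in generic notation (the symbols $k$, $\mathcal{H}_k$, and $\mu_P$ are ours, not necessarily the paper's):
\[
\mu_P := \mathbb{E}_{X \sim P}\bigl[k(X,\cdot)\bigr] \in \mathcal{H}_k,
\qquad
\mathrm{MMD}_k(P,Q) = \bigl\lVert \mu_P - \mu_Q \bigr\rVert_{\mathcal{H}_k}
= \sup_{\lVert f \rVert_{\mathcal{H}_k} \le 1} \bigl|\mathbb{E}_P[f(X)] - \mathbb{E}_Q[f(X)]\bigr|.
\]
Whereas the Wasserstein-1 distance is the integral probability metric over the 1-Lipschitz ball and is costly to compute between multi-dimensional distributions, the MMD between empirical measures reduces to closed-form kernel evaluations, which is what makes the substitution attractive.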