Reinforcement learning must often contend with the exponential growth of the state and action spaces when seeking optimal control in high-dimensional problems (the curse of dimensionality). In this work, we address this issue by learning the inherent structure of action-wise similar MDPs, so as to balance performance degradation against sample and computational complexity. In particular, we partition the action space into multiple groups based on similarity in transition distribution and reward function, and build a linear decomposition model to capture the intra-group differences in transition kernels and rewards. Both our theoretical analysis and our experiments reveal a surprising and counter-intuitive result: while a more refined grouping strategy reduces the approximation error caused by treating actions in the same group as identical, it also increases the estimation error when the number of samples or the computational resources are limited. This finding highlights the grouping strategy as a new degree of freedom that can be optimized to minimize the overall performance loss. To this end, we formulate a general optimization problem for determining the optimal grouping strategy, which strikes a balance between performance loss and sample/computational complexity. We further propose a computationally efficient method for selecting a nearly optimal grouping strategy, whose computational complexity is independent of the size of the action space.
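To make the grouping idea concrete, the following is a minimal sketch of partitioning actions by similarity in transition distribution and reward. It is an illustration under stated assumptions, not the method of the paper: it considers a single state, measures transition similarity by total variation distance, and uses a simple greedy rule. The names `total_variation`, `group_actions`, `eps_p`, and `eps_r` are hypothetical and do not come from the abstract.

```python
from typing import List, Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two next-state distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def group_actions(
    transitions: Sequence[Sequence[float]],
    rewards: Sequence[float],
    eps_p: float,
    eps_r: float,
) -> List[List[int]]:
    """Greedily partition actions into groups (assumed rule, for illustration).

    An action joins the first group whose representative (its first member)
    is within eps_p in total variation distance and within eps_r in reward;
    otherwise it starts a new group.
    """
    groups: List[List[int]] = []
    for a in range(len(transitions)):
        for g in groups:
            rep = g[0]
            close_p = total_variation(transitions[a], transitions[rep]) <= eps_p
            close_r = abs(rewards[a] - rewards[rep]) <= eps_r
            if close_p and close_r:
                g.append(a)
                break
        else:
            # No existing group is similar enough: open a new one.
            groups.append([a])
    return groups


# Toy data for one state: four actions, two possible next states.
transitions = [[0.9, 0.1], [0.85, 0.15], [0.2, 0.8], [0.25, 0.75]]
rewards = [1.0, 1.05, 0.0, 0.05]

coarse = group_actions(transitions, rewards, eps_p=0.1, eps_r=0.1)
fine = group_actions(transitions, rewards, eps_p=0.0, eps_r=0.0)
print(coarse)  # two groups of two actions each
print(fine)    # every action in its own group
```

Tightening the tolerances yields a finer partition (more groups, hence more quantities to estimate from the same data), while loosening them yields a coarser partition with larger within-group differences. This mirrors the trade-off between approximation error and estimation error described in the abstract.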