We study a class of structured Markov Decision Processes (MDPs) known as Exo-MDPs, characterized by a partition of the state space into two components. The exogenous states evolve stochastically in a manner unaffected by the agent's actions, whereas the endogenous states are affected by the actions and evolve in a deterministic, known way conditional on the exogenous states. Exo-MDPs are a natural model for various applications, including inventory control, finance, power systems, and ride sharing. Despite this seemingly restrictive structure, we establish that any discrete MDP can be represented as an Exo-MDP. Further, Exo-MDPs induce a natural representation of the transition and reward dynamics as linear functions of the exogenous state distribution. This linear representation leads to near-optimal algorithms with regret guarantees scaling only with the (effective) size $d$ of the exogenous state space, independent of the sizes of the endogenous state and action spaces. Specifically, when the exogenous state is fully observed, a simple plug-in approach achieves a regret upper bound of $O(H^{3/2}\sqrt{dK})$, where $H$ denotes the horizon and $K$ denotes the total number of episodes. When the exogenous state is unobserved, the linear representation leads to a regret upper bound of $O(H^{3/2}d\sqrt{K})$. We also establish a nearly matching regret lower bound of $\Omega(Hd\sqrt{K})$ for the no-observation regime. We complement our theoretical findings with an experimental study on inventory control problems.