This paper studies a long-term resource allocation problem over multiple periods, where each period requires a multi-stage decision-making process. We formulate it as an online resource allocation problem in an episodic finite-horizon Markov decision process with unknown non-stationary transitions and with stochastic non-stationary reward and resource consumption functions in each episode. We provide an equivalent online linear programming reformulation based on occupancy measures, for which we develop an online dual mirror descent algorithm. The algorithm handles uncertainty and errors in estimating the true feasible set, a feature that is of independent interest. We prove that, under stochastic reward and resource consumption functions, the expected regret of the algorithm is bounded by $O(\rho^{-1} H^{3/2} S\sqrt{AT})$, where $\rho\in(0,1)$ is the budget parameter, $H$ is the length of the horizon, $S$ and $A$ are the numbers of states and actions, and $T$ is the number of episodes.
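To illustrate the dual-descent idea behind the algorithm, the following is a minimal sketch for a much simpler setting than the one studied here: a single resource, no state transitions, and reward and consumption values that are revealed before each decision. It is not the paper's algorithm, which operates on occupancy measures with an estimated feasible set. The function name `dual_mirror_descent`, the step size `eta`, and the stopping rule are assumptions made for illustration, and the mirror map is taken to be Euclidean, so the update reduces to projected gradient descent on the dual price.

```python
def dual_mirror_descent(rewards, costs, rho, eta):
    """Simplified single-resource online allocation via dual descent.

    rewards[t][a], costs[t][a]: reward and resource use of action a in round t.
    rho: per-round budget, so the total budget is rho * T.
    eta: step size for the dual variable (hypothetical tuning parameter).
    """
    T = len(rewards)
    budget = rho * T
    lam = 0.0           # dual price of the resource
    total_reward = 0.0
    spent = 0.0
    for t in range(T):
        n_actions = len(rewards[t])
        # Choose the action with the best price-adjusted reward.
        a = max(range(n_actions),
                key=lambda i: rewards[t][i] - lam * costs[t][i])
        # Stop as soon as the chosen action would exceed the budget.
        if spent + costs[t][a] > budget:
            break
        total_reward += rewards[t][a]
        spent += costs[t][a]
        # Dual gradient is (rho - consumption); take a descent step and
        # project onto lam >= 0 (Euclidean mirror map).
        lam = max(0.0, lam - eta * (rho - costs[t][a]))
    return total_reward, spent, lam


# Toy instance: action 0 does nothing, action 1 earns 1 and consumes 1.
rewards = [[0.0, 1.0] for _ in range(100)]
costs = [[0.0, 1.0] for _ in range(100)]
total, spent, lam = dual_mirror_descent(rewards, costs, rho=0.5, eta=0.1)
```

In this toy instance the price `lam` rises while the consuming action is over-used relative to `rho` and falls otherwise, which is the mechanism that keeps cumulative consumption close to the budget.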