This paper studies a long-term resource allocation problem over multiple periods, where each period requires a multi-stage decision-making process. We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process with an unknown non-stationary transition function and stochastic non-stationary reward and resource consumption functions. We propose the observe-then-decide regime and improve the existing decide-then-observe regime; the two settings differ in how observations and feedback about the reward and resource consumption functions are given to the decision-maker. We develop an online dual mirror descent algorithm that achieves near-optimal regret bounds in both settings. For the observe-then-decide regime, we prove that the expected regret against the dynamic clairvoyant optimal policy is bounded by $\tilde O(\rho^{-1}H^{3/2}S\sqrt{AT})$, where $\rho\in(0,1)$ is the budget parameter, $H$ is the length of the horizon, $S$ and $A$ are the numbers of states and actions, and $T$ is the number of episodes. For the decide-then-observe regime, we show that the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by $\tilde O(\rho^{-1}H^{3/2}S\sqrt{AT})$ with high probability. We test the numerical efficiency of our method on a variant of the resource-constrained inventory management problem.
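To give a rough feel for the dual mirror descent idea named in the abstract, the following is a minimal sketch under strong simplifying assumptions, not the paper's algorithm. It collapses each episode to a single decision (no states, no transition estimation), uses a Euclidean mirror map so the dual update reduces to projected subgradient descent, and assumes a zero-reward, zero-consumption null action is always available. The function name `dual_mirror_descent`, the step size `eta`, and the input layout are all hypothetical choices made for illustration.

```python
def dual_mirror_descent(rounds, rho, eta):
    """Simplified single-stage online allocation with one resource.

    rounds: list of rounds; each round is a list of (reward, consumption)
            pairs, one pair per available action (illustrative layout).
    rho:    per-round budget, so the total budget is rho * len(rounds).
    eta:    dual step size (an assumed tuning parameter).
    """
    budget = rho * len(rounds)
    lam = 0.0                      # dual variable: current price of the resource
    total_reward, spent = 0.0, 0.0
    for options in rounds:
        # Observe-then-decide: this round's rewards and consumptions are
        # seen before acting, so we maximize the price-adjusted reward.
        reward, cost = max(options, key=lambda rc: rc[0] - lam * rc[1])
        if spent + cost > budget:
            # Assumed null action: take nothing once the budget would be exceeded.
            reward, cost = 0.0, 0.0
        total_reward += reward
        spent += cost
        # Dual subgradient is (rho - cost); with a Euclidean mirror map the
        # mirror step is a gradient step projected onto lam >= 0.
        lam = max(0.0, lam - eta * (rho - cost))
    return total_reward, spent, lam
```

In this sketch the dual variable rises when a round consumes more than the per-round budget `rho` and falls otherwise, which steers consumption toward the budget rate over time. The constrained-MDP setting in the abstract additionally requires planning over `H` stages and handling the unknown transition function, which this sketch omits entirely.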