We introduce a novel cyclic Markov decision process (MDP) framework for multi-step decision problems with heterogeneous stage-specific dynamics, transitions, and discount factors across the cycle. In this setting, offline learning is challenging: optimizing the policy at any stage shifts the state distributions of subsequent stages, propagating distribution mismatch around the cycle. To address this, we propose a modular structural framework that decomposes the cyclic process into stage-wise sub-problems. While the framework is generally applicable, we instantiate it as CycleFQI, an extension of fitted Q-iteration that admits theoretical analysis and interpretation. CycleFQI maintains a vector of stage-specific Q-functions that capture both within-stage sequences and transitions between stages. This modular design enables partial control: some stages can be optimized while others follow predefined policies. We establish finite-sample suboptimality error bounds and derive global convergence rates under Besov regularity, showing that CycleFQI mitigates the curse of dimensionality relative to monolithic baselines. We further propose a sieve-based method for asymptotic inference of optimal policy values under a margin condition. Experiments on simulated and real-world Type 1 diabetes datasets demonstrate CycleFQI's effectiveness.
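For concreteness, the following is a minimal sketch of the stage-wise regression update that such a decomposition suggests; the notation ($M$ stages indexed mod $M$, stage-specific discount $\gamma_m$, offline transition set $\mathcal{D}_m$, and function class $\mathcal{F}_m$) is illustrative and not fixed by the abstract, and within-stage time steps are abstracted away:
\[
\hat{Q}_m^{(k+1)} \;\in\; \arg\min_{f \in \mathcal{F}_m} \sum_{(s,a,r,s') \in \mathcal{D}_m} \Big( f(s,a) \;-\; r \;-\; \gamma_m \max_{a'} \hat{Q}_{(m \bmod M)+1}^{(k)}(s',a') \Big)^{2}.
\]
Under partial control, a next stage that follows a predefined policy $\pi_{m+1}$ would replace the maximum with evaluation at $a' = \pi_{m+1}(s')$.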