Exogenous MDPs (Exo-MDPs) capture sequential decision-making in which uncertainty comes solely from exogenous inputs that evolve independently of the learner's actions. This structure is especially common in operations research applications such as inventory control, energy storage, and resource allocation, where exogenous randomness (e.g., demand, arrivals, or prices) drives system behavior. Despite decades of empirical evidence that greedy, exploitation-only methods work remarkably well in these settings, theory has lagged behind: all existing regret guarantees for Exo-MDPs rely on explicit exploration or tabular assumptions. We show that exploration is unnecessary. We propose Pure Exploitation Learning (PEL) and prove the first general finite-sample regret bounds for exploitation-only algorithms in Exo-MDPs. In the tabular case, PEL achieves regret $\widetilde{O}(H^2|\Xi|\sqrt{K})$, where $H$ is the horizon, $|\Xi|$ the size of the exogenous state space, and $K$ the number of episodes. For large, continuous endogenous state spaces, we introduce LSVI-PE, a simple linear-approximation method whose regret is polynomial in the feature dimension, exogenous state space size, and horizon, independent of the endogenous state and action spaces. Our analysis introduces two new tools, counterfactual trajectories and Bellman-closed feature transport, which together allow greedy policies to obtain accurate value estimates without optimism. Experiments on synthetic and resource-management tasks show that PEL consistently outperforms baselines. Overall, our results overturn the conventional wisdom that exploration is required, demonstrating that in Exo-MDPs, pure exploitation is enough.
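To make the counterfactual-trajectory idea concrete, here is a minimal, hedged sketch (not the paper's algorithm): in an Exo-MDP the next endogenous state is a function $f(s, a, \xi)$ of the current state, the action, and an exogenous input $\xi$ that evolves independently of actions. One observed exogenous trace $(\xi_1, \dots, \xi_H)$ therefore lets a greedy learner replay that same trace under *any* action sequence, evaluating policies it never executed. All names below (`f`, `reward`, `evaluate`) and the toy inventory dynamics are illustrative assumptions.

```python
def f(s, a, xi):
    # Toy inventory dynamics: on-hand stock s, order quantity a,
    # exogenous demand xi; leftover stock carries over.
    return max(s + a - xi, 0)

def reward(s, a, xi):
    # Revenue for met demand minus a holding cost (illustrative numbers).
    return 2.0 * min(s + a, xi) - 0.5 * max(s + a - xi, 0)

def evaluate(actions, s0, xi_trace):
    """Counterfactual return of an action sequence on one exogenous trace.

    Because xi_trace does not depend on the actions taken, the same
    observed trace can score any candidate plan -- no exploration needed.
    """
    s, total = s0, 0.0
    for a, xi in zip(actions, xi_trace):
        total += reward(s, a, xi)
        s = f(s, a, xi)
    return total

# One observed demand trace suffices to compare plans the learner
# never executed:
xi_trace = [3, 1, 4]
plan_a = [3, 1, 4]   # order to match demand
plan_b = [0, 0, 0]   # order nothing
print(evaluate(plan_a, 0, xi_trace) > evaluate(plan_b, 0, xi_trace))  # True
```

The key property exploited here is that `xi_trace` is action-independent; in a standard MDP, replaying a logged trajectory under a different action sequence would be invalid because the observed randomness could depend on the actions taken.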