We propose a novel $K$-nearest neighbor resampling procedure for estimating the performance of a policy from historical data containing realized episodes of a decision process generated under a different policy. We focus on feedback policies that depend deterministically on the current state in environments with continuous state-action spaces and system-inherent stochasticity effected by chosen actions. Such settings are common in a wide range of high-stakes applications and are actively investigated in the context of stochastic control. Our procedure exploits the fact that similar state-action pairs (in a metric sense) are associated with similar rewards and state transitions. This enables our resampling procedure to tackle the counterfactual estimation problem underlying off-policy evaluation (OPE) by simulating trajectories, much as Monte Carlo methods do. Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics. These properties make the proposed resampling algorithm particularly useful for stochastic control environments. We prove that our method is statistically consistent in estimating the performance of a policy in the OPE setting under weak assumptions and for data sets containing entire episodes rather than independent transitions. To establish consistency, we generalize Stone's Theorem, a well-known result in nonparametric statistics on local averaging, to include episodic data and the counterfactual estimation underlying OPE. Numerical experiments demonstrate the effectiveness of the algorithm in a variety of stochastic control settings, including a linear quadratic regulator, trade execution in limit order books, and online stochastic bin packing.
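The resampling idea described above can be illustrated with a minimal sketch. It is one plausible reading of the abstract, not the authors' implementation: the function name `knn_resample_value`, the tuple-based data layout, the Euclidean metric, the pooling of transitions across time steps, and the uniform draw among the $K$ neighbors are all assumptions made for illustration. A brute-force neighbor search stands in for the tree-based search mentioned above so that the example needs only the standard library.

```python
import heapq
import math
import random


def knn_resample_value(transitions, policy, initial_states, horizon, k,
                       n_rollouts, seed=0):
    """Estimate the expected total reward of `policy` from logged transitions.

    transitions: list of (state, action, reward, next_state) tuples, where
    states and actions are tuples of floats (hypothetical layout).
    """
    rng = random.Random(seed)
    # Index every logged transition by its concatenated state-action point.
    points = [s + a for s, a, _, _ in transitions]
    total = 0.0
    for _ in range(n_rollouts):
        state = rng.choice(initial_states)
        for _ in range(horizon):
            # Action the target policy would take in the current state.
            query = state + policy(state)
            # Brute-force K-nearest-neighbor search in Euclidean distance;
            # a tree-based index would replace this step for large data sets.
            nearest = heapq.nsmallest(
                k, range(len(points)),
                key=lambda i: math.dist(points[i], query))
            # Resample one neighbor and reuse its reward and next state.
            _, _, reward, state = transitions[rng.choice(nearest)]
            total += reward
    return total / n_rollouts


if __name__ == "__main__":
    # Toy 1-D system (assumed for illustration): s' = s + a + noise,
    # reward = -(s^2 + a^2), logged under a noisy behavior policy.
    rng = random.Random(1)
    logged, starts = [], []
    for _ in range(40):                      # 40 logged episodes of length 5
        s = (rng.gauss(0.0, 1.0),)
        starts.append(s)
        for _ in range(5):
            a = (-0.5 * s[0] + rng.gauss(0.0, 0.3),)
            nxt = (s[0] + a[0] + rng.gauss(0.0, 0.1),)
            logged.append((s, a, -(s[0] ** 2 + a[0] ** 2), nxt))
            s = nxt
    target = lambda s: (-0.8 * s[0],)        # deterministic feedback policy
    print(knn_resample_value(logged, target, starts,
                             horizon=5, k=5, n_rollouts=200))
```

Each simulated trajectory is independent of the others, so the outer loop over rollouts is the natural place to parallelize, and no model fitting or optimization step appears anywhere in the sketch.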