We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt{N}$-rate regret even for policy classes whose complexity exceeds the Donsker regime, provided that a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor, governed by the complexity of the policy class, and an environment nuisance factor, governed by the complexity of the environment dynamics, which makes explicit how one may be traded against the other.
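
The trade-off between the two factors can be illustrated with simple rate arithmetic. This is only a sketch under assumptions: the symbols $\varepsilon_\pi$ and $\varepsilon_{\mathrm{env}}$ and the exponents $\alpha, \beta \ge 0$ are hypothetical notation introduced here, not taken from the paper, and the remainder is assumed to be exactly the product of the two factors.

```latex
% Hypothetical notation (not from the paper):
%   \varepsilon_\pi(N)            plug-in policy error factor
%   \varepsilon_{\mathrm{env}}(N) environment nuisance factor
\[
  \varepsilon_\pi(N) = O\bigl(N^{-\alpha}\bigr), \qquad
  \varepsilon_{\mathrm{env}}(N) = O\bigl(N^{-\beta}\bigr)
  \;\Longrightarrow\;
  \varepsilon_\pi(N)\,\varepsilon_{\mathrm{env}}(N) = O\bigl(N^{-(\alpha+\beta)}\bigr),
\]
% The product-of-errors condition then reduces to a constraint on the exponents.
\[
  \varepsilon_\pi(N)\,\varepsilon_{\mathrm{env}}(N) = O\bigl(N^{-1/2}\bigr)
  \quad\text{whenever}\quad \alpha + \beta \ge \tfrac{1}{2}.
\]
```

Under these assumptions, a slowly converging policy factor (say $\alpha = 1/8$) could be offset by a faster environment factor ($\beta = 3/8$), and the symmetric choice $\alpha = \beta = 1/4$ would also satisfy the condition.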