Estimating optimal dynamic policies from offline data is a fundamental problem in dynamic decision making. In causal inference, the problem is known as estimating the optimal dynamic treatment regime. Although a plethora of estimation methods exist, constructing confidence intervals for the value of the optimal regime and for the structural parameters associated with it is inherently harder, because it involves non-linear and non-differentiable functionals of unknown quantities that must themselves be estimated. Prior work resorted to sub-sample approaches that can deteriorate the quality of the estimate. We show that a simple softmax approximation to the optimal treatment regime, with an appropriately fast-growing temperature parameter, can achieve valid inference on the truly optimal regime. We illustrate our result for a two-period optimal dynamic regime, though our approach should extend directly to the finite-horizon case. Our work combines techniques from semi-parametric inference and $g$-estimation with an appropriate triangular-array central limit theorem and a novel analysis of the asymptotic influence and asymptotic bias of softmax approximations.
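To make the softmax idea concrete, the following is a minimal sketch of the generic smoothing step only, not the estimator analyzed in this work. It replaces the non-differentiable maximum over treatment effects with a softmax-weighted average, which approaches the true maximum as the temperature parameter grows. The names `softmax_weights`, `soft_max_value`, `beta`, and the numbers in `q_values` are assumptions introduced for illustration.

```python
import math


def softmax_weights(values, beta):
    """Weights proportional to exp(beta * v), computed in a numerically stable way."""
    m = max(values)  # subtract the max so the largest exponent is exactly 0
    exps = [math.exp(beta * (v - m)) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_max_value(values, beta):
    """Softmax-weighted average of `values`; a smooth surrogate for max(values)."""
    weights = softmax_weights(values, beta)
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical effects of two candidate treatments (illustrative numbers only).
q_values = [0.3, 0.5]

# As beta grows, the smooth surrogate moves from the plain average toward the max.
for beta in (0.0, 1.0, 10.0, 100.0):
    print(f"beta={beta:6.1f}  soft value={soft_max_value(q_values, beta):.6f}")
```

At `beta = 0` the surrogate is the plain average of the effects, and for large `beta` it concentrates on the best treatment. The gap to the true maximum is the approximation bias that a growing temperature has to drive down fast enough for inference to target the truly optimal regime.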