Our work addresses a fundamental problem in counterfactual inference for Markov Decision Processes (MDPs). Given an MDP path $\tau$, counterfactual inference lets us derive counterfactual paths $\tau'$, that is, what-if versions of $\tau$ obtained under action sequences different from those observed in $\tau$. However, as the counterfactual states and actions deviate from the observed ones over time, the observation $\tau$ may no longer influence the counterfactual world. The analysis is then no longer tailored to the individual observation and yields interventional outcomes rather than counterfactual ones. Although this issue specifically affects the popular Gumbel-max structural causal model used for MDP counterfactuals, it has been overlooked until now. In this work, we introduce a formal characterisation of influence based on comparing counterfactual and interventional distributions. We devise an algorithm to construct counterfactual models that automatically satisfy influence constraints. Leveraging such models, we derive counterfactual policies that are not just optimal for a given reward structure but also remain tailored to the observed path. While there is an unavoidable trade-off between policy optimality and the strength of influence constraints, our experiments demonstrate that it is possible to derive (near-)optimal policies while remaining under the influence of the observation.
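To make the idea of comparing counterfactual and interventional distributions concrete, the following is a minimal sketch for a single categorical transition under a Gumbel-max model. It is an illustration under stated assumptions, not the construction from this work: the function names (`posterior_gumbels`, `counterfactual_dist`, `total_variation`), the use of rejection sampling for the noise posterior, the choice of total variation distance as the comparison, and the example probabilities are all hypothetical.

```python
import numpy as np

def posterior_gumbels(p_obs, observed, n, rng):
    """Rejection-sample Gumbel noise vectors consistent with the observation.

    A noise vector g is kept only if argmax(log p_obs + g) equals the
    outcome that was actually observed.
    """
    log_p = np.log(np.asarray(p_obs, dtype=float))
    accepted = []
    while len(accepted) < n:
        g = rng.gumbel(size=(4 * n, len(log_p)))
        keep = np.argmax(log_p + g, axis=1) == observed
        accepted.extend(g[keep])
    return np.array(accepted[:n])

def counterfactual_dist(p_obs, observed, p_cf, n=20000, seed=0):
    """Monte Carlo estimate of the counterfactual next-state distribution.

    p_obs: transition probabilities of the observed state-action pair.
    p_cf:  transition probabilities of the counterfactual state-action pair
           (this is also the interventional distribution).
    """
    rng = np.random.default_rng(seed)
    g = posterior_gumbels(p_obs, observed, n, rng)
    log_q = np.log(np.asarray(p_cf, dtype=float))
    outcomes = np.argmax(log_q + g, axis=1)
    return np.bincount(outcomes, minlength=len(log_q)) / n

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)).sum()

# Hypothetical example values: three next states, state 0 was observed.
p_obs = np.array([0.7, 0.2, 0.1])
p_cf = np.array([0.3, 0.3, 0.4])
cf = counterfactual_dist(p_obs, observed=0, p_cf=p_cf)
influence = total_variation(cf, p_cf)
print("counterfactual:", cf)
print("interventional:", p_cf)
print("TV distance:", influence)
```

In this sketch, a distance near zero would indicate that the observation no longer shapes the counterfactual outcome, so the counterfactual collapses to the interventional distribution; a larger distance indicates that the observation still carries influence. Whether and how this single-step quantity relates to the formal influence constraints is not specified here.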