Humans performing tasks that involve a series of dependent actions over time often learn from experience by reflecting on specific cases and points in time where different actions could have led to significantly better outcomes. While recent machine learning methods for retrospectively analyzing sequential decision-making processes promise to aid decision makers in identifying such cases, they have focused on environments with finitely many discrete states. However, in many practical applications, the state of the environment is inherently continuous. In this paper, we aim to fill this gap. We start by formally characterizing a sequence of discrete actions and continuous states using finite-horizon Markov decision processes and a broad class of bijective structural causal models. Building upon this characterization, we formalize the problem of finding counterfactually optimal action sequences and show that, in general, we cannot expect to solve it in polynomial time. Then, we develop a search method based on the $A^*$ algorithm that, under a natural form of Lipschitz continuity of the environment's dynamics, is guaranteed to return the optimal solution to the problem. Experiments on real clinical data show that our method is very efficient in practice and has the potential to offer interesting insights for sequential decision-making tasks.
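To make the counterfactual setup concrete, the following is a minimal Python sketch and not the method developed in the paper. It assumes an additive-noise structural causal model (one simple bijective case), a one-dimensional state, hypothetical `transition` and `reward` functions, and an assumed budget `k` on how many actions may differ from the observed ones. It enumerates candidate action sequences exhaustively instead of using $A^*$, so its cost grows exponentially with the horizon.

```python
from itertools import product

ACTIONS = (0, 1)  # hypothetical discrete action set


def transition(state, action):
    """Hypothetical deterministic part of the dynamics (an assumption)."""
    return 0.9 * state + (1.0 if action == 1 else -0.5)


def reward(state, action):
    """Hypothetical reward: prefer states near 0, small cost for action 1."""
    return -abs(state) - 0.1 * action


def infer_noise(states, actions):
    """Abduction step: with s_{t+1} = transition(s_t, a_t) + u_t, the noise
    u_t is recovered exactly from the observed trajectory."""
    return [states[t + 1] - transition(states[t], actions[t])
            for t in range(len(actions))]


def counterfactual_rollout(s0, actions, noise):
    """Replay the episode under alternative actions, keeping the noise fixed."""
    states = [s0]
    for a, u in zip(actions, noise):
        states.append(transition(states[-1], a) + u)
    return states


def total_reward(states, actions):
    """Sum of rewards over (s_t, a_t); the final state has no action."""
    return sum(reward(s, a) for s, a in zip(states, actions))


def best_counterfactual(states, actions, k):
    """Exhaustive search over action sequences that differ from the observed
    one in at most k positions (k is an assumed budget)."""
    noise = infer_noise(states, actions)
    best_actions = tuple(actions)
    best_value = total_reward(states, actions)
    for cand in product(ACTIONS, repeat=len(actions)):
        if sum(c != a for c, a in zip(cand, actions)) > k:
            continue
        cf_states = counterfactual_rollout(states[0], cand, noise)
        value = total_reward(cf_states, cand)
        if value > best_value:
            best_actions, best_value = cand, value
    return best_actions, best_value
```

Calling `best_counterfactual(states, actions, k)` on an observed episode returns the best alternative action sequence within the budget together with its counterfactual total reward. With `k = 0` it returns the observed actions unchanged.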