We consider the Inverse Optimal Stopping (IOS) problem in which, given stopped expert trajectories, one aims to recover the optimal stopping region by approximating the continuation and stopping gain functions. The uniqueness of the stopping region makes IOS suitable for real-world applications with safety concerns. Although current state-of-the-art inverse reinforcement learning methods recover both a Q-function and the corresponding optimal policy, they fail to account for the specific challenges posed by optimal stopping problems. These include data sparsity near the stopping region, the non-Markovian nature of the continuation gain, the proper treatment of boundary conditions, the need for a stable offline approach in risk-sensitive applications, and the lack of a quality evaluation metric. We address these challenges with the proposed Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping (DO-IQS), which incorporates temporal information by approximating the cumulative continuation gain together with the world dynamics and the Q-function, without querying the environment. In addition, we propose a confidence-based oversampling approach to treat the data sparsity problem. We demonstrate the performance of our models on real and artificial data, including an optimal intervention for critical events problem.