We consider the Inverse Optimal Stopping (IOS) problem, in which one aims to recover the optimal stopping region from stopped expert trajectories by approximating the continuation and stopping gain functions. The uniqueness of the stopping region makes IOS suitable for real-world applications with safety concerns. While current state-of-the-art inverse reinforcement learning methods recover both a Q-function and the corresponding optimal policy, they fail to account for the specific challenges posed by optimal stopping problems: data sparsity near the stopping region, the non-Markovian nature of the continuation gain, the proper treatment of boundary conditions, the need for a stable offline approach in risk-sensitive applications, and the lack of a quality evaluation metric. We address these challenges with the proposed Dynamics-Aware Offline Inverse Q-Learning for Optimal Stopping (DO-IQS), which incorporates temporal information by approximating the cumulative continuation gain together with the world dynamics and the Q-function, without querying the environment. Moreover, we propose a confidence-based oversampling approach to mitigate the data sparsity problem. We demonstrate the performance of our models on real and artificial data, including an optimal intervention for critical events problem.